UPDATE of partition key

Started by Amit Khandekar · almost 9 years ago · 254 messages
#1Amit Khandekar
amitdkhan.pg@gmail.com
2 attachment(s)

Currently, an update of a partition key of a partition is not allowed,
since it requires moving the row(s) into the applicable partition.

Attached is a WIP patch (update-partition-key.patch) that removes this
restriction. When an UPDATE causes a row of a partition to violate
its partition constraint, a partition that can accommodate this row
is searched for in that subtree; if one is found, the row is deleted
from the old partition and inserted into the new partition. If none is
found, an error is reported.

There are a few things that can be discussed:

1. We can run an UPDATE using a child partition at any level in a
nested partition tree. In such a case, we should move the row only
within that child's subtree.

For example, in a tree such as:
tab ->
t1 ->
t1_1
t1_2
t2 ->
t2_1
t2_2

For "UPDATE t2 SET col1 = 'AAA'", if the modified tuple does not fit
in t2_1 but can fit in t1_1, it should not be moved to t1_1, because
the UPDATE was fired using t2.
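This restriction can be sketched with a hypothetical schema matching the
tree above (the table names follow the tree; the column names and range
bounds are made up for illustration):

```sql
-- two-level tree: tab -> (t1 -> t1_1, t1_2), (t2 -> t2_1, t2_2)
CREATE TABLE tab (col1 text, col2 int) PARTITION BY RANGE (col1);
CREATE TABLE t1 PARTITION OF tab FOR VALUES FROM ('A') TO ('M')
    PARTITION BY RANGE (col2);
CREATE TABLE t1_1 PARTITION OF t1 FOR VALUES FROM (1) TO (100);
CREATE TABLE t1_2 PARTITION OF t1 FOR VALUES FROM (100) TO (200);
CREATE TABLE t2 PARTITION OF tab FOR VALUES FROM ('M') TO ('Z')
    PARTITION BY RANGE (col2);
CREATE TABLE t2_1 PARTITION OF t2 FOR VALUES FROM (1) TO (100);
CREATE TABLE t2_2 PARTITION OF t2 FOR VALUES FROM (100) TO (200);

INSERT INTO t2 VALUES ('MMM', 10);

-- Fired on t2, the row may only move within t2's subtree, so this
-- violates t2's partition constraint and errors out:
UPDATE t2 SET col1 = 'AAA';

-- Fired on the root, the same change is free to move the row into
-- t1's subtree (here, into t1_1):
UPDATE tab SET col1 = 'AAA';
```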

2. In the patch, as part of the row movement, ExecDelete() is called
followed by ExecInsert(). This is done so that the ROW triggers on
that (sub)partition are executed. If a user has explicitly created
DELETE and INSERT BR triggers for this partition, I think we should
run those. At the same time, another question is: what about the
UPDATE trigger on the same table? Here again, one can argue that
because this UPDATE has been transformed into a DELETE-INSERT, we
should not run the UPDATE trigger for row movement. But there is a
counter-argument: for example, if a user needs to make sure updates of
particular columns of a row are logged, he will expect the logging to
happen even when that row is transparently moved. In the patch, I have
retained the firing of the UPDATE BR trigger.
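For instance, with audit triggers like the following on the partitions
involved (hypothetical trigger names; log_change() is assumed to exist),
a row moving from t2_1 to t2_2 would, under the patch, fire the BR
UPDATE trigger on t2_1, then the BR DELETE trigger on t2_1 and the BR
INSERT trigger on t2_2:

```sql
-- hypothetical audit triggers on the source and target partitions
CREATE TRIGGER t2_1_upd BEFORE UPDATE ON t2_1
    FOR EACH ROW EXECUTE PROCEDURE log_change();
CREATE TRIGGER t2_1_del BEFORE DELETE ON t2_1
    FOR EACH ROW EXECUTE PROCEDURE log_change();
CREATE TRIGGER t2_2_ins BEFORE INSERT ON t2_2
    FOR EACH ROW EXECUTE PROCEDURE log_change();
```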

3. In case of a concurrent update/delete, suppose session A has locked
the row for deleting it. Now session B decides to update this row, and
that update is going to cause row movement, which means it will delete
the row first. But when session A finishes deleting it, session B
finds that it is already deleted. In that case, session B should not
go ahead with inserting a new row as part of the row movement. For
that, I have added a new parameter 'already_deleted' to ExecDelete().

Of course, this still won't completely solve the concurrency anomaly.
In the above case, the UPDATE of session B gets lost. Maybe, for a
user who cannot tolerate this, we can have a table-level option that
disallows row movement, or causes an error to be thrown for one of the
concurrent sessions.
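The interleaving in question looks like this (illustrative transcript;
ptab is any partitioned table on which the update would cause row
movement):

```sql
-- session A
BEGIN;
DELETE FROM ptab WHERE b = 10;          -- locks the row

-- session B: the row movement would delete here, then re-insert into
-- the target partition; it blocks on A's row lock
UPDATE ptab SET a = '1949-01-01' WHERE b = 10;

-- session A
COMMIT;

-- session B now finds the row already deleted; with the patch's
-- already_deleted flag it skips the insert, so the UPDATE reports
-- 0 rows instead of resurrecting the row in the new partition
```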

4. ExecSetupPartitionTupleRouting() is re-used for routing the row
that is to be moved. So in ExecInitModifyTable(), we call
ExecSetupPartitionTupleRouting() even for UPDATE. We could instead do
this at execution time, the first time we find that we need to do a
row movement. I will think over that, but I suspect it might
complicate things compared to always doing the setup for UPDATE. Will
check on that.

5. Regarding performance testing, I have compared the results of
row movement with declarative partitions versus row movement with an
inheritance tree using triggers. Below are the details:

Schema :

CREATE TABLE ptab (a date, b int, c int);  -- plain table, used for the comparison below

CREATE TABLE ptab (a date, b int, c int) PARTITION BY RANGE (a, b);

CREATE TABLE ptab_1_1 PARTITION OF ptab
for values from ('1900-01-01', 1) to ('1900-01-01', 101)
PARTITION BY range (c);

CREATE TABLE ptab_1_1_1 PARTITION OF ptab_1_1
for values from (1) to (51);
CREATE TABLE ptab_1_1_2 PARTITION OF ptab_1_1
for values from (51) to (101);
.....
.....
CREATE TABLE ptab_1_1_n PARTITION OF ptab_1_1
for values from (n) to (n+m);

......
......

CREATE TABLE ptab_5_n PARTITION OF ptab
for values from ('1905-01-01', 101) to ('1905-01-01', 201)
PARTITION BY range (c);

CREATE TABLE ptab_1_2_1 PARTITION OF ptab_1_2
for values from (1) to (51);
CREATE TABLE ptab_1_2_2 PARTITION OF ptab_1_2
for values from (51) to (101);
.....
.....
CREATE TABLE ptab_1_2_n PARTITION OF ptab_1_2
for values from (n) to (n+m);
.....
.....

Similarly for inheritance:

CREATE TABLE ptab_1_1
(constraint check_ptab_1_1 check (a = '1900-01-01' and b >= 1 and b <
8)) inherits (ptab);
create trigger brutrig_ptab_1_1 before update on ptab_1_1 for each row
execute procedure ptab_upd_trig();
CREATE TABLE ptab_1_1_1
(constraint check_ptab_1_1_1 check (c >= 1 and c < 51))
inherits (ptab_1_1);
create trigger brutrig_ptab_1_1_1 before update on ptab_1_1_1 for each
row execute procedure ptab_upd_trig();
CREATE TABLE ptab_1_1_2
(constraint check_ptab_1_1_2 check (c >= 51 and c < 101))
inherits (ptab_1_1);

create trigger brutrig_ptab_1_1_2 before update on ptab_1_1_2 for each
row execute procedure ptab_upd_trig();

I had to have a BR UPDATE trigger on each of the leaf tables.

Attached is the BR trigger function update_trigger.sql. There it
generates the table name assuming a fixed pattern of distribution of
data over the partitions. It first deletes the row and then inserts a
new one. I also skipped the deletion part, and it did not show any
significant change in results.
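The attachment itself is not reproduced here, but such a BR UPDATE
trigger function would look roughly like this. The leaf-name
computation below is a guess at the fixed distribution pattern (50
values of c per leaf); the real update_trigger.sql may differ:

```sql
CREATE OR REPLACE FUNCTION ptab_upd_trig() RETURNS trigger AS $$
DECLARE
    target text;
BEGIN
    -- hypothetical pattern: leaves of ptab_1_1 each hold 50 values of c
    target := format('ptab_1_1_%s', (NEW.c - 1) / 50 + 1);
    -- delete the old row from the current leaf ...
    EXECUTE format('DELETE FROM %I WHERE a = $1 AND b = $2 AND c = $3',
                   TG_TABLE_NAME) USING OLD.a, OLD.b, OLD.c;
    -- ... and insert the new version into the computed leaf
    EXECUTE format('INSERT INTO %I VALUES ($1, $2, $3)', target)
        USING NEW.a, NEW.b, NEW.c;
    RETURN NULL;  -- suppress the original UPDATE; the row was moved by hand
END;
$$ LANGUAGE plpgsql;
```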

parts  partitioned   inheritance    no. of rows  subpartitions
=====  ============  =============  ===========  =============
  500  10 sec        3 min 02 sec     1,000,000              0
 1000  10 sec        3 min 05 sec     1,000,000              0
 1000  1 min 38 sec  30 min 50 sec   10,000,000              0
 4000  28 sec        5 min 41 sec     1,000,000             10

parts: total number of partitions, including subpartitions if any.
partitioned: partitions created using declarative syntax.
inheritance: partitions created using inheritance, check constraints
and insert/update triggers.
subpartitions: number of subpartitions for each partition (in a 2-level tree)

Overall, the UPDATE on declarative partitions is 10-20 times faster
than on the inheritance tree with triggers.

The UPDATE query moved all of the rows into another partition. It was
something like this:
update ptab set a = '1949-01-1' where a <= '1924-01-01'

For a plain table with 1,000,000 rows, the UPDATE took 8 seconds, and
with 10,000,000 rows, it took 1 min 32 sec.

In general, for both partitioned and inheritance tables, the time
taken rose linearly with the number of rows.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

update_trigger.sql (application/octet-stream)
update-partition-key.patch (application/octet-stream)
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index a666391..f9da3bd 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1737,7 +1737,7 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
  *
  * Note: This is called *iff* resultRelInfo is the main target table.
  */
-static bool
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 95e1589..273120a 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -624,6 +624,7 @@ ExecDelete(ItemPointer tupleid,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool   *already_deleted,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -632,6 +633,9 @@ ExecDelete(ItemPointer tupleid,
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
 
+	if (already_deleted)
+		*already_deleted = false;
+
 	/*
 	 * get information on the (current) result relation
 	 */
@@ -775,6 +779,8 @@ ldelete:;
 					}
 				}
 				/* tuple already deleted; nothing to do */
+				if (already_deleted)
+					*already_deleted = true;
 				return NULL;
 
 			default:
@@ -877,7 +883,8 @@ ldelete:;
  * ----------------------------------------------------------------
  */
 static TupleTableSlot *
-ExecUpdate(ItemPointer tupleid,
+ExecUpdate(ModifyTableState *mtstate,
+		   ItemPointer tupleid,
 		   HeapTuple oldtuple,
 		   TupleTableSlot *slot,
 		   TupleTableSlot *planSlot,
@@ -986,6 +993,27 @@ lreplace:;
 			ExecWithCheckOptions(WCO_RLS_UPDATE_CHECK,
 								 resultRelInfo, slot, estate);
 
+		if (resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate))
+		{
+			bool	already_deleted;
+
+			ExecDelete(tupleid, oldtuple, planSlot, epqstate, estate,
+					   &already_deleted, canSetTag);
+
+			if (already_deleted)
+				return NULL;
+			else
+			{
+				/*
+				 * Don't let estate.es_processed be updated again; ExecDelete()
+				 * has already done it above. So use canSetTag=false.
+				 */
+				return ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, false);
+			}
+		}
+
 		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
@@ -1312,7 +1340,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 	 */
 
 	/* Execute UPDATE with projection */
-	*returning = ExecUpdate(&tuple.t_self, NULL,
+	*returning = ExecUpdate(mtstate, &tuple.t_self, NULL,
 							mtstate->mt_conflproj, planSlot,
 							&mtstate->mt_epqstate, mtstate->ps.state,
 							canSetTag);
@@ -1582,12 +1610,12 @@ ExecModifyTable(ModifyTableState *node)
 								  estate, node->canSetTag);
 				break;
 			case CMD_UPDATE:
-				slot = ExecUpdate(tupleid, oldtuple, slot, planSlot,
+				slot = ExecUpdate(node, tupleid, oldtuple, slot, planSlot,
 								&node->mt_epqstate, estate, node->canSetTag);
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(tupleid, oldtuple, planSlot,
-								&node->mt_epqstate, estate, node->canSetTag);
+								&node->mt_epqstate, estate, NULL, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1727,7 +1755,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 
 	/* Build state for INSERT tuple routing */
 	rel = mtstate->resultRelInfo->ri_RelationDesc;
-	if (operation == CMD_INSERT &&
+	if ((operation == CMD_INSERT || operation == CMD_UPDATE) &&
 		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		PartitionDispatch *partition_dispatch_info;
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 02dbe7b..e9a2e07 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -224,6 +224,9 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate);
 
 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
 extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index a1e9255..7f27f51 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -209,13 +209,12 @@ create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to
 create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
 insert into part_a_1_a_10 values ('a', 1);
 insert into part_b_10_b_20 values ('b', 10);
--- fail
+-- fail (row movement happens only within the partition subtree)
 update part_a_1_a_10 set a = 'b' where a = 'a';
 ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
 DETAIL:  Failing row contains (b, 1).
+-- ok (row movement)
 update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
 -- ok
 update range_parted set b = b + 1 where b = 10;
 -- cleanup
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index d7721ed..92603e9 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -119,8 +119,9 @@ create table part_b_10_b_20 partition of range_parted for values from ('b', 10)
 insert into part_a_1_a_10 values ('a', 1);
 insert into part_b_10_b_20 values ('b', 10);
 
--- fail
+-- fail (row movement happens only within the partition subtree)
 update part_a_1_a_10 set a = 'b' where a = 'a';
+-- ok (row movement)
 update range_parted set b = b - 1 where b = 10;
 -- ok
 update range_parted set b = b + 1 where b = 10;
#2Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#1)
Re: UPDATE of partition key

On Mon, Feb 13, 2017 at 7:01 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

parts  partitioned   inheritance    no. of rows  subpartitions
=====  ============  =============  ===========  =============
  500  10 sec        3 min 02 sec     1,000,000              0
 1000  10 sec        3 min 05 sec     1,000,000              0
 1000  1 min 38 sec  30 min 50 sec   10,000,000              0
 4000  28 sec        5 min 41 sec     1,000,000             10

That's a big speedup.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3David Fetter
david@fetter.org
In reply to: Amit Khandekar (#1)
Re: UPDATE of partition key

On Mon, Feb 13, 2017 at 05:31:56PM +0530, Amit Khandekar wrote:

Currently, an update of a partition key of a partition is not
allowed, since it requires to move the row(s) into the applicable
partition.

Attached is a WIP patch (update-partition-key.patch) that removes
this restriction. When an UPDATE causes the row of a partition to
violate its partition constraint, then a partition is searched in
that subtree that can accommodate this row, and if found, the row is
deleted from the old partition and inserted in the new partition. If
not found, an error is reported.

This is great!

Would it be really invasive to HINT something when the subtree is a
proper subtree?

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


#4Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: David Fetter (#3)
Re: UPDATE of partition key

On 14 February 2017 at 22:24, David Fetter <david@fetter.org> wrote:

On Mon, Feb 13, 2017 at 05:31:56PM +0530, Amit Khandekar wrote:

Currently, an update of a partition key of a partition is not
allowed, since it requires to move the row(s) into the applicable
partition.

Attached is a WIP patch (update-partition-key.patch) that removes
this restriction. When an UPDATE causes the row of a partition to
violate its partition constraint, then a partition is searched in
that subtree that can accommodate this row, and if found, the row is
deleted from the old partition and inserted in the new partition. If
not found, an error is reported.

This is great!

Would it be really invasive to HINT something when the subtree is a
proper subtree?

I am not quite sure I understood this question. Can you please explain
it a bit more ...


#5David Fetter
david@fetter.org
In reply to: Amit Khandekar (#4)
Re: UPDATE of partition key

On Wed, Feb 15, 2017 at 01:06:32PM +0530, Amit Khandekar wrote:

On 14 February 2017 at 22:24, David Fetter <david@fetter.org> wrote:

On Mon, Feb 13, 2017 at 05:31:56PM +0530, Amit Khandekar wrote:

Currently, an update of a partition key of a partition is not
allowed, since it requires to move the row(s) into the applicable
partition.

Attached is a WIP patch (update-partition-key.patch) that removes
this restriction. When an UPDATE causes the row of a partition to
violate its partition constraint, then a partition is searched in
that subtree that can accommodate this row, and if found, the row
is deleted from the old partition and inserted in the new
partition. If not found, an error is reported.

This is great!

Would it be really invasive to HINT something when the subtree is
a proper subtree?

I am not quite sure I understood this question. Can you please
explain it a bit more ...

Sorry. When an UPDATE can't happen, there are often ways to hint at
what went wrong and how to correct it. Violating a uniqueness
constraint would be one example.

When an UPDATE can't happen and the depth of the subtree is a
plausible candidate for what prevents it, there might be a way to say
so.

Let's imagine a table called log with partitions on "stamp" log_YYYY
and subpartitions, also on "stamp", log_YYYYMM. If you do something
like

UPDATE log_2017 SET "stamp"='2016-11-08 23:03:00' WHERE ...

it's possible to know that it might have worked had the UPDATE taken
place on log rather than on log_2017.

Does that make sense, and if so, is it super invasive to HINT that?

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


#6Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: David Fetter (#5)
Re: UPDATE of partition key

On 15 February 2017 at 20:26, David Fetter <david@fetter.org> wrote:

When an UPDATE can't happen, there are often ways to hint at
what went wrong and how to correct it. Violating a uniqueness
constraint would be one example.

When an UPDATE can't happen and the depth of the subtree is a
plausible candidate for what prevents it, there might be a way to say
so.

Let's imagine a table called log with partitions on "stamp" log_YYYY
and subpartitions, also on "stamp", log_YYYYMM. If you do something
like

UPDATE log_2017 SET "stamp"='2016-11-08 23:03:00' WHERE ...

it's possible to know that it might have worked had the UPDATE taken
place on log rather than on log_2017.

Does that make sense, and if so, is it super invasive to HINT that?

Yeah, I think it should be possible to find the root partition with
the help of pg_partitioned_table, and then run ExecFindPartition()
again using the root. Will check. I am not sure right now how involved
that would turn out to be, but I think that logic would not change the
existing code, so in that sense it is not invasive.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


#7Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Khandekar (#6)
Re: UPDATE of partition key

On 2017/02/16 15:50, Amit Khandekar wrote:

On 15 February 2017 at 20:26, David Fetter <david@fetter.org> wrote:

When an UPDATE can't happen, there are often ways to hint at
what went wrong and how to correct it. Violating a uniqueness
constraint would be one example.

When an UPDATE can't happen and the depth of the subtree is a
plausible candidate for what prevents it, there might be a way to say
so.

Let's imagine a table called log with partitions on "stamp" log_YYYY
and subpartitions, also on "stamp", log_YYYYMM. If you do something
like

UPDATE log_2017 SET "stamp"='2016-11-08 23:03:00' WHERE ...

it's possible to know that it might have worked had the UPDATE taken
place on log rather than on log_2017.

Does that make sense, and if so, is it super invasive to HINT that?

Yeah, I think it should be possible to find the root partition with

I assume you mean root *partitioned* table.

the help of pg_partitioned_table,

The pg_partitioned_table catalog does not store parent-child
relationships, just information about the partition key of a table. To
get the root partitioned table, you might want to create a recursive
version of get_partition_parent(), maybe called
get_partition_root_parent(). By the way, get_partition_parent() scans
pg_inherits to find the inheritance parent.
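A lookup equivalent to the suggested get_partition_root_parent() can be
sketched in SQL as a recursive scan of pg_inherits (using David's
hypothetical log_2017 as the starting point; the C version would
instead loop over get_partition_parent()):

```sql
WITH RECURSIVE parents AS (
    SELECT inhparent
    FROM pg_inherits
    WHERE inhrelid = 'log_2017'::regclass
    UNION ALL
    SELECT i.inhparent
    FROM pg_inherits i
    JOIN parents p ON i.inhrelid = p.inhparent
)
-- the root is the ancestor that itself has no parent
SELECT inhparent::regclass AS root
FROM parents
WHERE NOT EXISTS (SELECT 1 FROM pg_inherits
                  WHERE inhrelid = parents.inhparent);
```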

and then run ExecFindPartition()
again using the root. Will check. I am not sure right now how involved
that would turn out to be, but I think that logic would not change the
existing code, so in that sense it is not invasive.

I couldn't understand why run ExecFindPartition() again on the root
partitioned table, can you clarify? ISTM, we just want to tell the user
in the HINT that trying the same update query with root partitioned table
might work. I'm not sure if it would work instead to find some
intermediate partitioned table (that is, between the root and the one that
update query was tried with) to include in the HINT.

Thanks,
Amit


#8Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Langote (#7)
Re: UPDATE of partition key

On 16 February 2017 at 12:57, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/02/16 15:50, Amit Khandekar wrote:

On 15 February 2017 at 20:26, David Fetter <david@fetter.org> wrote:

When an UPDATE can't happen, there are often ways to hint at
what went wrong and how to correct it. Violating a uniqueness
constraint would be one example.

When an UPDATE can't happen and the depth of the subtree is a
plausible candidate for what prevents it, there might be a way to say
so.

Let's imagine a table called log with partitions on "stamp" log_YYYY
and subpartitions, also on "stamp", log_YYYYMM. If you do something
like

UPDATE log_2017 SET "stamp"='2016-11-08 23:03:00' WHERE ...

it's possible to know that it might have worked had the UPDATE taken
place on log rather than on log_2017.

Does that make sense, and if so, is it super invasive to HINT that?

Yeah, I think it should be possible to find the root partition with

I assume you mean root *partitioned* table.

the help of pg_partitioned_table,

The pg_partitioned_table catalog does not store parent-child
relationships, just information about the partition key of a table. To
get the root partitioned table, you might want to create a recursive
version of get_partition_parent(), maybe called
get_partition_root_parent(). By the way, get_partition_parent() scans
pg_inherits to find the inheritance parent.

Yeah. But we also want to make sure that it's part of a declarative
partition tree, and not just an inheritance tree? I am not sure
whether it is currently possible to have a mix of these two. Maybe it
is easy to prevent that from happening.

and then run ExecFindPartition()
again using the root. Will check. I am not sure right now how involved
that would turn out to be, but I think that logic would not change the
existing code, so in that sense it is not invasive.

I couldn't understand why run ExecFindPartition() again on the root
partitioned table, can you clarify? ISTM, we just want to tell the user
in the HINT that trying the same update query with root partitioned table
might work. I'm not sure if it would work instead to find some
intermediate partitioned table (that is, between the root and the one that
update query was tried with) to include in the HINT.

What I had in mind was : Give that hint only if there *was* a
subpartition that could accommodate that row. And if found, we can
only include the subpartition name.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


#9Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Khandekar (#8)
Re: UPDATE of partition key

On 2017/02/16 17:55, Amit Khandekar wrote:

On 16 February 2017 at 12:57, Amit Langote wrote:

On 2017/02/16 15:50, Amit Khandekar wrote:

On 15 February 2017 at 20:26, David Fetter <david@fetter.org> wrote:

Does that make sense, and if so, is it super invasive to HINT that?

Yeah, I think it should be possible to find the root partition with

I assume you mean root *partitioned* table.

the help of pg_partitioned_table,

The pg_partitioned_table catalog does not store parent-child
relationships, just information about the partition key of a table. To
get the root partitioned table, you might want to create a recursive
version of get_partition_parent(), maybe called
get_partition_root_parent(). By the way, get_partition_parent() scans
pg_inherits to find the inheritance parent.

Yeah. But we also want to make sure that it's a part of declarative
partition tree, and not just an inheritance tree ? I am not sure
whether it is currently possible to have a mix of these two. May be it
is easy to prevent that from happening.

It is not possible to mix declarative partitioning and regular
inheritance. So, you cannot have a table in a declarative partitioning
tree that is not a (sub-) partition of the root table.
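For example (hypothetical names), trying to attach a regular
inheritance child to a partitioned table is rejected outright:

```sql
CREATE TABLE parted (a int) PARTITION BY RANGE (a);
-- fails: plain inheritance from a partitioned table is not allowed
CREATE TABLE child () INHERITS (parted);
```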

and then run ExecFindPartition()
again using the root. Will check. I am not sure right now how involved
that would turn out to be, but I think that logic would not change the
existing code, so in that sense it is not invasive.

I couldn't understand why run ExecFindPartition() again on the root
partitioned table, can you clarify? ISTM, we just want to tell the user
in the HINT that trying the same update query with root partitioned table
might work. I'm not sure if it would work instead to find some
intermediate partitioned table (that is, between the root and the one that
update query was tried with) to include in the HINT.

What I had in mind was : Give that hint only if there *was* a
subpartition that could accommodate that row. And if found, we can
only include the subpartition name.

Asking to try the update query with the root table sounds like a good
enough hint. Trying to find the exact sub-partition (I assume you mean
to imply sub-tree here) seems like overkill, IMHO.

Thanks,
Amit


#10Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Langote (#9)
Re: UPDATE of partition key

On 16 February 2017 at 14:42, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/02/16 17:55, Amit Khandekar wrote:

On 16 February 2017 at 12:57, Amit Langote wrote:

On 2017/02/16 15:50, Amit Khandekar wrote:

On 15 February 2017 at 20:26, David Fetter <david@fetter.org> wrote:

Does that make sense, and if so, is it super invasive to HINT that?

Yeah, I think it should be possible to find the root partition with

I assume you mean root *partitioned* table.

the help of pg_partitioned_table,

The pg_partitioned_table catalog does not store parent-child
relationships, just information about the partition key of a table. To
get the root partitioned table, you might want to create a recursive
version of get_partition_parent(), maybe called
get_partition_root_parent(). By the way, get_partition_parent() scans
pg_inherits to find the inheritance parent.

Yeah. But we also want to make sure that it's a part of declarative
partition tree, and not just an inheritance tree ? I am not sure
whether it is currently possible to have a mix of these two. May be it
is easy to prevent that from happening.

It is not possible to mix declarative partitioning and regular
inheritance. So, you cannot have a table in a declarative partitioning
tree that is not a (sub-) partition of the root table.

Ok, then that makes things easy.

and then run ExecFindPartition()
again using the root. Will check. I am not sure right now how involved
that would turn out to be, but I think that logic would not change the
existing code, so in that sense it is not invasive.

I couldn't understand why run ExecFindPartition() again on the root
partitioned table, can you clarify? ISTM, we just want to tell the user
in the HINT that trying the same update query with root partitioned table
might work. I'm not sure if it would work instead to find some
intermediate partitioned table (that is, between the root and the one that
update query was tried with) to include in the HINT.

What I had in mind was : Give that hint only if there *was* a
subpartition that could accommodate that row. And if found, we can
only include the subpartition name.

Asking to try the update query with the root table sounds like a good
enough hint. Trying to find the exact sub-partition (I assume you mean to
imply sub-tree here) seems like an overkill, IMHO.

Yeah ... I was thinking, anyway it's an error condition, so why not
let the server spend a bit more CPU and get the right sub-partition
for the message. If we decide to write code to find the root
partition, then it's just a matter of another ExecFindPartition()
call.

Also, I was thinking: give the hint *only* if we know there is a
right sub-partition. Otherwise, it might distract the user.

Thanks,
Amit

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


#11Greg Stark
stark@mit.edu
In reply to: Amit Khandekar (#1)
Re: UPDATE of partition key

On 13 February 2017 at 12:01, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

There are a few things that can be discussed about :

If you do a normal update, the new tuple is linked to the old one
using the ctid, forming a chain of tuple versions. This tuple movement
breaks that chain. So the question I had reading this proposal is:
what behaviour depends on the ctid, and how is it affected by the ctid
chain being broken?

I think the concurrent update case is just a symptom of this. If you
try to update a row that's locked for a concurrent update you normally
wait until the concurrent update finishes, then follow the ctid chain
and recheck the where clause on the target of the link and if it still
matches you perform the update there.

At least you do that if you have the isolation level set to
repeatable read. If you have it set to serializable then you just fail
with a serialization failure. I think that's what you should do if you
come across a row that's been updated with a broken ctid chain even in
repeatable read mode: just fail with a serialization failure, and
document that for partitioned tables, if you perform updates that move
tuples between partitions, you need to ensure your updates are
prepared for serialization failures.

I think this would require another bit in the tuple infomask
indicating that this tuple is the last version before a broken ctid
chain -- i.e. that it was updated by moving it to another partition.
Maybe there's some combination of bits you could use, though, since
this is only needed in a particular situation.

Offhand I don't know what other behaviours are dependent on the ctid
chain. I think you need to go search the docs -- and probably the code
just to be sure -- for any references to ctid to ensure you catch
every impact of breaking the ctid chain.

--
greg


#12David Fetter
david@fetter.org
In reply to: Amit Khandekar (#10)
Re: UPDATE of partition key

On Thu, Feb 16, 2017 at 03:39:30PM +0530, Amit Khandekar wrote:

and then run ExecFindPartition()
again using the root. Will check. I am not sure right now how involved
that would turn out to be, but I think that logic would not change the
existing code, so in that sense it is not invasive.

I couldn't understand why run ExecFindPartition() again on the root
partitioned table, can you clarify? ISTM, we just want to tell the user
in the HINT that trying the same update query with root partitioned table
might work. I'm not sure if it would work instead to find some
intermediate partitioned table (that is, between the root and the one that
update query was tried with) to include in the HINT.

What I had in mind was: give that hint only if there *was* a
subpartition that could accommodate that row. And if found, we can
include just the subpartition name.

Asking to try the update query with the root table sounds like a good
enough hint. Trying to find the exact sub-partition (I assume you mean to
imply sub-tree here) seems like an overkill, IMHO.

Yeah ... I was thinking, anyway it's an error condition, so why not
let the server spend a bit more CPU and get the right sub-partition
for the message. If we decide to write code to find the root
partition, then it's just a matter of another call to
ExecFindPartition().

Also, I was thinking : give the hint *only* if we know there is a
right sub-partition. Otherwise, it might distract the user.

If this is relatively straight-forward, it'd be great. More
actionable knowledge is better.

Thanks for taking this on.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


#13Robert Haas
robertmhaas@gmail.com
In reply to: Greg Stark (#11)
Re: UPDATE of partition key

On Thu, Feb 16, 2017 at 5:47 AM, Greg Stark <stark@mit.edu> wrote:

On 13 February 2017 at 12:01, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

There are a few things that can be discussed about :

If you do a normal update the new tuple is linked to the old one using
the ctid forming a chain of tuple versions. This tuple movement breaks
that chain. So the question I had reading this proposal is what
behaviour depends on ctid and how is it affected by the ctid chain
being broken.

I think this is a good question.

I think the concurrent update case is just a symptom of this. If you
try to update a row that's locked for a concurrent update you normally
wait until the concurrent update finishes, then follow the ctid chain
and recheck the where clause on the target of the link and if it still
matches you perform the update there.

Right. EvalPlanQual behavior, in short.

At least you do that if you have isolation_level set to
repeatable_read. If you have isolation level set to serializable then
you just fail with a serialization failure. I think that's what you
should do if you come across a row that's been updated with a broken
ctid chain even in repeatable read mode. Just fail with a
serialization failure and document that in partitioned tables if you
perform updates that move tuples between partitions then you need to
ensure your updates are prepared for serialization failures.

Now, this part I'm not sure about. What's pretty clear is that,
barring some redesign of the heap format, we can't keep the CTID chain
intact when the tuple moves to a different relfilenode. What's less
clear is what to do about that. We can either (1) give up on
EvalPlanQual behavior in this case and act just as we would if the row
had been deleted; no update happens or (2) throw a serialization
error. You're advocating for #2, but I'm not sure that's right,
because:

1. It's a lot more work,

2. Your proposed implementation needs an on-disk format change that
uses up a scarce infomask bit, and

3. It's not obvious to me that it's clearly preferable from a user
experience standpoint. I mean, either way the user doesn't get the
behavior that they want. Either they're hoping for EPQ semantics and
they instead do a no-op update, or they're hoping for EPQ semantics
and they instead get an ERROR. Generally speaking, we don't throw
serialization errors today at READ COMMITTED, so if we do so here,
that's going to be a noticeable and perhaps unwelcome change.

More opinions welcome.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#14Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#13)
Re: UPDATE of partition key

On 16 February 2017 at 20:53, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Feb 16, 2017 at 5:47 AM, Greg Stark <stark@mit.edu> wrote:

On 13 February 2017 at 12:01, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

There are a few things that can be discussed about :

If you do a normal update the new tuple is linked to the old one using
the ctid forming a chain of tuple versions. This tuple movement breaks
that chain. So the question I had reading this proposal is what
behaviour depends on ctid and how is it affected by the ctid chain
being broken.

I think this is a good question.

I think the concurrent update case is just a symptom of this. If you
try to update a row that's locked for a concurrent update you normally
wait until the concurrent update finishes, then follow the ctid chain
and recheck the where clause on the target of the link and if it still
matches you perform the update there.

Right. EvalPlanQual behavior, in short.

At least you do that if you have isolation_level set to
repeatable_read. If you have isolation level set to serializable then
you just fail with a serialization failure. I think that's what you
should do if you come across a row that's been updated with a broken
ctid chain even in repeatable read mode. Just fail with a
serialization failure and document that in partitioned tables if you
perform updates that move tuples between partitions then you need to
ensure your updates are prepared for serialization failures.

Now, this part I'm not sure about. What's pretty clear is that,
barring some redesign of the heap format, we can't keep the CTID chain
intact when the tuple moves to a different relfilenode. What's less
clear is what to do about that. We can either (1) give up on
EvalPlanQual behavior in this case and act just as we would if the row
had been deleted; no update happens.

This is what the current patch has done.

or (2) throw a serialization
error. You're advocating for #2, but I'm not sure that's right,
because:

1. It's a lot more work,

2. Your proposed implementation needs an on-disk format change that
uses up a scarce infomask bit, and

3. It's not obvious to me that it's clearly preferable from a user
experience standpoint. I mean, either way the user doesn't get the
behavior that they want. Either they're hoping for EPQ semantics and
they instead do a no-op update, or they're hoping for EPQ semantics
and they instead get an ERROR. Generally speaking, we don't throw
serialization errors today at READ COMMITTED, so if we do so here,
that's going to be a noticeable and perhaps unwelcome change.

More opinions welcome.

I am inclined to at least have some option for the user to decide the
behaviour. In the future we can even consider support for walking
through the ctid chain across multiple relfilenodes. But till then, we
need to decide what default behaviour to keep. My inclination is more
towards erroring out in the unfortunate event where there is an UPDATE
while the row-movement is happening. One option is to not get into
finding whether the DELETE was part of partition row-movement or it
was indeed a DELETE, and always error out the UPDATE when
heap_update() returns HeapTupleUpdated, but only if the table is a
leaf partition. But this obviously will cause annoyance because of
chances of getting such errors when there are concurrent updates and
deletes in the same partition. But we can keep a table-level option
for determining whether to error out or silently lose the UPDATE.

Another option I was thinking : When the UPDATE is on a partition key,
acquire ExclusiveLock (not AccessExclusiveLock) only on that
partition, so that the selects will continue to execute, but
UPDATE/DELETE will wait before opening the table for scan. The UPDATE
on partition key is not going to be a very routine operation, it
sounds more like a DBA maintenance operation; so it does not look like
it would come in between usual transactions.
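In lock-level terms, the idea above would amount to the partition-key UPDATE taking something like the following on the affected partition (hypothetical sketch only; the partition name is from the example tree earlier in the thread):

```sql
-- EXCLUSIVE conflicts with ROW EXCLUSIVE (taken by INSERT/UPDATE/DELETE)
-- but not with ACCESS SHARE (taken by SELECT), so concurrent reads
-- proceed while concurrent writers wait for the row movement to finish.
LOCK TABLE t2_1 IN EXCLUSIVE MODE;
```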


#15Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Robert Haas (#13)
Re: UPDATE of partition key

On Thu, Feb 16, 2017 at 8:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Generally speaking, we don't throw
serialization errors today at READ COMMITTED, so if we do so here,
that's going to be a noticeable and perhaps unwelcome change.

Yes we do:

https://www.postgresql.org/docs/9.6/static/transaction-iso.html#XACT-REPEATABLE-READ

--
Thomas Munro
http://www.enterprisedb.com


#16Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#15)
Re: UPDATE of partition key

On Mon, Feb 20, 2017 at 3:36 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Thu, Feb 16, 2017 at 8:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Generally speaking, we don't throw
serialization errors today at READ COMMITTED, so if we do so here,
that's going to be a noticeable and perhaps unwelcome change.

Yes we do:

https://www.postgresql.org/docs/9.6/static/transaction-iso.html#XACT-REPEATABLE-READ

Oops -- please ignore, I misread that as repeatable read.

--
Thomas Munro
http://www.enterprisedb.com


#17Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Khandekar (#1)
Re: UPDATE of partition key

Hi Amit,

Thanks for working on this.

On 2017/02/13 21:01, Amit Khandekar wrote:

Currently, an update of a partition key of a partition is not allowed,
since it requires to move the row(s) into the applicable partition.

Attached is a WIP patch (update-partition-key.patch) that removes this
restriction. When an UPDATE causes the row of a partition to violate
its partition constraint, then a partition is searched in that subtree
that can accommodate this row, and if found, the row is deleted from
the old partition and inserted in the new partition. If not found, an
error is reported.

That's clearly an improvement over what we have now.

There are a few things that can be discussed about :

1. We can run an UPDATE using a child partition at any level in a
nested partition tree. In such case, we should move the row only
within that child subtree.

For e.g. , in a tree such as :
tab ->
t1 ->
t1_1
t1_2
t2 ->
t2_1
t2_2

For "UPDATE t2 set col1 = 'AAA' " , if the modified tuple does not fit
in t2_1 but can fit in t1_1, it should not be moved to t1_1, because
the UPDATE is fired using t2.

Makes sense. One should perform the update by specifying tab such that
the row moves from t2 to t1, before we could determine t1_1 as the target
for the new row. Specifying t2 directly in that case is clearly the
"violates partition constraint" situation. I wonder if that's enough of a
hint for the user to try updating the parent (or better still, root
parent). Or as we were discussing, should there be an actual HINT message
spelling that out for the user.

2. In the patch, as part of the row movement, ExecDelete() is called
followed by ExecInsert(). This is done that way, because we want to
have the ROW triggers on that (sub)partition executed. If a user has
explicitly created DELETE and INSERT BR triggers for this partition, I
think we should run those. While at the same time, another question
is, what about UPDATE trigger on the same table ? Here again, one can
argue that because this UPDATE has been transformed into a
DELETE-INSERT, we should not run UPDATE trigger for row-movement. But
there can be a counter-argument. For e.g. if a user needs to make sure
about logging updates of particular columns of a row, he will expect
the logging to happen even when that row was transparently moved. In
the patch, I have retained the firing of UPDATE BR trigger.

What of UPDATE AR triggers?

As a comment on how row-movement is being handled in code, I wonder if it
could be be made to look similar structurally to the code in ExecInsert()
that handles ON CONFLICT DO UPDATE. That is,

if (partition constraint fails)
{
    /* row movement */
}
else
{
    /* ExecConstraints() */
    /* heap_update(), EvalPlanQual(), and ExecInsertIndexTuples() */
}

I see that ExecConstraints() won't get called on the source partition's
constraints if row movement occurs. Maybe, that's unnecessary because the
new row won't be inserted into that partition anyway.

ExecWithCheckOptions() for RLS update check does happen *before* row
movement though.

3. In case of a concurrent update/delete, suppose session A has locked
the row for deleting it. Now a session B has decided to update this
row and that is going to cause row movement, which means it will
delete it first. But when session A is finished deleting it, session B
finds that it is already deleted. In such case, it should not go ahead
with inserting a new row as part of the row movement. For that, I have
added a new parameter 'already_delete' for ExecDelete().

Makes sense. Maybe: already_delete -> concurrently_deleted.

Of course, this still won't completely solve the concurrency anomaly.
In the above case, the UPDATE of Session B gets lost. May be, for a
user that does not tolerate this, we can have a table-level option
that disallows row movement, or will cause an error to be thrown for
one of the concurrent session.

Will this table-level option be specified for a partitioned table once or
for individual partitions?

4. The ExecSetupPartitionTupleRouting() is re-used for routing the row
that is to be moved. So in ExecInitModifyTable(), we call
ExecSetupPartitionTupleRouting() even for UPDATE. We can also do this
only during execution time for the very first time we find that we
need to do a row movement. I will think over that, but I am thinking
it might complicate things, as compared to always doing the setup for
UPDATE. WIll check on that.

Hmm. ExecSetupPartitionTupleRouting(), which does a significant amount of
setup work, is fine being called in ExecInitModifyTable() in the insert
case because there are often cases where that's a bulk-insert and hence
cost of the setup work is amortized. Updates, OTOH, are seldom done in a
bulk manner. So that might be an argument for doing it late only when
needed. But that starts to sound less attractive when one realizes that
that will occur for every row that wants to move.

I wonder if updates that will require row movement when done will be done
in a bulk manner (as a maintenance op), so one-time tuple routing setup
seems fine. Again, enable_row_movement option specified for the parent
sounds like it would be a nice-to-have. Only do the setup if it's turned
on, which goes without saying.

5. Regarding performance testing, I have compared the results of
row-movement with partition versus row-movement with inheritance tree
using triggers. Below are the details :

Schema :

[ ... ]

parts   partitioned    inheritance     no. of rows   subpartitions
=====   ===========    =============   ===========   =============
  500   10 sec         3 min 02 sec      1,000,000         0
 1000   10 sec         3 min 05 sec      1,000,000         0
 1000   1 min 38 sec   30 min 50 sec    10,000,000         0
 4000   28 sec         5 min 41 sec      1,000,000        10

parts : total number of partitions, including subpartitions if any.
partitioned : partitions created using declarative syntax.
inheritance : partitions created using inheritance, check constraints
and insert/update triggers.
subpartitions : number of subpartitions for each partition (in a 2-level tree).

Overall the UPDATE in partitions is faster by 10-20 times compared
with inheritance with triggers.

The UPDATE query moved all of the rows into another partition. It was
something like this :
update ptab set a = '1949-01-1' where a <= '1924-01-01'

For a plain table with 1,000,000 rows, the UPDATE took 8 seconds, and
with 10,000,000 rows, it took 1min 32sec.

Nice!

In general, for both partitioned and inheritance tables, the time
taken linearly rose with the number of rows.

Hopefully not also with the number of partitions though.

I will look more closely at the code soon.

Thanks,
Amit


#18Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#14)
Re: UPDATE of partition key

On Mon, Feb 20, 2017 at 2:58 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

I am inclined to at least have some option for the user to decide the
behaviour. In the future we can even consider support for walking
through the ctid chain across multiple relfilenodes. But till then, we
need to decide what default behaviour to keep. My inclination is more
towards erroring out in the unfortunate event where there is an UPDATE
while the row-movement is happening. One option is to not get into
finding whether the DELETE was part of partition row-movement or it
was indeed a DELETE, and always error out the UPDATE when
heap_update() returns HeapTupleUpdated, but only if the table is a
leaf partition. But this obviously will cause annoyance because of
chances of getting such errors when there are concurrent updates and
deletes in the same partition. But we can keep a table-level option
for determining whether to error out or silently lose the UPDATE.

I'm still a fan of the "do nothing and just document that this is a
weirdness of partitioned tables" approach, because implementing
something will be complicated, will ensure that this misses this
release if not the next one, and may not be any better for users. But
probably we need to get some more opinions from other people, since I
can imagine people being pretty unhappy if the consensus happens to be
at odds with my own preferences.

Another option I was thinking : When the UPDATE is on a partition key,
acquire ExclusiveLock (not AccessExclusiveLock) only on that
partition, so that the selects will continue to execute, but
UPDATE/DELETE will wait before opening the table for scan. The UPDATE
on partition key is not going to be a very routine operation, it
sounds more like a DBA maintenance operation; so it does not look like
it would come in between usual transactions.

I think that's going to make users far more unhappy than breaking the
EPQ behavior ever would.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#19David G. Johnston
david.g.johnston@gmail.com
In reply to: Robert Haas (#18)
Re: UPDATE of partition key

On Friday, February 24, 2017, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Feb 20, 2017 at 2:58 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

I am inclined to at least have some option for the user to decide the
behaviour. In the future we can even consider support for walking
through the ctid chain across multiple relfilenodes. But till then, we
need to decide what default behaviour to keep. My inclination is more
towards erroring out in the unfortunate event where there is an UPDATE
while the row-movement is happening. One option is to not get into
finding whether the DELETE was part of partition row-movement or it
was indeed a DELETE, and always error out the UPDATE when
heap_update() returns HeapTupleUpdated, but only if the table is a
leaf partition. But this obviously will cause annoyance because of
chances of getting such errors when there are concurrent updates and
deletes in the same partition. But we can keep a table-level option
for determining whether to error out or silently lose the UPDATE.

I'm still a fan of the "do nothing and just document that this is a
weirdness of partitioned tables" approach, because implementing
something will be complicated, will ensure that this misses this
release if not the next one, and may not be any better for users. But
probably we need to get some more opinions from other people, since I
can imagine people being pretty unhappy if the consensus happens to be
at odds with my own preferences.

For my own sanity - the move update would complete successfully and break
every ctid chain that it touches. Any update lined up behind it in the
lock queue would discover their target record has been deleted and
would experience whatever behavior their isolation level dictates for such
a situation. So multi-partition update queries will fail to update some
records if they happen to move between partitions even if they would
otherwise match the query's predicate.

Is there any difference in behavior between this and a SQL writeable CTE
performing the same thing via delete-returning-insert?
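
For reference, the writable-CTE formulation in question would look something like this (using the ptab table from the benchmark query earlier in the thread; the second column b is assumed for illustration):

```sql
-- Move rows by hand: delete from the partitioned table and re-insert
-- with the new key value, letting tuple routing pick the target partition.
WITH moved AS (
    DELETE FROM ptab
    WHERE a <= DATE '1924-01-01'
    RETURNING b
)
INSERT INTO ptab
SELECT DATE '1949-01-01', b FROM moved;
```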

David J.

#20Robert Haas
robertmhaas@gmail.com
In reply to: David G. Johnston (#19)
Re: UPDATE of partition key

On Fri, Feb 24, 2017 at 1:18 PM, David G. Johnston
<david.g.johnston@gmail.com> wrote:

For my own sanity - the move update would complete successfully and break
every ctid chain that it touches. Any update lined up behind it in the lock
queue would discover their target record has been deleted and would
experience whatever behavior their isolation level dictates for such a
situation. So multi-partition update queries will fail to update some
records if they happen to move between partitions even if they would
otherwise match the query's predicate.

Right. That's the behavior for which I am advocating, on the grounds
that it's the simplest to implement and if we all agree on something
else more complicated later, we can do it then.

Is there any difference in behavior between this and a SQL writeable CTE
performing the same thing via delete-returning-insert?

Not to my knowledge.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#21Simon Riggs
simon@2ndquadrant.com
In reply to: Robert Haas (#18)
Re: UPDATE of partition key

On 24 February 2017 at 07:02, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Feb 20, 2017 at 2:58 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

I am inclined to at least have some option for the user to decide the
behaviour. In the future we can even consider support for walking
through the ctid chain across multiple relfilenodes. But till then, we
need to decide what default behaviour to keep. My inclination is more
towards erroring out in the unfortunate event where there is an UPDATE
while the row-movement is happening. One option is to not get into
finding whether the DELETE was part of partition row-movement or it
was indeed a DELETE, and always error out the UPDATE when
heap_update() returns HeapTupleUpdated, but only if the table is a
leaf partition. But this obviously will cause annoyance because of
chances of getting such errors when there are concurrent updates and
deletes in the same partition. But we can keep a table-level option
for determining whether to error out or silently lose the UPDATE.

I'm still a fan of the "do nothing and just document that this is a
weirdness of partitioned tables" approach, because implementing
something will be complicated, will ensure that this misses this
release if not the next one, and may not be any better for users. But
probably we need to get some more opinions from other people, since I
can imagine people being pretty unhappy if the consensus happens to be
at odds with my own preferences.

I'd give the view that we cannot silently ignore this issue, bearing
in mind the point that we're expecting partitioned tables to behave
exactly like normal tables.

In my understanding the issue is that UPDATEs will fail to update a
row when a valid row exists in the case where a row moved between
partitions; that behaviour will be different to a standard table.

It is of course very good that we have something ready for this
release and can make a choice of what to do.

Thoughts

1. Reuse the tuple state HEAP_MOVED_OFF, which IIRC represents almost
exactly the same thing. An UPDATE which gets to a
HEAP_MOVED_OFF tuple will know to re-find the tuple via the partition
metadata, or I might be persuaded that in-this-release it is
acceptable to fail when this occurs with an ERROR and a retryable
SQLCODE, since the UPDATE will succeed on next execution.

2. I know that DB2 handles this by having the user specify WITH ROW
MOVEMENT to explicitly indicate they accept the issue and want update
to work even with that. We could have an explicit option to allow
that. This appears to be the only way we could avoid silent errors for
foreign table partitions.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#22Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#21)
Re: UPDATE of partition key

On Fri, Feb 24, 2017 at 3:24 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I'd give the view that we cannot silently ignore this issue, bearing
in mind the point that we're expecting partitioned tables to behave
exactly like normal tables.

At the risk of repeating myself, I don't expect that, and I don't
think it's a reasonable expectation. It's reasonable to expect
partitioning to be notably better than inheritance (which I think it
already is) and to provide a good base for future work (which I think
it does), but I think getting them to behave exactly like normal
tables (except for the things we want to be different) will take
another ten years of development work.

In my understanding the issue is that UPDATEs will fail to update a
row when a valid row exists in the case where a row moved between
partitions; that behaviour will be different to a standard table.

Right, when at READ COMMITTED and EvalPlanQual would have happened otherwise.

It is of course very good that we have something ready for this
release and can make a choice of what to do.

Thoughts

1. Reuse the tuple state HEAP_MOVED_OFF, which IIRC represents almost
exactly the same thing. An UPDATE which gets to a
HEAP_MOVED_OFF tuple will know to re-find the tuple via the partition
metadata, or I might be persuaded that in-this-release it is
acceptable to fail when this occurs with an ERROR and a retryable
SQLCODE, since the UPDATE will succeed on next execution.

I've got my doubts about whether we can make that bit work that way,
considering that we still support pg_upgrade (possibly in multiple
steps) from old releases that had VACUUM FULL. We really ought to put
some work into reclaiming those old bits, but there's probably no time
for that in v10.

2. I know that DB2 handles this by having the user specify WITH ROW
MOVEMENT to explicitly indicate they accept the issue and want update
to work even with that. We could have an explicit option to allow
that. This appears to be the only way we could avoid silent errors for
foreign table partitions.

Yeah, that's a thought. We could give people a choice between (a)
updates that cause rows to move between partitions just fail and (b)
such updates work but with EPQ-related deficiencies. I had previously
thought that, given those two choices, everybody would like (b) better
than (a), but maybe not.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#23David G. Johnston
david.g.johnston@gmail.com
In reply to: Simon Riggs (#21)
Re: UPDATE of partition key

On Friday, February 24, 2017, Simon Riggs <simon@2ndquadrant.com> wrote:

2. I know that DB2 handles this by having the user specify WITH ROW
MOVEMENT to explicitly indicate they accept the issue and want update
to work even with that. We could have an explicit option to allow
that. This appears to be the only way we could avoid silent errors for
foreign table partitions.

This does, however, make the partitioning very non-transparent to every
update query just because it is remotely possible a partition-moving update
might occur concurrently.

I dislike an error. I'd say that making partition "just work" here is
material for another patch. In this one an update of the partition key can
be documented as shorthand for delete-returning-insert with all the
limitations that go with that. If someone acceptably solves the
ctid following logic later it can be committed - I'm assuming there would
be no complaints to making things just work in a case where they only sorta
worked.

David J.

#24Greg Stark
stark@mit.edu
In reply to: David G. Johnston (#23)
Re: UPDATE of partition key

On 24 February 2017 at 14:57, David G. Johnston
<david.g.johnston@gmail.com> wrote:

I dislike an error. I'd say that making partition "just work" here is
material for another patch. In this one an update of the partition key can
be documented as shorthand for delete-returning-insert with all the
limitations that go with that. If someone acceptably solves the ctid
following logic later it can be committed - I'm assuming there would be no
complaints to making things just work in a case where they only sorta
worked.

Personally I don't think there's any hope that there will ever be
cross-table ctids links. Maybe one day there will be a major new table
storage format with very different capabilities than today but in the
current architecture it seems like an impossible leap.

I would expect everyone to come to terms with the basic idea that
partition key updates are always going to be a corner case. The user
defined the partition key and the docs should carefully explain to
them the impact of that definition. As long as that explanation gives
them something they can work with and manage the consequences of
that's going to be fine.

What I'm concerned about is that silently giving "wrong" answers in
regular queries -- not even ones doing the partition key updates -- is
something the user can't really manage. They have no way to rewrite
the query to avoid the problem if some other user or part of their
system is updating partition keys. They have no way to know the
problem is even occurring.

Just to spell it out -- it's not just "no-op updates" where the user
sees 0 records updated. If I update all records where
username='stark', perhaps to set the "user banned" flag and get back
"9 records updated" and later find out that I missed a record because
someone changed the department_id while my query was running -- how
would I even know? How could I possibly rewrite my query to avoid
that?

The reason I suggested throwing a serialization failure was because I
thought that would be the easiest short-cut to the problem. I had
imagined that having a bit pattern indicating such a move would
be a pretty minor change. I would actually consider
using a normal update bitmask with InvalidBlockId in the ctid to
indicate the tuple was updated and the target of the chain isn't
available. That may be something we'll need in the future for other
cases too.
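As a hedged sketch of that idea (the marker value and the dict-based chain layout are assumptions for illustration, not an actual on-disk format), here is a model of an update-chain follower that treats a ctid whose block number is invalid as "the new version moved to another partition" and raises a serialization failure instead of silently losing the row:

```python
INVALID_BLOCK = 0xFFFFFFFF  # stand-in for InvalidBlockNumber


class SerializationFailure(Exception):
    pass


def follow_update_chain(tuples, tid):
    """Follow t_ctid links from tid to the latest tuple version.

    tuples maps tid -> {"ctid": next_tid, "data": ...}.  A ctid equal to
    the tuple's own tid ends the chain; a ctid whose block component is
    INVALID_BLOCK means "updated, but the successor is in another
    partition", which we surface as an error rather than a silent miss.
    """
    while True:
        tup = tuples[tid]
        nxt = tup["ctid"]
        if nxt == tid:            # end of chain: this is the latest version
            return tup["data"]
        if nxt[0] == INVALID_BLOCK:
            raise SerializationFailure(
                "tuple to be followed was moved to another partition")
        tid = nxt
```

The point of the sketch is only the control flow: a reader that follows chains today can detect the marker and retry, instead of behaving as if no update happened.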

Throwing an error means the user has to retry their query but that's
at least something they can do. Even if they don't do it automatically
the ultimate user will probably just retry whatever operation errored
out anyways. But at least their database isn't logically corrupted.

--
greg

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#25David G. Johnston
david.g.johnston@gmail.com
In reply to: Greg Stark (#24)
Re: UPDATE of partition key

On Sat, Feb 25, 2017 at 11:11 AM, Greg Stark <stark@mit.edu> wrote:

On 24 February 2017 at 14:57, David G. Johnston
<david.g.johnston@gmail.com> wrote:

I dislike an error. I'd say that making partition "just work" here is
material for another patch. In this one an update of the partition key can
be documented as shorthand for delete-returning-insert with all the
limitations that go with that. If someone acceptably solves the ctid
following logic later it can be committed - I'm assuming there would be no
complaints to making things just work in a case where they only sorta
worked.

Personally I don't think there's any hope that there will ever be
cross-table ctid links. Maybe one day there will be a major new table
storage format with very different capabilities than today but in the
current architecture it seems like an impossible leap.

How about making it work without a physical token? For instance,
let the server recognize the serialization error but, instead of returning
it to the client, retry the query itself.

I would expect everyone to come to terms with the basic idea that
partition key updates are always going to be a corner case. The user
defined the partition key and the docs should carefully explain to
them the impact of that definition. As long as that explanation gives
them something they can work with and manage the consequences of
that's going to be fine.

What I'm concerned about is that silently giving "wrong" answers in
regular queries -- not even ones doing the partition key updates -- is
something the user can't really manage. They have no way to rewrite
the query to avoid the problem if some other user or part of their
system is updating partition keys. They have no way to know the
problem is even occurring.

Just to spell it out -- it's not just "no-op updates" where the user
sees 0 records updated. If I update all records where
username='stark', perhaps to set the "user banned" flag and get back
"9 records updated" and later find out that I missed a record because
someone changed the department_id while my query was running -- how
would I even know? How could I possibly rewrite my query to avoid
that?

But my point is that this isn't a regression from current behavior. If I
deleted one of those starks and re-inserted them with a different
department_id, that brand new record wouldn't be banned. In short, my take
on this patch is that it is a performance optimization. Making the UPDATE
command actually work as part of its implementation detail is a happy
byproduct.

From the POV of an external observer it doesn't have to matter whether the
update or delete-insert SQL was used. It would be nice if the UPDATE
version could keep logical identity maintained but that is a feature
enhancement.

Failing if the other session used the UPDATE SQL isn't wrong, and I'm not
against it; I just don't believe that it is better than maintaining the
status quo semantics.

That said my concurrency-fu is not that strong and I don't really have a
practical reason to prefer one over the other - thus I fall back on
maintaining internal consistency.

IIUC it is already possible, for those who care to do so, to get a
serialization failure in this scenario by upgrading isolation to repeatable
read.

David J.

#26Robert Haas
robertmhaas@gmail.com
In reply to: Greg Stark (#24)
Re: UPDATE of partition key

On Sat, Feb 25, 2017 at 11:41 PM, Greg Stark <stark@mit.edu> wrote:

What I'm concerned about is that silently giving "wrong" answers in
regular queries -- not even ones doing the partition key updates -- is
something the user can't really manage. They have no way to rewrite
the query to avoid the problem if some other user or part of their
system is updating partition keys. They have no way to know the
problem is even occurring.

That's a reasonable concern, but it's not like EvalPlanQual works
perfectly today and never causes any application-visible
inconsistencies that end up breaking things. As the documentation
says:

----
Because of the above rules, it is possible for an updating command to
see an inconsistent snapshot: it can see the effects of concurrent
updating commands on the same rows it is trying to update, but it does
not see effects of those commands on other rows in the database. This
behavior makes Read Committed mode unsuitable for commands that
involve complex search conditions; however, it is just right for
simpler cases.
----

Maybe I've just spent too long hanging out with Kevin Grittner, but
I've come to view our EvalPlanQual behavior as pretty rickety and
unreliable in general. For example, consider the fact that when I
spent over a year and approximately 1 gazillion email messages trying
to hammer out how join pushdown was going to handle EPQ rechecks, we
discovered that the FDW API wasn't actually handling those correctly
even for scans of single tables, hence commit
5fc4c26db5120bd90348b6ee3101fcddfdf54800. I'm not saying that time
and effort wasn't well-spent, but I wonder whether it's necessary to
hold partitioned tables to a higher standard than that to which the
FDW interface was held for the first 4.5 years of its life. Perhaps
it is good for us to do that, but I'm not 100% convinced. It seems
like we decide to worry about EvalPlanQual in some cases and not in
others more or less arbitrarily.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#27Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Langote (#17)
Re: UPDATE of partition key

On 23 February 2017 at 16:02, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

2. In the patch, as part of the row movement, ExecDelete() is called
followed by ExecInsert(). This is done that way, because we want to
have the ROW triggers on that (sub)partition executed. If a user has
explicitly created DELETE and INSERT BR triggers for this partition, I
think we should run those. While at the same time, another question
is, what about UPDATE trigger on the same table ? Here again, one can
argue that because this UPDATE has been transformed into a
DELETE-INSERT, we should not run UPDATE trigger for row-movement. But
there can be a counter-argument. For e.g. if a user needs to make sure
about logging updates of particular columns of a row, he will expect
the logging to happen even when that row was transparently moved. In
the patch, I have retained the firing of UPDATE BR trigger.

What of UPDATE AR triggers?

I think it does not make sense to run AFTER ROW triggers in case of
row-movement: no update has happened on that leaf partition. This
reasoning could also apply to BR update triggers, but the reasons for
having a BR trigger and AR triggers are quite different. Generally, a
user needs to make some modifications to the row before the final NEW
row gets into the database, and hence [s]he defines a BR trigger for
that. And we can't just silently skip this step only because the final
row went into some other partition; in fact the row-movement itself
might depend on what the BR trigger did with the row. Whereas AR
triggers are typically written for doing some other operation once it
is made sure the row is actually updated. In case of row-movement, it
is not actually updated.

As a comment on how row-movement is being handled in code, I wonder if it
could be made to look similar structurally to the code in ExecInsert()
that handles ON CONFLICT DO UPDATE. That is,

if (partition constraint fails)
{
    /* row movement */
}
else
{
    /* ExecConstraints() */
    /* heap_update(), EvalPlanQual(), and ExecInsertIndexTuples() */
}

I guess this is what has effectively been done for row movement, no?

Looking at that, I found that in the current patch, if there is no
row-movement happening, ExecPartitionCheck() effectively gets called
twice: first when ExecPartitionCheck() is explicitly called for the
row-movement-required check, and a second time in the ExecConstraints()
call. Maybe there should be two separate functions,
ExecCheckConstraints() and ExecPartitionConstraints(), and also an
ExecConstraints() that just calls both. This way we can call the
appropriate functions accordingly in the row-movement case, and the
other callers would continue to call ExecConstraints().
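One way to picture the suggested split, sketched in Python rather than the executor's C (the predicate-based bodies and the combined entry point are invented for illustration; the function names come from the mail): partition-constraint checking and CHECK-constraint checking become independently callable, with a combined entry point for existing callers, so the row-movement path can run the partition check exactly once.

```python
def exec_partition_constraints(row, partition_pred):
    """Check only the partition constraint (one predicate)."""
    return partition_pred(row)


def exec_check_constraints(row, check_preds):
    """Check only the table's CHECK constraints (a list of predicates)."""
    return all(p(row) for p in check_preds)


def exec_constraints(row, partition_pred, check_preds):
    """Combined entry point: existing callers keep calling this one."""
    return (exec_partition_constraints(row, partition_pred)
            and exec_check_constraints(row, check_preds))
```

The row-movement path would call `exec_partition_constraints` up front to decide whether to move, then only `exec_check_constraints` afterwards, avoiding the duplicate partition check.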

I see that ExecConstraint() won't get called on the source partition's
constraints if row movement occurs. Maybe, that's unnecessary because the
new row won't be inserted into that partition anyway.

Yes I agree.

ExecWithCheckOptions() for RLS update check does happen *before* row
movement though.

Yes. I think we should do it anyways.

3. In case of a concurrent update/delete, suppose session A has locked
the row for deleting it. Now a session B has decided to update this
row and that is going to cause row movement, which means it will
delete it first. But when session A is finished deleting it, session B
finds that it is already deleted. In such case, it should not go ahead
with inserting a new row as part of the row movement. For that, I have
added a new parameter 'already_delete' for ExecDelete().

Makes sense. Maybe: already_deleted -> concurrently_deleted.

Right, concurrently_deleted sounds more accurate. In the next patch, I
will change that.

Of course, this still won't completely solve the concurrency anomaly.
In the above case, the UPDATE of session B gets lost. Maybe, for a
user who cannot tolerate this, we can have a table-level option
that disallows row movement, or that causes an error to be thrown for
one of the concurrent sessions.

Will this table-level option be specified for a partitioned table once or
for individual partitions?

My opinion is, if we decide to have a table-level option, it should be
on the root partitioned table, to keep it simple.

4. The ExecSetupPartitionTupleRouting() is re-used for routing the row
that is to be moved. So in ExecInitModifyTable(), we call
ExecSetupPartitionTupleRouting() even for UPDATE. We can also do this
only during execution time for the very first time we find that we
need to do a row movement. I will think over that, but I am thinking
it might complicate things, as compared to always doing the setup for
UPDATE. Will check on that.

Hmm. ExecSetupPartitionTupleRouting(), which does significant amount of
setup work, is fine being called in ExecInitModifyTable() in the insert
case because there are often cases where that's a bulk-insert and hence
cost of the setup work is amortized. Updates, OTOH, are seldom done in a
bulk manner. So that might be an argument for doing it late only when
needed.

Yes, agreed.

But that starts to sound less attractive when one realizes that
that will occur for every row that wants to move.

If we manage to call ExecSetupPartitionTupleRouting() during execution
phase only once for the very first time we find the update requires
row movement, then we can re-use the info.

One more thing I noticed: in the update-returning case, ExecDelete()
will also generate the RETURNING result, which we then discard. This is
a waste. We should not even process RETURNING in the ExecDelete() called
for row movement; the RETURNING should be processed only for
ExecInsert().
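A minimal sketch, in Python rather than executor C, of the row-movement control flow discussed in this message (the dict-based "tables" and helper names are invented for illustration): the DELETE half reports whether the old row was concurrently deleted, in which case the INSERT half, and any RETURNING processing, is skipped so the row is not resurrected.

```python
def exec_delete(table, row_id):
    """Delete row_id from table; return True iff someone else already did."""
    concurrently_deleted = row_id not in table
    if not concurrently_deleted:
        del table[row_id]
    return concurrently_deleted


def move_row(src, dst, row_id, new_row):
    """UPDATE performed as DELETE + INSERT across partitions src -> dst.

    Returns the RETURNING row, produced only by the INSERT half; returns
    None (no insert, no RETURNING) if the row was concurrently deleted.
    """
    if exec_delete(src, row_id):
        return None
    dst[row_id] = new_row
    return new_row
```

The design point mirrors the mail: RETURNING belongs to the INSERT half only, and a concurrent deletion short-circuits the whole movement.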

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


#28Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: David G. Johnston (#25)
Re: UPDATE of partition key

On 2017/02/26 4:01, David G. Johnston wrote:

IIUC it is already possible, for those who care to do so, to get a
serialization failure in this scenario by upgrading isolation to repeatable
read.

Maybe, this can be added as a note in the documentation.

Thanks,
Amit


#29Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#27)
Re: UPDATE of partition key

On Thu, Mar 2, 2017 at 11:53 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

I think it does not make sense running after row triggers in case of
row-movement. There is no update happened on that leaf partition. This
reasoning can also apply to BR update triggers. But the reasons for
having a BR trigger and AR triggers are quite different. Generally, a
user needs to do some modifications to the row before getting the
final NEW row into the database, and hence [s]he defines a BR trigger
for that. And we can't just silently skip this step only because the
final row went into some other partition; in fact the row-movement
itself might depend on what the BR trigger did with the row. Whereas,
AR triggers are typically written for doing some other operation once
it is made sure the row is actually updated. In case of row-movement,
it is not actually updated.

How about running the BR update triggers for the old partition and the
AR update triggers for the new partition? It seems weird to run BR
update triggers but not AR update triggers. Another option would be
to run BR and AR delete triggers and then BR and AR insert triggers,
emphasizing the choice to treat this update as a delete + insert, but
(as Amit Kh. pointed out to me when we were in a room together this
week) that precludes using the BEFORE trigger to modify the row.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#30Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Khandekar (#27)
Re: UPDATE of partition key

Hi,

On 2017/03/02 15:23, Amit Khandekar wrote:

On 23 February 2017 at 16:02, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

2. In the patch, as part of the row movement, ExecDelete() is called
followed by ExecInsert(). This is done that way, because we want to
have the ROW triggers on that (sub)partition executed. If a user has
explicitly created DELETE and INSERT BR triggers for this partition, I
think we should run those. While at the same time, another question
is, what about UPDATE trigger on the same table ? Here again, one can
argue that because this UPDATE has been transformed into a
DELETE-INSERT, we should not run UPDATE trigger for row-movement. But
there can be a counter-argument. For e.g. if a user needs to make sure
about logging updates of particular columns of a row, he will expect
the logging to happen even when that row was transparently moved. In
the patch, I have retained the firing of UPDATE BR trigger.

What of UPDATE AR triggers?

I think it does not make sense running after row triggers in case of
row-movement. There is no update happened on that leaf partition. This
reasoning can also apply to BR update triggers. But the reasons for
having a BR trigger and AR triggers are quite different. Generally, a
user needs to do some modifications to the row before getting the
final NEW row into the database, and hence [s]he defines a BR trigger
for that. And we can't just silently skip this step only because the
final row went into some other partition; in fact the row-movement
itself might depend on what the BR trigger did with the row. Whereas,
AR triggers are typically written for doing some other operation once
it is made sure the row is actually updated. In case of row-movement,
it is not actually updated.

OK, so it'd be better to clarify in the documentation that that's the case.

As a comment on how row-movement is being handled in code, I wonder if it
could be made to look similar structurally to the code in ExecInsert()
that handles ON CONFLICT DO UPDATE. That is,

if (partition constraint fails)
{
    /* row movement */
}
else
{
    /* ExecConstraints() */
    /* heap_update(), EvalPlanQual(), and ExecInsertIndexTuples() */
}

I guess this is what has been effectively done for row movement, no ?

Yes, although it seems nice how the formatting of the code in ExecInsert()
makes it apparent that they are distinct code paths. OTOH, the additional
diffs caused by the suggested formatting might confuse other reviewers.

Looking at that, I found that in the current patch, if there is no
row-movement happening, ExecPartitionCheck() effectively gets called
twice: first when ExecPartitionCheck() is explicitly called for the
row-movement-required check, and a second time in the ExecConstraints()
call. Maybe there should be two separate functions,
ExecCheckConstraints() and ExecPartitionConstraints(), and also an
ExecConstraints() that just calls both. This way we can call the
appropriate functions accordingly in the row-movement case, and the
other callers would continue to call ExecConstraints().

One random idea: we could add a bool ri_PartitionCheckOK which is set to
true after it is checked in ExecConstraints(). And modify the condition
in ExecConstraints() as follows:

if (resultRelInfo->ri_PartitionCheck &&
+ !resultRelInfo->ri_PartitionCheckOK &&
!ExecPartitionCheck(resultRelInfo, slot, estate))

3. In case of a concurrent update/delete, suppose session A has locked
the row for deleting it. Now a session B has decided to update this
row and that is going to cause row movement, which means it will
delete it first. But when session A is finished deleting it, session B
finds that it is already deleted. In such case, it should not go ahead
with inserting a new row as part of the row movement. For that, I have
added a new parameter 'already_delete' for ExecDelete().

Makes sense. Maybe: already_deleted -> concurrently_deleted.

Right, concurrently_deleted sounds more accurate. In the next patch, I
will change that.

Okay, thanks.

Of course, this still won't completely solve the concurrency anomaly.
In the above case, the UPDATE of Session B gets lost. May be, for a
user that does not tolerate this, we can have a table-level option
that disallows row movement, or will cause an error to be thrown for
one of the concurrent session.

Will this table-level option be specified for a partitioned table once or
for individual partitions?

My opinion is, if decide to have table-level option, it should be on
the root partition, to keep it simple.

I see.

But that starts to sound less attractive when one realizes that
that will occur for every row that wants to move.

If we manage to call ExecSetupPartitionTupleRouting() during execution
phase only once for the very first time we find the update requires
row movement, then we can re-use the info.

That might work, too. But I guess we're going with initialization in
ExecInitModifyTable().

One more thing I noticed is that, in case of update-returning, the
ExecDelete() will also generate result of RETURNING, which we are
discarding. So this is a waste. We should not even process RETURNING
in ExecDelete() called for row-movement. The RETURNING should be
processed only for ExecInsert().

I wonder if it makes sense to have ExecDeleteInternal() and
ExecInsertInternal(), which perform the core function of DELETE and
INSERT, respectively. Such as running triggers, checking constraints,
etc. The RETURNING part is controllable by the statement, so it will be
handled by the ExecDelete() and ExecInsert(), like it is now.

When called from ExecUpdate() as part of row-movement, they perform just
the core part and leave the rest to be done by ExecUpdate() itself.

Thanks,
Amit


#31Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Langote (#30)
Re: UPDATE of partition key

I haven't yet handled all points, but meanwhile, some of the important
points are discussed below ...

On 6 March 2017 at 15:11, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

But that starts to sound less attractive when one realizes that
that will occur for every row that wants to move.

If we manage to call ExecSetupPartitionTupleRouting() during execution
phase only once for the very first time we find the update requires
row movement, then we can re-use the info.

That might work, too. But I guess we're going with initialization in
ExecInitModifyTable().

I am more worried about this: even the UPDATEs that do not involve row
movement would do the expensive setup. So do it only once when we find
that we need to move the row. Something like this:

ExecUpdate()
{
    ....
    if (resultRelInfo->ri_PartitionCheck &&
        !ExecPartitionCheck(resultRelInfo, slot, estate))
    {
        bool already_deleted;

        ExecDelete(tupleid, oldtuple, planSlot, epqstate, estate,
                   &already_deleted, canSetTag);

        if (already_deleted)
            return NULL;
        else
        {
            /* If we haven't already built the state for INSERT
             * tuple routing, build it now */
            if (!mtstate->mt_partition_dispatch_info)
            {
                ExecSetupPartitionTupleRouting(
                    mtstate->resultRelInfo->ri_RelationDesc,
                    &mtstate->mt_partition_dispatch_info,
                    &mtstate->mt_partitions,
                    &mtstate->mt_partition_tupconv_maps,
                    &mtstate->mt_partition_tuple_slot,
                    &mtstate->mt_num_dispatch,
                    &mtstate->mt_num_partitions);
            }

            return ExecInsert(mtstate, slot, planSlot, NULL,
                              ONCONFLICT_NONE, estate, false);
        }
    }
    ...
}
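The lazy, at-most-once setup being proposed can be sketched outside the executor like this (a Python stand-in; `setup_tuple_routing` is a hypothetical substitute for ExecSetupPartitionTupleRouting(), and the counter exists only to make the example checkable): the routing state is built on the first row that actually needs to move, and reused for every later one.

```python
class ModifyTableState:
    """Tiny stand-in for the executor's per-statement state."""
    def __init__(self):
        self.partition_dispatch_info = None
        self.setup_calls = 0          # instrumentation for the example


def setup_tuple_routing(mtstate):
    """Expensive one-time setup (stand-in for the real routing setup)."""
    mtstate.setup_calls += 1
    mtstate.partition_dispatch_info = {"routing": "ready"}


def exec_update(mtstate, needs_movement):
    if needs_movement:
        if mtstate.partition_dispatch_info is None:   # first mover only
            setup_tuple_routing(mtstate)
        return "moved"
    return "updated in place"
```

A stream of in-place updates never pays for the setup at all; a stream containing several movers pays for it exactly once.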

One more thing I noticed is that, in case of update-returning, the
ExecDelete() will also generate result of RETURNING, which we are
discarding. So this is a waste. We should not even process RETURNING
in ExecDelete() called for row-movement. The RETURNING should be
processed only for ExecInsert().

I wonder if it makes sense to have ExecDeleteInternal() and
ExecInsertInternal(), which perform the core function of DELETE and
INSERT, respectively. Such as running triggers, checking constraints,
etc. The RETURNING part is controllable by the statement, so it will be
handled by the ExecDelete() and ExecInsert(), like it is now.

When called from ExecUpdate() as part of row-movement, they perform just
the core part and leave the rest to be done by ExecUpdate() itself.

Yes, if we decide to execute only the core insert/delete operations
and skip the triggers, then there is a compelling reason to have
something like ExecDeleteInternal() and ExecInsertInternal(). In fact,
I was about to start doing the same, except for the below discussion
...

On 4 March 2017 at 12:49, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 2, 2017 at 11:53 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

I think it does not make sense running after row triggers in case of
row-movement. There is no update happened on that leaf partition. This
reasoning can also apply to BR update triggers. But the reasons for
having a BR trigger and AR triggers are quite different. Generally, a
user needs to do some modifications to the row before getting the
final NEW row into the database, and hence [s]he defines a BR trigger
for that. And we can't just silently skip this step only because the
final row went into some other partition; in fact the row-movement
itself might depend on what the BR trigger did with the row. Whereas,
AR triggers are typically written for doing some other operation once
it is made sure the row is actually updated. In case of row-movement,
it is not actually updated.

How about running the BR update triggers for the old partition and the
AR update triggers for the new partition? It seems weird to run BR
update triggers but not AR update triggers. Another option would be
to run BR and AR delete triggers and then BR and AR insert triggers,
emphasizing the choice to treat this update as a delete + insert, but
(as Amit Kh. pointed out to me when we were in a room together this
week) that precludes using the BEFORE trigger to modify the row.

I checked the trigger behaviour in case of UPSERT. Here, when a
conflict is found, ExecOnConflictUpdate() is called, and then the
function returns immediately, which means AR INSERT trigger will not
fire. And ExecOnConflictUpdate() calls ExecUpdate(), which means BR
and AR UPDATE triggers will be fired. So in short, when an INSERT
becomes an UPDATE, BR INSERT triggers do fire, but then the BR UPDATE
and AR UPDATE also get fired. On the same lines, it makes sense in
case of UPDATE=>DELETE+INSERT operation to fire a BR UPDATE trigger on
the original table, and then the BR and AR DELETE/INSERT triggers on
the respective tables.

So the common policy can be:
fire the BR trigger; it can be an INSERT/UPDATE/DELETE trigger depending
upon what the statement is.
If there is a change in the operation, then, according to what the
operation is converted to (UPDATE -> DELETE+INSERT, or INSERT -> UPDATE),
all the respective triggers would be fired.
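That policy can be stated as a small table-driven function (a Python model of the behaviour described above, not actual trigger machinery; the operation and trigger labels are illustrative):

```python
def triggers_to_fire(statement, transformed_to=None):
    """Return the row-trigger sequence for a statement, per the policy:
    fire the BR trigger of the original statement, and if the operation
    was transformed, the full trigger set of what it became."""
    if transformed_to is None:
        return ["BR " + statement, "AR " + statement]
    if (statement, transformed_to) == ("UPDATE", "DELETE+INSERT"):
        # row movement: BR UPDATE on the source partition, then DELETE
        # and INSERT triggers on the respective partitions
        return ["BR UPDATE", "BR DELETE", "AR DELETE",
                "BR INSERT", "AR INSERT"]
    if (statement, transformed_to) == ("INSERT", "UPDATE"):
        # UPSERT: BR INSERT fires, then BR and AR UPDATE
        return ["BR INSERT", "BR UPDATE", "AR UPDATE"]
    raise ValueError("unmodelled transformation")
```

Note how the UPSERT row matches the observed ExecOnConflictUpdate() behaviour quoted above, and the row-movement row is its mirror image.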


#32Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#31)
1 attachment(s)
Re: UPDATE of partition key

On 17 March 2017 at 16:07, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 6 March 2017 at 15:11, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

But that starts to sound less attractive when one realizes that
that will occur for every row that wants to move.

If we manage to call ExecSetupPartitionTupleRouting() during execution
phase only once for the very first time we find the update requires
row movement, then we can re-use the info.

That might work, too. But I guess we're going with initialization in
ExecInitModifyTable().

I am more worried about this: even the UPDATEs that do not involve row
movement would do the expensive setup. So do it only once when we find
that we need to move the row. Something like this:

ExecUpdate()
{
    ....
    if (resultRelInfo->ri_PartitionCheck &&
        !ExecPartitionCheck(resultRelInfo, slot, estate))
    {
        bool already_deleted;

        ExecDelete(tupleid, oldtuple, planSlot, epqstate, estate,
                   &already_deleted, canSetTag);

        if (already_deleted)
            return NULL;
        else
        {
            /* If we haven't already built the state for INSERT
             * tuple routing, build it now */
            if (!mtstate->mt_partition_dispatch_info)
            {
                ExecSetupPartitionTupleRouting(
                    mtstate->resultRelInfo->ri_RelationDesc,
                    &mtstate->mt_partition_dispatch_info,
                    &mtstate->mt_partitions,
                    &mtstate->mt_partition_tupconv_maps,
                    &mtstate->mt_partition_tuple_slot,
                    &mtstate->mt_num_dispatch,
                    &mtstate->mt_num_partitions);
            }

            return ExecInsert(mtstate, slot, planSlot, NULL,
                              ONCONFLICT_NONE, estate, false);
        }
    }
    ...
}

Attached is a v2 patch that implements the above optimization. Now, for
UPDATE, ExecSetupPartitionTupleRouting() will be called only if row
movement is needed.

We have to open an extra relation for the root partition and keep it
open, with its handle stored in mt_partition_dispatch_info[0]->reldesc.
ExecEndModifyTable() therefore closes it if it is different from
node->resultRelInfo->ri_RelationDesc; if it is the same, it should not
be closed there, because it gets closed as part of ExecEndPlan().

Attachments:

update-partition-key_v2.patch (application/octet-stream)
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 023ea00..2cb9914 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1764,7 +1764,7 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
  *
  * Note: This is called *iff* resultRelInfo is the main target table.
  */
-static bool
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 29c6a6e..da1eb2f 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -625,6 +625,7 @@ ExecDelete(ItemPointer tupleid,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool   *already_deleted,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -633,6 +634,9 @@ ExecDelete(ItemPointer tupleid,
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
 
+	if (already_deleted)
+		*already_deleted = false;
+
 	/*
 	 * get information on the (current) result relation
 	 */
@@ -776,6 +780,8 @@ ldelete:;
 					}
 				}
 				/* tuple already deleted; nothing to do */
+				if (already_deleted)
+					*already_deleted = true;
 				return NULL;
 
 			default:
@@ -878,7 +884,8 @@ ldelete:;
  * ----------------------------------------------------------------
  */
 static TupleTableSlot *
-ExecUpdate(ItemPointer tupleid,
+ExecUpdate(ModifyTableState *mtstate,
+		   ItemPointer tupleid,
 		   HeapTuple oldtuple,
 		   TupleTableSlot *slot,
 		   TupleTableSlot *planSlot,
@@ -962,7 +969,7 @@ ExecUpdate(ItemPointer tupleid,
 	}
 	else
 	{
-		LockTupleMode lockmode;
+		LockTupleMode	lockmode;
 
 		/*
 		 * Constraints might reference the tableoid column, so initialize
@@ -987,6 +994,66 @@ lreplace:;
 			ExecWithCheckOptions(WCO_RLS_UPDATE_CHECK,
 								 resultRelInfo, slot, estate);
 
+		if (resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate))
+		{
+			bool	is_partitioned_table = true;
+
+			if (!mtstate->mt_partition_dispatch_info)
+			{
+				ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+				Relation root_rel;
+
+				/* root table RT index is at the head of partitioned_rels */
+				if (node->partitioned_rels)
+				{
+					Index	root_rti;
+					Oid		root_oid;
+
+					root_rti = linitial_int(node->partitioned_rels);
+					root_oid = getrelid(root_rti, estate->es_range_table);
+					root_rel = heap_open(root_oid, NoLock);	/* locked by InitPlan */
+				}
+				else
+					root_rel = mtstate->resultRelInfo->ri_RelationDesc;
+
+				is_partitioned_table =
+					root_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;
+
+				if (is_partitioned_table)
+					ExecSetupPartitionTupleRouting(
+										root_rel,
+										&mtstate->mt_partition_dispatch_info,
+										&mtstate->mt_partitions,
+										&mtstate->mt_partition_tupconv_maps,
+										&mtstate->mt_partition_tuple_slot,
+										&mtstate->mt_num_dispatch,
+										&mtstate->mt_num_partitions);
+			}
+
+			/*
+			 * If it's not a partitioned table after all, let it fall through
+			 * the usual error handling.
+			 */
+			if (is_partitioned_table)
+			{
+				bool	already_deleted;
+
+				ExecDelete(tupleid, oldtuple, planSlot, epqstate, estate,
+						   &already_deleted, canSetTag);
+
+				if (already_deleted)
+					return NULL;
+
+				/*
+				 * Don't let estate->es_processed be updated again; ExecDelete()
+				 * has already done it above. So use canSetTag=false.
+				 */
+				return ExecInsert(mtstate, slot, planSlot, NULL,
+									  ONCONFLICT_NONE, estate, false);
+			}
+		}
+
 		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
@@ -1313,7 +1380,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 	 */
 
 	/* Execute UPDATE with projection */
-	*returning = ExecUpdate(&tuple.t_self, NULL,
+	*returning = ExecUpdate(mtstate, &tuple.t_self, NULL,
 							mtstate->mt_conflproj, planSlot,
 							&mtstate->mt_epqstate, mtstate->ps.state,
 							canSetTag);
@@ -1583,12 +1650,12 @@ ExecModifyTable(ModifyTableState *node)
 								  estate, node->canSetTag);
 				break;
 			case CMD_UPDATE:
-				slot = ExecUpdate(tupleid, oldtuple, slot, planSlot,
+				slot = ExecUpdate(node, tupleid, oldtuple, slot, planSlot,
 								&node->mt_epqstate, estate, node->canSetTag);
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(tupleid, oldtuple, planSlot,
-								&node->mt_epqstate, estate, node->canSetTag);
+								&node->mt_epqstate, estate, NULL, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -2154,10 +2221,19 @@ ExecEndModifyTable(ModifyTableState *node)
 	 * Close all the partitioned tables, leaf partitions, and their indices
 	 *
 	 * Remember node->mt_partition_dispatch_info[0] corresponds to the root
-	 * partitioned table, which we must not try to close, because it is the
-	 * main target table of the query that will be closed by ExecEndPlan().
-	 * Also, tupslot is NULL for the root partitioned table.
+	 * partitioned table, which should not be closed if it is the main target
+	 * table of the query, which will be closed by ExecEndPlan(). Also, tupslot
+	 * is NULL for the root partitioned table.
 	 */
+	if (node->mt_num_dispatch > 0)
+	{
+		Relation	root_partition;
+
+		root_partition = node->mt_partition_dispatch_info[0]->reldesc;
+		if (root_partition != node->resultRelInfo->ri_RelationDesc)
+			heap_close(root_partition, NoLock);
+	}
+
 	for (i = 1; i < node->mt_num_dispatch; i++)
 	{
 		PartitionDispatch pd = node->mt_partition_dispatch_info[i];
@@ -2165,6 +2241,7 @@ ExecEndModifyTable(ModifyTableState *node)
 		heap_close(pd->reldesc, NoLock);
 		ExecDropSingleTupleTableSlot(pd->tupslot);
 	}
+
 	for (i = 0; i < node->mt_num_partitions; i++)
 	{
 		ResultRelInfo *resultRelInfo = node->mt_partitions + i;
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index e64d6fb..d42210e 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -224,6 +224,9 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate);
 
 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
 extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 9366f04..99c8046 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -209,13 +209,12 @@ create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to
 create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
 insert into part_a_1_a_10 values ('a', 1);
 insert into part_b_10_b_20 values ('b', 10);
--- fail
+-- fail (row movement happens only within the partition subtree)
 update part_a_1_a_10 set a = 'b' where a = 'a';
 ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
 DETAIL:  Failing row contains (b, 1).
+-- ok (row movement)
 update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
 -- ok
 update range_parted set b = b + 1 where b = 10;
 -- cleanup
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 6637119..7667793 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -119,8 +119,9 @@ create table part_b_10_b_20 partition of range_parted for values from ('b', 10)
 insert into part_a_1_a_10 values ('a', 1);
 insert into part_b_10_b_20 values ('b', 10);
 
--- fail
+-- fail (row movement happens only within the partition subtree)
 update part_a_1_a_10 set a = 'b' where a = 'a';
+-- ok (row movement)
 update range_parted set b = b - 1 where b = 10;
 -- ok
 update range_parted set b = b + 1 where b = 10;
#33Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Khandekar (#32)
Re: UPDATE of partition key

Hi Amit,

Thanks for the updated patch.

On 2017/03/23 3:09, Amit Khandekar wrote:

Attached is v2 patch which implements the above optimization.

Would it be better to have at least some new tests? Also, there are a few
places in the documentation mentioning that such updates cause error,
which will need to be updated. Perhaps also add some explanatory notes
about the mechanism (delete+insert), trigger behavior, caveats, etc.
There were some points discussed upthread that could be mentioned in the
documentation.

@@ -633,6 +634,9 @@ ExecDelete(ItemPointer tupleid,
HeapUpdateFailureData hufd;
TupleTableSlot *slot = NULL;

+    if (already_deleted)
+        *already_deleted = false;
+

concurrently_deleted?

@@ -962,7 +969,7 @@ ExecUpdate(ItemPointer tupleid,
     }
     else
     {
-        LockTupleMode lockmode;
+        LockTupleMode   lockmode;

Useless hunk.

+            if (!mtstate->mt_partition_dispatch_info)
+            {

The if (pointer == NULL) style is better perhaps.

+                /* root table RT index is at the head of partitioned_rels */
+                if (node->partitioned_rels)
+                {
+                    Index   root_rti;
+                    Oid     root_oid;
+
+                    root_rti = linitial_int(node->partitioned_rels);
+                    root_oid = getrelid(root_rti, estate->es_range_table);
+                    root_rel = heap_open(root_oid, NoLock); /* locked by
InitPlan */
+                }
+                else
+                    root_rel = mtstate->resultRelInfo->ri_RelationDesc;

Some explanatory comments here might be good, for example, explain in what
situations node->partitioned_rels would not have been set and/or vice versa.

Now, for
UPDATE, ExecSetupPartitionTupleRouting() will be called only if row
movement is needed.

We have to open an extra relation for the root partition, and keep it
opened and its handle stored in
mt_partition_dispatch_info[0]->reldesc. So ExecEndModifyTable() closes
this if it is different from node->resultRelInfo->ri_RelationDesc. If
it is same as node->resultRelInfo, it should not be closed because it
gets closed as part of ExecEndPlan().

I guess you're referring to the following hunk. Some comments:

@@ -2154,10 +2221,19 @@ ExecEndModifyTable(ModifyTableState *node)
      * Close all the partitioned tables, leaf partitions, and their indices
      *
      * Remember node->mt_partition_dispatch_info[0] corresponds to the root
-     * partitioned table, which we must not try to close, because it is the
-     * main target table of the query that will be closed by ExecEndPlan().
-     * Also, tupslot is NULL for the root partitioned table.
+     * partitioned table, which should not be closed if it is the main target
+     * table of the query, which will be closed by ExecEndPlan().

The last part could be written as: because it will be closed by ExecEndPlan().

 Also, tupslot
+     * is NULL for the root partitioned table.
      */
+    if (node->mt_num_dispatch > 0)
+    {
+        Relation    root_partition;

root_relation?

+
+        root_partition = node->mt_partition_dispatch_info[0]->reldesc;
+        if (root_partition != node->resultRelInfo->ri_RelationDesc)
+            heap_close(root_partition, NoLock);
+    }

It might be a good idea to Assert inside the if block above that
node->operation != CMD_INSERT. Perhaps, also reflect that in the comment
above so that it's clearer.

I will set the patch to Waiting on Author.

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#34Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Langote (#33)
1 attachment(s)
Re: UPDATE of partition key

Thanks Amit for your review comments. I am yet to handle all of your
comments, but meanwhile, attached is an updated patch that handles
RETURNING.

Earlier it was not working because ExecInsert() did not return any
RETURNING result. This is because the setup needed to create RETURNING
projection info for leaf partitions is done in ExecInitModifyTable()
only in case of INSERT. But because it is an UPDATE operation, we have
to do this explicitly as a one-time operation when it is determined
that row-movement is required. This is similar to how we do one-time
setup of mt_partition_dispatch_info. So in the patch, I have moved
this code into a new function ExecInitPartitionReturningProjection(),
and now this is called in ExecInitModifyTable() as well as during row
movement for ExecInsert() processing the returning clause.

Basically we need to do all that is done in ExecInitModifyTable() for
INSERT. There are a couple of other things that I suspect might
need to be done as part of the missing initialization for ExecInsert()
during row-movement :
1. Junk filter handling
2. WITH CHECK OPTION

Yet, ExecDelete() during row-movement is still returning the RETURNING
result redundantly, which I am yet to handle.
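To illustrate, using the range_parted setup from the regression tests in
this patch (the RETURNING clause here is an illustrative sketch, not one
of the tests actually included):

```sql
-- range_parted is range-partitioned on (a, b); the row ('b', 10)
-- initially lives in part_b_10_b_20, per src/test/regress/sql/update.sql.
insert into part_b_10_b_20 values ('b', 10);

-- Decrementing b moves the row into part_b_1_b_10: internally a DELETE
-- from the old partition followed by an INSERT into the new one.  The
-- RETURNING projection must therefore exist for the destination leaf
-- partition, which ExecInitPartitionReturningProjection() now builds
-- on demand when row movement is first required.
update range_parted set b = b - 1 where b = 10 returning a, b;
```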

On 23 March 2017 at 07:04, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

Hi Amit,

Thanks for the updated patch.

On 2017/03/23 3:09, Amit Khandekar wrote:

Attached is v2 patch which implements the above optimization.

Would it be better to have at least some new tests? Also, there are a few
places in the documentation mentioning that such updates cause error,
which will need to be updated. Perhaps also add some explanatory notes
about the mechanism (delete+insert), trigger behavior, caveats, etc.
There were some points discussed upthread that could be mentioned in the
documentation.

Yeah, agreed. Will do this in the subsequent patch.

@@ -633,6 +634,9 @@ ExecDelete(ItemPointer tupleid,
HeapUpdateFailureData hufd;
TupleTableSlot *slot = NULL;

+    if (already_deleted)
+        *already_deleted = false;
+

concurrently_deleted?

Done.

@@ -962,7 +969,7 @@ ExecUpdate(ItemPointer tupleid,
}
else
{
-        LockTupleMode lockmode;
+        LockTupleMode   lockmode;

Useless hunk.

Removed.

I am yet to handle your other comments , still working on them, but
till then , attached is the updated patch.

Attachments:

update-partition-key_v3.patch (application/octet-stream)
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index c28cf9c..2878ff3 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1778,7 +1778,7 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
  *
  * Note: This is called *iff* resultRelInfo is the main target table.
  */
-static bool
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 29c6a6e..287af13 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -62,7 +62,7 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 EState *estate,
 					 bool canSetTag,
 					 TupleTableSlot **returning);
-
+static void ExecInitPartitionReturningProjection(ModifyTableState *mtstate, Relation root_rel);
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
  * target relation's rowtype
@@ -625,6 +625,7 @@ ExecDelete(ItemPointer tupleid,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool   *concurrently_deleted,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -633,6 +634,9 @@ ExecDelete(ItemPointer tupleid,
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
 
+	if (concurrently_deleted)
+		*concurrently_deleted = false;
+
 	/*
 	 * get information on the (current) result relation
 	 */
@@ -776,6 +780,8 @@ ldelete:;
 					}
 				}
 				/* tuple already deleted; nothing to do */
+				if (concurrently_deleted)
+					*concurrently_deleted = true;
 				return NULL;
 
 			default:
@@ -878,7 +884,8 @@ ldelete:;
  * ----------------------------------------------------------------
  */
 static TupleTableSlot *
-ExecUpdate(ItemPointer tupleid,
+ExecUpdate(ModifyTableState *mtstate,
+		   ItemPointer tupleid,
 		   HeapTuple oldtuple,
 		   TupleTableSlot *slot,
 		   TupleTableSlot *planSlot,
@@ -987,6 +994,69 @@ lreplace:;
 			ExecWithCheckOptions(WCO_RLS_UPDATE_CHECK,
 								 resultRelInfo, slot, estate);
 
+		if (resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate))
+		{
+			bool	is_partitioned_table = true;
+
+			if (!mtstate->mt_partition_dispatch_info)
+			{
+				ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+				Relation root_rel;
+
+				/* root table RT index is at the head of partitioned_rels */
+				if (node->partitioned_rels)
+				{
+					Index	root_rti;
+					Oid		root_oid;
+
+					root_rti = linitial_int(node->partitioned_rels);
+					root_oid = getrelid(root_rti, estate->es_range_table);
+					root_rel = heap_open(root_oid, NoLock);	/* locked by InitPlan */
+				}
+				else
+					root_rel = mtstate->resultRelInfo->ri_RelationDesc;
+
+				is_partitioned_table =
+					root_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;
+
+				if (is_partitioned_table)
+					ExecSetupPartitionTupleRouting(
+										root_rel,
+										&mtstate->mt_partition_dispatch_info,
+										&mtstate->mt_partitions,
+										&mtstate->mt_partition_tupconv_maps,
+										&mtstate->mt_partition_tuple_slot,
+										&mtstate->mt_num_dispatch,
+										&mtstate->mt_num_partitions);
+
+				/* Build a projection for each leaf partition rel. */
+				ExecInitPartitionReturningProjection(mtstate, root_rel);
+			}
+
+			/*
+			 * If it's not a partitioned table after all, let it fall
+			 * through to the usual error handling.
+			 */
+			if (is_partitioned_table)
+			{
+				bool	concurrently_deleted;
+
+				ExecDelete(tupleid, oldtuple, planSlot, epqstate, estate,
+						   &concurrently_deleted, canSetTag);
+
+				if (concurrently_deleted)
+					return NULL;
+
+				/*
+				 * Don't bump estate->es_processed again; ExecDelete() has
+				 * already done it above.  So pass canSetTag=false.
+				 */
+				return ExecInsert(mtstate, slot, planSlot, NULL,
+									  ONCONFLICT_NONE, estate, false);
+			}
+		}
+
 		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
@@ -1313,7 +1383,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 	 */
 
 	/* Execute UPDATE with projection */
-	*returning = ExecUpdate(&tuple.t_self, NULL,
+	*returning = ExecUpdate(mtstate, &tuple.t_self, NULL,
 							mtstate->mt_conflproj, planSlot,
 							&mtstate->mt_epqstate, mtstate->ps.state,
 							canSetTag);
@@ -1583,12 +1653,12 @@ ExecModifyTable(ModifyTableState *node)
 								  estate, node->canSetTag);
 				break;
 			case CMD_UPDATE:
-				slot = ExecUpdate(tupleid, oldtuple, slot, planSlot,
+				slot = ExecUpdate(node, tupleid, oldtuple, slot, planSlot,
 								&node->mt_epqstate, estate, node->canSetTag);
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(tupleid, oldtuple, planSlot,
-								&node->mt_epqstate, estate, node->canSetTag);
+								&node->mt_epqstate, estate, NULL, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1837,7 +1907,6 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -1872,30 +1941,10 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		}
 
 		/*
-		 * Build a projection for each leaf partition rel.  Note that we
-		 * didn't build the returningList for each partition within the
-		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * Build a projection for each leaf partition rel. This only occurs for
+		 * the INSERT case; UPDATE/DELETE are handled above.
 		 */
-		resultRelInfo = mtstate->mt_partitions;
-		returningList = linitial(node->returningLists);
-		for (i = 0; i < mtstate->mt_num_partitions; i++)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-			List	   *rlist,
-					   *rliststate;
-
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel);
-			rliststate = (List *) ExecInitExpr((Expr *) rlist, &mtstate->ps);
-			resultRelInfo->ri_projectReturning =
-				ExecBuildProjectionInfo(rliststate, econtext, slot,
-									 resultRelInfo->ri_RelationDesc->rd_att);
-			resultRelInfo++;
-		}
+		ExecInitPartitionReturningProjection(mtstate, rel);
 	}
 	else
 	{
@@ -2124,6 +2173,56 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 }
 
 /* ----------------------------------------------------------------
+ *		ExecInitPartitionReturningProjection
+ *
+ * Initialize stuff required to handle RETURNING for leaf partitions.
+ * We don't build the returningList for each partition within the planner, but
+ * simple translation of the varattnos for each partition suffices.  This
+ * actually is helpful only for INSERT case; UPDATE/DELETE are handled
+ * differently.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecInitPartitionReturningProjection(ModifyTableState *mtstate, Relation root_rel)
+{
+	ResultRelInfo  *resultRelInfo = mtstate->mt_partitions;
+	ModifyTable	   *node = (ModifyTable *) mtstate->ps.plan;
+	TupleTableSlot *returning_slot = mtstate->ps.ps_ResultTupleSlot;
+	List		   *returningList;
+	int				i;
+
+	/*
+	 * If there is no returning clause, or if we have already initialized the
+	 * returning projection info, there is nothing to be done.
+	 */
+	if (node->returningLists == NIL ||
+		(resultRelInfo && resultRelInfo->ri_projectReturning != NULL) ||
+		mtstate->mt_num_partitions == 0)
+		return;
+
+	returningList = linitial(node->returningLists);
+	for (i = 0; i < mtstate->mt_num_partitions; i++)
+	{
+		Relation	partrel = resultRelInfo->ri_RelationDesc;
+		List	   *rlist,
+				   *rliststate;
+
+		/* varno = node->nominalRelation */
+		rlist = map_partition_varattnos(returningList,
+										node->nominalRelation,
+										partrel, root_rel);
+		rliststate = (List *) ExecInitExpr((Expr *) rlist, &mtstate->ps);
+		resultRelInfo->ri_projectReturning =
+			ExecBuildProjectionInfo(rliststate,
+									mtstate->ps.ps_ExprContext,
+									returning_slot,
+									resultRelInfo->ri_RelationDesc->rd_att);
+		resultRelInfo++;
+	}
+}
+
+
+/* ----------------------------------------------------------------
  *		ExecEndModifyTable
  *
  *		Shuts down the plan.
@@ -2154,10 +2253,19 @@ ExecEndModifyTable(ModifyTableState *node)
 	 * Close all the partitioned tables, leaf partitions, and their indices
 	 *
 	 * Remember node->mt_partition_dispatch_info[0] corresponds to the root
-	 * partitioned table, which we must not try to close, because it is the
-	 * main target table of the query that will be closed by ExecEndPlan().
-	 * Also, tupslot is NULL for the root partitioned table.
+	 * partitioned table, which should not be closed if it is the main target
+	 * table of the query, which will be closed by ExecEndPlan(). Also, tupslot
+	 * is NULL for the root partitioned table.
 	 */
+	if (node->mt_num_dispatch > 0)
+	{
+		Relation	root_partition;
+
+		root_partition = node->mt_partition_dispatch_info[0]->reldesc;
+		if (root_partition != node->resultRelInfo->ri_RelationDesc)
+			heap_close(root_partition, NoLock);
+	}
+
 	for (i = 1; i < node->mt_num_dispatch; i++)
 	{
 		PartitionDispatch pd = node->mt_partition_dispatch_info[i];
@@ -2165,6 +2273,7 @@ ExecEndModifyTable(ModifyTableState *node)
 		heap_close(pd->reldesc, NoLock);
 		ExecDropSingleTupleTableSlot(pd->tupslot);
 	}
+
 	for (i = 0; i < node->mt_num_partitions; i++)
 	{
 		ResultRelInfo *resultRelInfo = node->mt_partitions + i;
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index a5c75e7..1fc7cb2 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -225,6 +225,9 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate);
 
 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
 extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 9366f04..99c8046 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -209,13 +209,12 @@ create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to
 create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
 insert into part_a_1_a_10 values ('a', 1);
 insert into part_b_10_b_20 values ('b', 10);
--- fail
+-- fail (row movement happens only within the partition subtree)
 update part_a_1_a_10 set a = 'b' where a = 'a';
 ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
 DETAIL:  Failing row contains (b, 1).
+-- ok (row movement)
 update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
 -- ok
 update range_parted set b = b + 1 where b = 10;
 -- cleanup
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 6637119..7667793 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -119,8 +119,9 @@ create table part_b_10_b_20 partition of range_parted for values from ('b', 10)
 insert into part_a_1_a_10 values ('a', 1);
 insert into part_b_10_b_20 values ('b', 10);
 
--- fail
+-- fail (row movement happens only within the partition subtree)
 update part_a_1_a_10 set a = 'b' where a = 'a';
+-- ok (row movement)
 update range_parted set b = b - 1 where b = 10;
 -- ok
 update range_parted set b = b + 1 where b = 10;
#35Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#34)
1 attachment(s)
Re: UPDATE of partition key

On 25 March 2017 at 01:34, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
I am yet to handle all of your comments, but meanwhile, attached is

an updated patch that handles RETURNING.

Earlier it was not working because ExecInsert() did not return any
RETURNING result. This is because the setup needed to create RETURNING
projection info for leaf partitions is done in ExecInitModifyTable()
only in case of INSERT. But because it is an UPDATE operation, we have
to do this explicitly as a one-time operation when it is determined
that row-movement is required. This is similar to how we do one-time
setup of mt_partition_dispatch_info. So in the patch, I have moved
this code into a new function ExecInitPartitionReturningProjection(),
and now this is called in ExecInitModifyTable() as well as during row
movement for ExecInsert() processing the returning clause.

Basically we need to do all that is done in ExecInitModifyTable() for
INSERT. There are a couple of other things that I suspect might
need to be done as part of the missing initialization for ExecInsert()
during row-movement :
1. Junk filter handling
2. WITH CHECK OPTION

Attached is another updated patch v4 which does the WITH-CHECK-OPTION
related initialization.

So we now have below two function calls during row movement :
/* Build WITH CHECK OPTION constraints for leaf partitions */
ExecInitPartitionWithCheckOptions(mtstate, root_rel);

/* Build a projection for each leaf partition rel. */
ExecInitPartitionReturningProjection(mtstate, root_rel);

And these functions are now re-used in two places: in
ExecInitModifyTable() and in the row-movement code.
Basically, whatever was not being initialized in ExecInitModifyTable()
is now done in the row-movement code.

I have added relevant scenarios in sql/update.sql.

I checked the junk filter handling. I think there isn't anything that
needs to be done, because for INSERT, all that is needed is
ExecCheckPlanOutput(). And this function is anyway called in
ExecInitModifyTable() even for UPDATE, so we don't have to initialize
anything additional.

Yet, ExecDelete() during row-movement is still returning the RETURNING
result redundantly, which I am yet to handle.

Done above. Now we have a new parameter in ExecDelete() which tells
whether to skip RETURNING.

On 23 March 2017 at 07:04, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

Would it be better to have at least some new tests?

Added some more scenarios in update.sql. Also have included scenarios
for WITH-CHECK-OPTION for updatable views.
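As a hedged sketch of that kind of scenario (the view name and predicate
here are illustrative and may not match what was actually added to
update.sql):

```sql
-- A view over the partitioned range_parted table, with a barrier that
-- every new or updated row must satisfy.
create view range_parted_v as
    select * from range_parted where b < 15
    with check option;

-- ok: the updated row still satisfies b < 15
update range_parted_v set b = b - 1 where b = 10;

-- fail: moving the row from part_b_1_b_10 into part_b_10_b_20 would
-- produce a tuple violating the WITH CHECK OPTION, so the WCO
-- constraints must also be initialized for the destination leaf
-- partition (what ExecInitPartitionWithCheckOptions sets up).
update range_parted_v set b = b + 10 where b = 9;
```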

Also, there are a few places in the documentation mentioning that such updates cause error,
which will need to be updated. Perhaps also add some explanatory notes
about the mechanism (delete+insert), trigger behavior, caveats, etc.
There were some points discussed upthread that could be mentioned in the
documentation.

Yeah, I agree. Documentation needs some important changes. I am still
working on them.

+            if (!mtstate->mt_partition_dispatch_info)
+            {

The if (pointer == NULL) style is better perhaps.

+                /* root table RT index is at the head of partitioned_rels */
+                if (node->partitioned_rels)
+                {
+                    Index   root_rti;
+                    Oid     root_oid;
+
+                    root_rti = linitial_int(node->partitioned_rels);
+                    root_oid = getrelid(root_rti, estate->es_range_table);
+                    root_rel = heap_open(root_oid, NoLock); /* locked by
InitPlan */
+                }
+                else
+                    root_rel = mtstate->resultRelInfo->ri_RelationDesc;

Some explanatory comments here might be good, for example, explain in what
situations node->partitioned_rels would not have been set and/or vice versa.

Added some more comments in the relevant if conditions.

Now, for
UPDATE, ExecSetupPartitionTupleRouting() will be called only if row
movement is needed.

We have to open an extra relation for the root partition, and keep it
opened and its handle stored in
mt_partition_dispatch_info[0]->reldesc. So ExecEndModifyTable() closes
this if it is different from node->resultRelInfo->ri_RelationDesc. If
it is same as node->resultRelInfo, it should not be closed because it
gets closed as part of ExecEndPlan().

I guess you're referring to the following hunk. Some comments:

@@ -2154,10 +2221,19 @@ ExecEndModifyTable(ModifyTableState *node)
* Close all the partitioned tables, leaf partitions, and their indices
*
* Remember node->mt_partition_dispatch_info[0] corresponds to the root
-     * partitioned table, which we must not try to close, because it is the
-     * main target table of the query that will be closed by ExecEndPlan().
-     * Also, tupslot is NULL for the root partitioned table.
+     * partitioned table, which should not be closed if it is the main target
+     * table of the query, which will be closed by ExecEndPlan().

The last part could be written as: because it will be closed by ExecEndPlan().

Actually, I later realized that the relation is not required to be kept
open until ExecEndModifyTable(). So I reverted the above changes. Now
it is immediately closed once all the row-movement-related setup is
done.

Also, tupslot
+     * is NULL for the root partitioned table.
*/
+    if (node->mt_num_dispatch > 0)
+    {
+        Relation    root_partition;

root_relation?

+
+        root_partition = node->mt_partition_dispatch_info[0]->reldesc;
+        if (root_partition != node->resultRelInfo->ri_RelationDesc)
+            heap_close(root_partition, NoLock);
+    }

It might be a good idea to Assert inside the if block above that
node->operation != CMD_INSERT. Perhaps, also reflect that in the comment
above so that it's clearer.

This does not apply now since I reverted as mentioned above.

Looking at that, I found that in the current patch, if no row movement
happens, ExecPartitionCheck() effectively gets called twice: first when
it is explicitly called for the row-movement-required check, and a
second time from within ExecConstraints(). Maybe there should be two
separate functions, ExecCheckConstraints() and
ExecPartitionConstraints(), plus an ExecConstraints() that just calls
both. That way we can call the appropriate function in the row-movement
case, while the other callers continue to call ExecConstraints().

One random idea: we could add a bool ri_PartitionCheckOK which is set to
true after it is checked in ExecConstraints(). And modify the condition
in ExecConstraints() as follows:

if (resultRelInfo->ri_PartitionCheck &&
+ !resultRelInfo->ri_PartitionCheckOK &&
!ExecPartitionCheck(resultRelInfo, slot, estate))

I have taken out the part of ExecConstraints() that forms and emits
the partition constraint error message, and put it in a new function
ExecPartitionCheckEmitError(); this is called in ExecConstraints()
as well as in ExecUpdate() when it finds that it is not a partitioned
table. This happens when the UPDATE has been run on a leaf partition,
and when ExecPartitionCheck() fails for the leaf partition. Here, we
just need to emit the same error message that ExecConstraints() emits.

Attachments:

update-partition-key_v4.patch (application/octet-stream)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index ab59be8..8f172b3 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2650,7 +2650,7 @@ CopyFrom(CopyState cstate)
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr ||
 					resultRelInfo->ri_PartitionCheck)
-					ExecConstraints(resultRelInfo, slot, oldslot, estate);
+					ExecConstraints(resultRelInfo, slot, oldslot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index f2995f2..2912054 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1778,7 +1778,7 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
  *
  * Note: This is called *iff* resultRelInfo is the main target table.
  */
-static bool
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
@@ -1815,8 +1815,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if requested,
+ * checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing,
@@ -1826,7 +1826,7 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
 				TupleTableSlot *slot, TupleTableSlot *orig_slot,
-				EState *estate)
+				EState *estate, bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -1913,33 +1913,51 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck &&
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
 		!ExecPartitionCheck(resultRelInfo, slot, estate))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+		ExecPartitionCheckEmitError(resultRelInfo, orig_slot, estate);
+}
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
-		{
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-		}
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ *
+ * 'orig_slot' contains the original tuple to be shown in the error message.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *orig_slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
 
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 orig_slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-		  errmsg("new row for relation \"%s\" violates partition constraint",
-				 RelationGetRelationName(orig_rel)),
-			val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
+
+	/* See the comments in ExecConstraints. */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 orig_slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+	  errmsg("new row for relation \"%s\" violates partition constraint",
+			 RelationGetRelationName(orig_rel)),
+		val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index f20d728..2f76140 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -389,7 +389,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, slot, estate);
+			ExecConstraints(resultRelInfo, slot, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -448,7 +448,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, slot, estate);
+			ExecConstraints(resultRelInfo, slot, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 0b524e0..ff20a18 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -62,7 +62,10 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 EState *estate,
 					 bool canSetTag,
 					 TupleTableSlot **returning);
-
+static void ExecInitPartitionWithCheckOptions(ModifyTableState *mtstate,
+											  Relation root_rel);
+static void ExecInitPartitionReturningProjection(ModifyTableState *mtstate,
+												 Relation root_rel);
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
  * target relation's rowtype
@@ -435,7 +438,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * Check the constraints of the tuple
 		 */
 		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, oldslot, estate);
+			ExecConstraints(resultRelInfo, slot, oldslot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -625,6 +628,8 @@ ExecDelete(ItemPointer tupleid,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool   *concurrently_deleted,
+		   bool returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -633,6 +638,9 @@ ExecDelete(ItemPointer tupleid,
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
 
+	if (concurrently_deleted)
+		*concurrently_deleted = false;
+
 	/*
 	 * get information on the (current) result relation
 	 */
@@ -776,6 +784,8 @@ ldelete:;
 					}
 				}
 				/* tuple already deleted; nothing to do */
+				if (concurrently_deleted)
+					*concurrently_deleted = true;
 				return NULL;
 
 			default:
@@ -799,8 +809,8 @@ ldelete:;
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -878,7 +888,8 @@ ldelete:;
  * ----------------------------------------------------------------
  */
 static TupleTableSlot *
-ExecUpdate(ItemPointer tupleid,
+ExecUpdate(ModifyTableState *mtstate,
+		   ItemPointer tupleid,
 		   HeapTuple oldtuple,
 		   TupleTableSlot *slot,
 		   TupleTableSlot *planSlot,
@@ -987,13 +998,86 @@ lreplace:;
 			ExecWithCheckOptions(WCO_RLS_UPDATE_CHECK,
 								 resultRelInfo, slot, estate);
 
+		if (resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate))
+		{
+			bool	is_partitioned_table = true;
+
+			if (mtstate->mt_partition_dispatch_info == NULL)
+			{
+				ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+				Relation root_rel;
+
+				/*
+				 * If this is a partitioned table, we need to open the root
+				 * table, whose RT index is at the head of partitioned_rels
+				 */
+				if (node->partitioned_rels)
+				{
+					Index	root_rti;
+					Oid		root_oid;
+
+					root_rti = linitial_int(node->partitioned_rels);
+					root_oid = getrelid(root_rti, estate->es_range_table);
+					root_rel = heap_open(root_oid, NoLock);	/* locked by InitPlan */
+				}
+				else /* this may be a leaf partition */
+					root_rel = mtstate->resultRelInfo->ri_RelationDesc;
+
+				is_partitioned_table =
+					root_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;
+
+				if (is_partitioned_table)
+					ExecSetupPartitionTupleRouting(
+										root_rel,
+										&mtstate->mt_partition_dispatch_info,
+										&mtstate->mt_partitions,
+										&mtstate->mt_partition_tupconv_maps,
+										&mtstate->mt_partition_tuple_slot,
+										&mtstate->mt_num_dispatch,
+										&mtstate->mt_num_partitions);
+
+				/* Build WITH CHECK OPTION constraints for leaf partitions */
+				ExecInitPartitionWithCheckOptions(mtstate, root_rel);
+
+				/* Build a projection for each leaf partition rel. */
+				ExecInitPartitionReturningProjection(mtstate, root_rel);
+
+				/* Close the root partitioned rel if we opened it above. */
+				if (root_rel != mtstate->resultRelInfo->ri_RelationDesc)
+					heap_close(root_rel, NoLock);
+			}
+
+			if (is_partitioned_table)
+			{
+				bool	concurrently_deleted;
+
+				/*
+				 * Skip RETURNING processing for DELETE. We want to return rows
+				 * from INSERT.
+				 */
+				ExecDelete(tupleid, oldtuple, planSlot, epqstate, estate,
+						   &concurrently_deleted, false, false);
+
+				if (concurrently_deleted)
+					return NULL;
+
+				return ExecInsert(mtstate, slot, planSlot, NULL,
+									  ONCONFLICT_NONE, estate, canSetTag);
+			}
+
+			/* It's not a partitioned table after all; error out. */
+			ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+		}
+
 		/*
-		 * Check the constraints of the tuple.  Note that we pass the same
+		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We have already run partition constraints above, so skip them below.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1313,7 +1397,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 	 */
 
 	/* Execute UPDATE with projection */
-	*returning = ExecUpdate(&tuple.t_self, NULL,
+	*returning = ExecUpdate(mtstate, &tuple.t_self, NULL,
 							mtstate->mt_conflproj, planSlot,
 							&mtstate->mt_epqstate, mtstate->ps.state,
 							canSetTag);
@@ -1583,12 +1667,13 @@ ExecModifyTable(ModifyTableState *node)
 								  estate, node->canSetTag);
 				break;
 			case CMD_UPDATE:
-				slot = ExecUpdate(tupleid, oldtuple, slot, planSlot,
+				slot = ExecUpdate(node, tupleid, oldtuple, slot, planSlot,
 								&node->mt_epqstate, estate, node->canSetTag);
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(tupleid, oldtuple, planSlot,
-								&node->mt_epqstate, estate, node->canSetTag);
+								&node->mt_epqstate, estate,
+								NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1790,44 +1875,10 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	}
 
 	/*
-	 * Build WITH CHECK OPTION constraints for each leaf partition rel.
-	 * Note that we didn't build the withCheckOptionList for each partition
-	 * within the planner, but simple translation of the varattnos for each
-	 * partition will suffice.  This only occurs for the INSERT case;
-	 * UPDATE/DELETE cases are handled above.
+	 * Build WITH CHECK OPTION constraints for each leaf partition rel. This
+	 * only occurs for the INSERT case; UPDATE/DELETE cases are handled above.
 	 */
-	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
-	{
-		List		*wcoList;
-
-		Assert(operation == CMD_INSERT);
-		resultRelInfo = mtstate->mt_partitions;
-		wcoList = linitial(node->withCheckOptionLists);
-		for (i = 0; i < mtstate->mt_num_partitions; i++)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-			List	   *mapped_wcoList;
-			List	   *wcoExprs = NIL;
-			ListCell   *ll;
-
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel);
-			foreach(ll, mapped_wcoList)
-			{
-				WithCheckOption *wco = (WithCheckOption *) lfirst(ll);
-				ExprState  *wcoExpr = ExecInitQual((List *) wco->qual,
-												   mtstate->mt_plans[i]);
-
-				wcoExprs = lappend(wcoExprs, wcoExpr);
-			}
-
-			resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
-			resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
-			resultRelInfo++;
-		}
-	}
+	ExecInitPartitionWithCheckOptions(mtstate, rel);
 
 	/*
 	 * Initialize RETURNING projections if needed.
@@ -1836,7 +1887,6 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -1870,28 +1920,10 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		}
 
 		/*
-		 * Build a projection for each leaf partition rel.  Note that we
-		 * didn't build the returningList for each partition within the
-		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * Build a projection for each leaf partition rel. This only occurs for
+		 * the INSERT case; UPDATE/DELETE are handled above.
 		 */
-		resultRelInfo = mtstate->mt_partitions;
-		returningList = linitial(node->returningLists);
-		for (i = 0; i < mtstate->mt_num_partitions; i++)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-			List	   *rlist;
-
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel);
-			resultRelInfo->ri_projectReturning =
-				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
-									 resultRelInfo->ri_RelationDesc->rd_att);
-			resultRelInfo++;
-		}
+		ExecInitPartitionReturningProjection(mtstate, rel);
 	}
 	else
 	{
@@ -2118,6 +2150,104 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 }
 
 /* ----------------------------------------------------------------
+ *		ExecInitPartitionWithCheckOptions
+ *
+ * Build WITH CHECK OPTION constraints for each leaf partition rel.
+ * Note that we don't build the withCheckOptionList for each partition
+ * within the planner, but simple translation of the varattnos for each
+ * partition suffices. This only occurs for the INSERT case; UPDATE/DELETE
+ * cases are handled separately.
+ * ----------------------------------------------------------------
+ */
+
+static void
+ExecInitPartitionWithCheckOptions(ModifyTableState *mtstate, Relation root_rel)
+{
+	ResultRelInfo  *resultRelInfo = mtstate->mt_partitions;
+	ModifyTable	   *node = (ModifyTable *) mtstate->ps.plan;
+	List		*wcoList;
+	int			i;
+
+	if (node->withCheckOptionLists == NIL || mtstate->mt_num_partitions == 0)
+		return;
+
+	wcoList = linitial(node->withCheckOptionLists);
+	for (i = 0; i < mtstate->mt_num_partitions; i++)
+	{
+		Relation	partrel = resultRelInfo->ri_RelationDesc;
+		List	   *mapped_wcoList;
+		List	   *wcoExprs = NIL;
+		ListCell   *ll;
+
+		/* varno = node->nominalRelation */
+		mapped_wcoList = map_partition_varattnos(wcoList,
+												 node->nominalRelation,
+												 partrel, root_rel);
+		foreach(ll, mapped_wcoList)
+		{
+			WithCheckOption *wco = (WithCheckOption *) lfirst(ll);
+			ExprState  *wcoExpr = ExecInitQual((List *) wco->qual,
+										   mtstate->mt_plans[i]);
+
+			wcoExprs = lappend(wcoExprs, wcoExpr);
+		}
+
+		resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
+		resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
+		resultRelInfo++;
+	}
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitPartitionReturningProjection
+ *
+ * Initialize stuff required to handle RETURNING for leaf partitions.
+ * We don't build the returningList for each partition within the planner, but
+ * simple translation of the varattnos for each partition suffices.  This
+ * actually is helpful only for INSERT case; UPDATE/DELETE are handled
+ * differently.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecInitPartitionReturningProjection(ModifyTableState *mtstate, Relation root_rel)
+{
+	ResultRelInfo  *resultRelInfo = mtstate->mt_partitions;
+	ModifyTable	   *node = (ModifyTable *) mtstate->ps.plan;
+	TupleTableSlot *returning_slot = mtstate->ps.ps_ResultTupleSlot;
+	List		   *returningList;
+	int				i;
+
+	/*
+	 * If there is no returning clause, or if we have already initialized the
+	 * returning projection info, there is nothing to be done.
+	 */
+	if (node->returningLists == NIL ||
+		(resultRelInfo && resultRelInfo->ri_projectReturning != NULL) ||
+		mtstate->mt_num_partitions == 0)
+		return;
+
+	returningList = linitial(node->returningLists);
+	for (i = 0; i < mtstate->mt_num_partitions; i++)
+	{
+		Relation	partrel = resultRelInfo->ri_RelationDesc;
+		List	   *rlist;
+
+		/* varno = node->nominalRelation */
+		rlist = map_partition_varattnos(returningList,
+										node->nominalRelation,
+										partrel, root_rel);
+		resultRelInfo->ri_projectReturning =
+			ExecBuildProjectionInfo(rlist,
+									mtstate->ps.ps_ExprContext,
+									returning_slot,
+									&mtstate->ps,
+									resultRelInfo->ri_RelationDesc->rd_att);
+		resultRelInfo++;
+	}
+}
+
+
+/* ----------------------------------------------------------------
  *		ExecEndModifyTable
  *
  *		Shuts down the plan.
@@ -2159,6 +2289,7 @@ ExecEndModifyTable(ModifyTableState *node)
 		heap_close(pd->reldesc, NoLock);
 		ExecDropSingleTupleTableSlot(pd->tupslot);
 	}
+
 	for (i = 0; i < node->mt_num_partitions; i++)
 	{
 		ResultRelInfo *resultRelInfo = node->mt_partitions + i;
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index d3849b9..102fc97 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,7 +187,9 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
 				TupleTableSlot *slot, TupleTableSlot *orig_slot,
-				EState *estate);
+				EState *estate, bool check_partition_constraint);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *orig_slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
@@ -216,6 +218,9 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate);
 
 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
 extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 9366f04..a56afab 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,25 +198,121 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c int
 ) partition by range (a, b);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE c > 120 WITH CHECK OPTION;
 create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
 create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
+create table part_c_1_100 partition of part_b_10_b_20 for values from (1) to (100);
+create table part_c_100_200 partition of part_b_10_b_20 for values from (100) to (200);
 insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+insert into part_a_10_a_20 values ('a', 10, 200);
+insert into part_c_1_100 values ('b', 12, 96);
+insert into part_c_1_100 values ('b', 13, 97);
+insert into part_c_100_200 values ('b', 15, 105);
+insert into part_c_100_200 values ('b', 17, 105);
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96; -- No row found :
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (b, 12, 116).
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 ;
+select * from part_c_1_100 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select * from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 12 | 116
+ b | 13 | 117
+ b | 15 | 125
+ b | 17 | 125
+(4 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (b, 7, 117).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+select * from part_a_1_a_10 order by 1, 2, 3;
+ a | b |  c  
+---+---+-----
+ a | 1 |    
+ a | 4 | 200
+(2 rows)
+
+select * from part_a_10_a_20 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select * from part_b_1_b_10 order by 1, 2, 3;
+ a | b |  c  
+---+---+-----
+ b | 7 | 117
+ b | 9 | 125
+(2 rows)
+
+select * from part_c_1_100 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select * from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 11 | 125
+ b | 12 | 116
+(2 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120).
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+select * from part_a_1_a_10 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+ a | 1 |  
+(1 row)
+
+select * from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 11 | 125
+ b | 12 | 116
+ b | 15 | 199
+(3 rows)
+
 -- cleanup
+drop view upview;
 drop table range_parted;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 6637119..cda9906 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,23 +107,61 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c int
 ) partition by range (a, b);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE c > 120 WITH CHECK OPTION;
+
 create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
 create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
+create table part_c_1_100 partition of part_b_10_b_20 for values from (1) to (100);
+create table part_c_100_200 partition of part_b_10_b_20 for values from (100) to (200);
 insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
-
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
-
+insert into part_a_10_a_20 values ('a', 10, 200);
+insert into part_c_1_100 values ('b', 12, 96);
+insert into part_c_1_100 values ('b', 13, 97);
+insert into part_c_100_200 values ('b', 15, 105);
+insert into part_c_100_200 values ('b', 17, 105);
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96; -- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 ;
+select * from part_c_1_100 order by 1, 2, 3;
+select * from part_c_100_200 order by 1, 2, 3;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+select * from part_a_1_a_10 order by 1, 2, 3;
+select * from part_a_10_a_20 order by 1, 2, 3;
+select * from part_b_1_b_10 order by 1, 2, 3;
+select * from part_c_1_100 order by 1, 2, 3;
+select * from part_c_100_200 order by 1, 2, 3;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+select * from part_a_1_a_10 order by 1, 2, 3;
+select * from part_c_100_200 order by 1, 2, 3;
 -- cleanup
+drop view upview;
 drop table range_parted;
#36Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#35)
1 attachment(s)
Re: UPDATE of partition key

On 27 March 2017 at 13:05, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

>> Also, there are a few places in the documentation mentioning that such updates cause error,
>> which will need to be updated. Perhaps also add some explanatory notes
>> about the mechanism (delete+insert), trigger behavior, caveats, etc.
>> There were some points discussed upthread that could be mentioned in the
>> documentation.
> Yeah, I agree. Documentation needs some important changes. I am still
> working on them.

Attached patch v5 adds the required doc changes described above.

In section 5.11 "Partitioning" => subsection 5 "Caveats", I have
removed the caveat about not being able to update the partition key. It
is replaced by a new caveat: an UPDATE/DELETE operation can silently
miss a row when a concurrent UPDATE of the partition key moves that row
to another partition.

UPDATE row movement behaviour is described in:
Part VI "Reference" => SQL Commands => UPDATE

On 4 March 2017 at 12:49, Robert Haas <robertmhaas@gmail.com> wrote:

> How about running the BR update triggers for the old partition and the
> AR update triggers for the new partition? It seems weird to run BR
> update triggers but not AR update triggers. Another option would be
> to run BR and AR delete triggers and then BR and AR insert triggers,
> emphasizing the choice to treat this update as a delete + insert, but
> (as Amit Kh. pointed out to me when we were in a room together this
> week) that precludes using the BEFORE trigger to modify the row.

I checked the trigger behaviour in the case of UPSERT. Here, when a
conflict is found, ExecOnConflictUpdate() is called, and then the
function returns immediately, which means the AR INSERT trigger will not
fire. ExecOnConflictUpdate() calls ExecUpdate(), which means the BR
and AR UPDATE triggers will be fired. So in short, when an INSERT
becomes an UPDATE, the BR INSERT triggers do fire, and then the BR
UPDATE and AR UPDATE triggers also fire. Along the same lines, it makes
sense for an UPDATE=>DELETE+INSERT operation to fire a BR UPDATE trigger
on the original table, and then the BR and AR DELETE/INSERT triggers on
the respective tables.

So the common policy can be:
Fire the BR trigger for the original statement, whether it is an
INSERT, UPDATE or DELETE.
If the operation is then converted (UPDATE->DELETE+INSERT or
INSERT->UPDATE), fire all the respective triggers of the operations it
is converted to.

The current patch already behaves as per the above policy. So I have
included a description of this trigger behaviour in the
"Overview of Trigger Behavior" section of the docs, modelled on the way
the trigger behaviour for UPSERT is described in the preceding section.
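The trigger-firing policy above amounts to a small decision rule. A
minimal sketch in Python — illustrative only; it models the policy as
stated in this mail, not the actual trigger machinery:

```python
# Which row-level triggers fire, per the policy described above: fire
# the BR trigger of the original statement, and if the operation is
# converted, fire the BR and AR triggers of the resulting operations.

def row_triggers(statement, converted_to=None):
    """Return the ordered list of row-level triggers that fire."""
    fired = ["BR " + statement]        # BR trigger of the original statement
    if converted_to is None:
        fired.append("AR " + statement)
    else:
        for op in converted_to:        # e.g. UPDATE -> DELETE+INSERT
            fired.append("BR " + op)
            fired.append("AR " + op)
    return fired

# Plain UPDATE, no row movement:
assert row_triggers("UPDATE") == ["BR UPDATE", "AR UPDATE"]

# UPDATE with row movement (converted to DELETE + INSERT); note that no
# AR UPDATE fires, per the patch's behaviour:
assert row_triggers("UPDATE", ["DELETE", "INSERT"]) == \
    ["BR UPDATE", "BR DELETE", "AR DELETE", "BR INSERT", "AR INSERT"]

# UPSERT (INSERT converted to UPDATE), matching the behaviour checked above:
assert row_triggers("INSERT", ["UPDATE"]) == \
    ["BR INSERT", "BR UPDATE", "AR UPDATE"]
```

In the row-movement case the DELETE triggers belong to the source
partition and the INSERT triggers to the destination partition; the
model does not distinguish the tables involved.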

Attachments:

update-partition-key_v5.patch (application/octet-stream)
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index d1e915c11..a3ee3fa 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3845,10 +3845,21 @@ ANALYZE measurement;
    <itemizedlist>
     <listitem>
      <para>
-      An <command>UPDATE</> that causes a row to move from one partition to
-      another fails, because the new value of the row fails to satisfy the
-      implicit partition constraint of the original partition.  This might
-      change in future releases.
+      An <command>UPDATE</> causes a row to move from one partition to another
+      if the new value of the row fails to satisfy the implicit partition
+      constraint of the original partition but there is another partition which
+      can fit this row. During such a row movement, suppose there is another
+      concurrent session for which this row is still visible, and it is about
+      to do an <command>UPDATE</> or <command>DELETE</> operation on the same
+      row. This DML operation can silently miss this row if the row now gets
+      deleted from the partition by the first session as part of its
+      <command>UPDATE</> row movement. In such case, the concurrent
+      <command>UPDATE</>/<command>DELETE</>, being unaware of the row movement,
+      interprets that the row has just been deleted so there is nothing to be
+      done for this row. Whereas, in the usual case where the table is not
+      partitioned, or where there is no row movement, the second session would
+      have identified the newly updated row and carried out the
+      <command>UPDATE</>/<command>DELETE</> on this new row version.
      </para>
     </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 8a1619f..feb1c3e 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,13 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</> and
+   <command>INSERT</> operation.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index 8f724c8..4bb434a 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -151,6 +151,33 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it is possible that all row-level
+    <literal>BEFORE</> <command>UPDATE</command> triggers and all row-level
+    <literal>BEFORE</> <command>DELETE</command>/<command>INSERT</command>
+    triggers are applied on the respective partitions in a way that is
+    reflected in the final state of the moved row. This is because the
+    <command>UPDATE</command> is performed as a <command>DELETE</command> on
+    the original partition followed by an <command>INSERT</command> on the
+    partition to which the row is moved. So a <literal>BEFORE</>
+    <command>UPDATE</command> trigger and a <literal>BEFORE</>
+    <command>DELETE</command> trigger are applied on the original partition,
+    if defined, followed by a <literal>BEFORE</> <command>INSERT</command>
+    trigger on the destination partition, if defined. Surprising outcomes are
+    possible when all these triggers affect the row being moved. As far as
+    <literal>AFTER ROW</> triggers are concerned, <literal>AFTER</>
+    <command>DELETE</command> and <literal>AFTER</> <command>INSERT</command>
+    triggers are applied; but <literal>AFTER</> <command>UPDATE</command>
+    triggers are not applied because the <command>UPDATE</command> is converted
+    to a <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, if row movement happens, there
+    would not be any <command>DELETE</command> or <command>INSERT</command>
+    triggers applied. Only the <command>UPDATE</command> triggers defined on
+    the main target table used in the <command>UPDATE</command> statement will
+    be applied.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index ab59be8..8f172b3 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2650,7 +2650,7 @@ CopyFrom(CopyState cstate)
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr ||
 					resultRelInfo->ri_PartitionCheck)
-					ExecConstraints(resultRelInfo, slot, oldslot, estate);
+					ExecConstraints(resultRelInfo, slot, oldslot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index f2995f2..2912054 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1778,7 +1778,7 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
  *
  * Note: This is called *iff* resultRelInfo is the main target table.
  */
-static bool
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
@@ -1815,8 +1815,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if requested,
+ * checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing,
@@ -1826,7 +1826,7 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
 				TupleTableSlot *slot, TupleTableSlot *orig_slot,
-				EState *estate)
+				EState *estate, bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -1913,33 +1913,51 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck &&
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
 		!ExecPartitionCheck(resultRelInfo, slot, estate))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+		ExecPartitionCheckEmitError(resultRelInfo, orig_slot, estate);
+}
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
-		{
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-		}
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ *
+ * 'orig_slot' contains the original tuple to be shown in the error message.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *orig_slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
 
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 orig_slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-		  errmsg("new row for relation \"%s\" violates partition constraint",
-				 RelationGetRelationName(orig_rel)),
-			val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
+
+	/* See the comments in ExecConstraints. */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 orig_slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+	  errmsg("new row for relation \"%s\" violates partition constraint",
+			 RelationGetRelationName(orig_rel)),
+		val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index f20d728..2f76140 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -389,7 +389,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, slot, estate);
+			ExecConstraints(resultRelInfo, slot, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -448,7 +448,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, slot, estate);
+			ExecConstraints(resultRelInfo, slot, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 0b524e0..ff20a18 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -62,7 +62,10 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 EState *estate,
 					 bool canSetTag,
 					 TupleTableSlot **returning);
-
+static void ExecInitPartitionWithCheckOptions(ModifyTableState *mtstate,
+											  Relation root_rel);
+static void ExecInitPartitionReturningProjection(ModifyTableState *mtstate,
+												 Relation root_rel);
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
  * target relation's rowtype
@@ -435,7 +438,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * Check the constraints of the tuple
 		 */
 		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, oldslot, estate);
+			ExecConstraints(resultRelInfo, slot, oldslot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -625,6 +628,8 @@ ExecDelete(ItemPointer tupleid,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool   *concurrently_deleted,
+		   bool returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -633,6 +638,9 @@ ExecDelete(ItemPointer tupleid,
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
 
+	if (concurrently_deleted)
+		*concurrently_deleted = false;
+
 	/*
 	 * get information on the (current) result relation
 	 */
@@ -776,6 +784,8 @@ ldelete:;
 					}
 				}
 				/* tuple already deleted; nothing to do */
+				if (concurrently_deleted)
+					*concurrently_deleted = true;
 				return NULL;
 
 			default:
@@ -799,8 +809,8 @@ ldelete:;
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -878,7 +888,8 @@ ldelete:;
  * ----------------------------------------------------------------
  */
 static TupleTableSlot *
-ExecUpdate(ItemPointer tupleid,
+ExecUpdate(ModifyTableState *mtstate,
+		   ItemPointer tupleid,
 		   HeapTuple oldtuple,
 		   TupleTableSlot *slot,
 		   TupleTableSlot *planSlot,
@@ -987,13 +998,86 @@ lreplace:;
 			ExecWithCheckOptions(WCO_RLS_UPDATE_CHECK,
 								 resultRelInfo, slot, estate);
 
+		if (resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate))
+		{
+			bool	is_partitioned_table = true;
+
+			if (mtstate->mt_partition_dispatch_info == NULL)
+			{
+				ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+				Relation root_rel;
+
+				/*
+				 * If this is a partitioned table, we need to open the root
+				 * table RT index which is at the head of partitioned_rels
+				 */
+				if (node->partitioned_rels)
+				{
+					Index	root_rti;
+					Oid		root_oid;
+
+					root_rti = linitial_int(node->partitioned_rels);
+					root_oid = getrelid(root_rti, estate->es_range_table);
+					root_rel = heap_open(root_oid, NoLock);	/* locked by InitPlan */
+				}
+				else /* this may be a leaf partition */
+					root_rel = mtstate->resultRelInfo->ri_RelationDesc;
+
+				is_partitioned_table =
+					root_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;
+
+				if (is_partitioned_table)
+					ExecSetupPartitionTupleRouting(
+										root_rel,
+										&mtstate->mt_partition_dispatch_info,
+										&mtstate->mt_partitions,
+										&mtstate->mt_partition_tupconv_maps,
+										&mtstate->mt_partition_tuple_slot,
+										&mtstate->mt_num_dispatch,
+										&mtstate->mt_num_partitions);
+
+				/* Build WITH CHECK OPTION constraints for leaf partitions */
+				ExecInitPartitionWithCheckOptions(mtstate, root_rel);
+
+				/* Build a projection for each leaf partition rel. */
+				ExecInitPartitionReturningProjection(mtstate, root_rel);
+
+				/* Close the root partitioned rel if we opened it above. */
+				if (root_rel != mtstate->resultRelInfo->ri_RelationDesc)
+					heap_close(root_rel, NoLock);
+			}
+
+			if (is_partitioned_table)
+			{
+				bool	concurrently_deleted;
+
+				/*
+				 * Skip RETURNING processing for DELETE. We want to return rows
+				 * from INSERT.
+				 */
+				ExecDelete(tupleid, oldtuple, planSlot, epqstate, estate,
+						   &concurrently_deleted, false, false);
+
+				if (concurrently_deleted)
+					return NULL;
+
+				return ExecInsert(mtstate, slot, planSlot, NULL,
+									  ONCONFLICT_NONE, estate, canSetTag);
+			}
+
+			/* It's not a partitioned table after all; error out. */
+			ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+		}
+
 		/*
-		 * Check the constraints of the tuple.  Note that we pass the same
+		 * Check the constraints of the tuple. Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We have already run partition constraints above, so skip them below.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1313,7 +1397,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 	 */
 
 	/* Execute UPDATE with projection */
-	*returning = ExecUpdate(&tuple.t_self, NULL,
+	*returning = ExecUpdate(mtstate, &tuple.t_self, NULL,
 							mtstate->mt_conflproj, planSlot,
 							&mtstate->mt_epqstate, mtstate->ps.state,
 							canSetTag);
@@ -1583,12 +1667,13 @@ ExecModifyTable(ModifyTableState *node)
 								  estate, node->canSetTag);
 				break;
 			case CMD_UPDATE:
-				slot = ExecUpdate(tupleid, oldtuple, slot, planSlot,
+				slot = ExecUpdate(node, tupleid, oldtuple, slot, planSlot,
 								&node->mt_epqstate, estate, node->canSetTag);
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(tupleid, oldtuple, planSlot,
-								&node->mt_epqstate, estate, node->canSetTag);
+								&node->mt_epqstate, estate,
+								NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1790,44 +1875,10 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	}
 
 	/*
-	 * Build WITH CHECK OPTION constraints for each leaf partition rel.
-	 * Note that we didn't build the withCheckOptionList for each partition
-	 * within the planner, but simple translation of the varattnos for each
-	 * partition will suffice.  This only occurs for the INSERT case;
-	 * UPDATE/DELETE cases are handled above.
+	 * Build WITH CHECK OPTION constraints for each leaf partition rel. This
+	 * only occurs for INSERT case; UPDATE/DELETE are handled above.
 	 */
-	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
-	{
-		List		*wcoList;
-
-		Assert(operation == CMD_INSERT);
-		resultRelInfo = mtstate->mt_partitions;
-		wcoList = linitial(node->withCheckOptionLists);
-		for (i = 0; i < mtstate->mt_num_partitions; i++)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-			List	   *mapped_wcoList;
-			List	   *wcoExprs = NIL;
-			ListCell   *ll;
-
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel);
-			foreach(ll, mapped_wcoList)
-			{
-				WithCheckOption *wco = (WithCheckOption *) lfirst(ll);
-				ExprState  *wcoExpr = ExecInitQual((List *) wco->qual,
-												   mtstate->mt_plans[i]);
-
-				wcoExprs = lappend(wcoExprs, wcoExpr);
-			}
-
-			resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
-			resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
-			resultRelInfo++;
-		}
-	}
+	ExecInitPartitionWithCheckOptions(mtstate, rel);
 
 	/*
 	 * Initialize RETURNING projections if needed.
@@ -1836,7 +1887,6 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -1870,28 +1920,10 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		}
 
 		/*
-		 * Build a projection for each leaf partition rel.  Note that we
-		 * didn't build the returningList for each partition within the
-		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * Build a projection for each leaf partition rel. This only occurs for
+		 * the INSERT case; UPDATE/DELETE are handled above.
 		 */
-		resultRelInfo = mtstate->mt_partitions;
-		returningList = linitial(node->returningLists);
-		for (i = 0; i < mtstate->mt_num_partitions; i++)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-			List	   *rlist;
-
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel);
-			resultRelInfo->ri_projectReturning =
-				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
-									 resultRelInfo->ri_RelationDesc->rd_att);
-			resultRelInfo++;
-		}
+		ExecInitPartitionReturningProjection(mtstate, rel);
 	}
 	else
 	{
@@ -2118,6 +2150,104 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 }
 
 /* ----------------------------------------------------------------
+ *		ExecInitPartitionWithCheckOptions
+ *
+ * Build WITH CHECK OPTION constraints for each leaf partition rel.
+ * Note that we don't build the withCheckOptionList for each partition
+ * within the planner, but simple translation of the varattnos for each
+ * partition suffices. This only occurs for the INSERT case; UPDATE/DELETE
+ * cases are handled separately.
+ * ----------------------------------------------------------------
+ */
+
+static void
+ExecInitPartitionWithCheckOptions(ModifyTableState *mtstate, Relation root_rel)
+{
+	ResultRelInfo  *resultRelInfo = mtstate->mt_partitions;
+	ModifyTable	   *node = (ModifyTable *) mtstate->ps.plan;
+	List		*wcoList;
+	int			i;
+
+	if (node->withCheckOptionLists == NIL || mtstate->mt_num_partitions == 0)
+		return;
+
+	wcoList = linitial(node->withCheckOptionLists);
+	for (i = 0; i < mtstate->mt_num_partitions; i++)
+	{
+		Relation	partrel = resultRelInfo->ri_RelationDesc;
+		List	   *mapped_wcoList;
+		List	   *wcoExprs = NIL;
+		ListCell   *ll;
+
+		/* varno = node->nominalRelation */
+		mapped_wcoList = map_partition_varattnos(wcoList,
+												 node->nominalRelation,
+												 partrel, root_rel);
+		foreach(ll, mapped_wcoList)
+		{
+			WithCheckOption *wco = (WithCheckOption *) lfirst(ll);
+			ExprState  *wcoExpr = ExecInitQual((List *) wco->qual,
+										   mtstate->mt_plans[i]);
+
+			wcoExprs = lappend(wcoExprs, wcoExpr);
+		}
+
+		resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
+		resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
+		resultRelInfo++;
+	}
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitPartitionReturningProjection
+ *
+ * Initialize stuff required to handle RETURNING for leaf partitions.
+ * We don't build the returningList for each partition within the planner, but
+ * simple translation of the varattnos for each partition suffices.  This
+ * actually is helpful only for INSERT case; UPDATE/DELETE are handled
+ * differently.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecInitPartitionReturningProjection(ModifyTableState *mtstate, Relation root_rel)
+{
+	ResultRelInfo  *resultRelInfo = mtstate->mt_partitions;
+	ModifyTable	   *node = (ModifyTable *) mtstate->ps.plan;
+	TupleTableSlot *returning_slot = mtstate->ps.ps_ResultTupleSlot;
+	List		   *returningList;
+	int				i;
+
+	/*
+	 * If there is no returning clause, or if we have already initialized the
+	 * returning projection info, there is nothing to be done.
+	 */
+	if (node->returningLists == NIL ||
+		(resultRelInfo && resultRelInfo->ri_projectReturning != NULL) ||
+		mtstate->mt_num_partitions == 0)
+		return;
+
+	returningList = linitial(node->returningLists);
+	for (i = 0; i < mtstate->mt_num_partitions; i++)
+	{
+		Relation	partrel = resultRelInfo->ri_RelationDesc;
+		List	   *rlist;
+
+		/* varno = node->nominalRelation */
+		rlist = map_partition_varattnos(returningList,
+										node->nominalRelation,
+										partrel, root_rel);
+		resultRelInfo->ri_projectReturning =
+			ExecBuildProjectionInfo(rlist,
+									mtstate->ps.ps_ExprContext,
+									returning_slot,
+									&mtstate->ps,
+									resultRelInfo->ri_RelationDesc->rd_att);
+		resultRelInfo++;
+	}
+}
+
+
+/* ----------------------------------------------------------------
  *		ExecEndModifyTable
  *
  *		Shuts down the plan.
@@ -2159,6 +2289,7 @@ ExecEndModifyTable(ModifyTableState *node)
 		heap_close(pd->reldesc, NoLock);
 		ExecDropSingleTupleTableSlot(pd->tupslot);
 	}
+
 	for (i = 0; i < node->mt_num_partitions; i++)
 	{
 		ResultRelInfo *resultRelInfo = node->mt_partitions + i;
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index d3849b9..102fc97 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,7 +187,9 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
 				TupleTableSlot *slot, TupleTableSlot *orig_slot,
-				EState *estate);
+				EState *estate, bool check_partition_constraint);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *orig_slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
@@ -216,6 +218,9 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate);
 
 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
 extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 9366f04..a56afab 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,25 +198,121 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c int
 ) partition by range (a, b);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE c > 120 WITH CHECK OPTION;
 create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
 create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
+create table part_c_1_100 partition of part_b_10_b_20 for values from (1) to (100);
+create table part_c_100_200 partition of part_b_10_b_20 for values from (100) to (200);
 insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+insert into part_a_10_a_20 values ('a', 10, 200);
+insert into part_c_1_100 values ('b', 12, 96);
+insert into part_c_1_100 values ('b', 13, 97);
+insert into part_c_100_200 values ('b', 15, 105);
+insert into part_c_100_200 values ('b', 17, 105);
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96; -- No row found :
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (b, 12, 116).
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 ;
+select * from part_c_1_100 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select * from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 12 | 116
+ b | 13 | 117
+ b | 15 | 125
+ b | 17 | 125
+(4 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (b, 7, 117).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+select * from part_a_1_a_10 order by 1, 2, 3;
+ a | b |  c  
+---+---+-----
+ a | 1 |    
+ a | 4 | 200
+(2 rows)
+
+select * from part_a_10_a_20 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select * from part_b_1_b_10 order by 1, 2, 3;
+ a | b |  c  
+---+---+-----
+ b | 7 | 117
+ b | 9 | 125
+(2 rows)
+
+select * from part_c_1_100 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select * from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 11 | 125
+ b | 12 | 116
+(2 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120).
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+select * from part_a_1_a_10 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+ a | 1 |  
+(1 row)
+
+select * from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 11 | 125
+ b | 12 | 116
+ b | 15 | 199
+(3 rows)
+
 -- cleanup
+drop view upview;
 drop table range_parted;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 6637119..cda9906 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,23 +107,61 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c int
 ) partition by range (a, b);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE c > 120 WITH CHECK OPTION;
+
 create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
 create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
+create table part_c_1_100 partition of part_b_10_b_20 for values from (1) to (100);
+create table part_c_100_200 partition of part_b_10_b_20 for values from (100) to (200);
 insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
-
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
-
+insert into part_a_10_a_20 values ('a', 10, 200);
+insert into part_c_1_100 values ('b', 12, 96);
+insert into part_c_1_100 values ('b', 13, 97);
+insert into part_c_100_200 values ('b', 15, 105);
+insert into part_c_100_200 values ('b', 17, 105);
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96; -- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 ;
+select * from part_c_1_100 order by 1, 2, 3;
+select * from part_c_100_200 order by 1, 2, 3;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+select * from part_a_1_a_10 order by 1, 2, 3;
+select * from part_a_10_a_20 order by 1, 2, 3;
+select * from part_b_1_b_10 order by 1, 2, 3;
+select * from part_c_1_100 order by 1, 2, 3;
+select * from part_c_100_200 order by 1, 2, 3;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+select * from part_a_1_a_10 order by 1, 2, 3;
+select * from part_c_100_200 order by 1, 2, 3;
 -- cleanup
+drop view upview;
 drop table range_parted;
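The regression tests above exercise the row-movement sequence end to end. As a rough illustrative sketch (plain Python with hypothetical names, not PostgreSQL code), the behaviour the patch implements for range partitions amounts to: try the update in place; if the new tuple violates the partition's bounds, delete it from the old partition and insert it into whichever partition of the subtree accepts it, erroring out if none does.

```python
class Partition:
    def __init__(self, name, lo, hi):      # bounds are lo <= key < hi
        self.name, self.lo, self.hi = name, lo, hi
        self.rows = []

    def accepts(self, key):
        return self.lo <= key < self.hi

def update_row(partitions, old_key, new_key):
    """Toy model of UPDATE with row movement across range partitions."""
    src = next(p for p in partitions if old_key in p.rows)
    if src.accepts(new_key):               # ordinary in-place update
        src.rows[src.rows.index(old_key)] = new_key
        return src.name
    src.rows.remove(old_key)               # DELETE from the old partition
    for dest in partitions:                # search the partition subtree
        if dest.accepts(new_key):
            dest.rows.append(new_key)      # INSERT into the new partition
            return dest.name
    raise ValueError("new row violates partition constraint")

parts = [Partition("part_c_1_100", 1, 100),
         Partition("part_c_100_200", 100, 200)]
parts[0].rows.append(96)
print(update_row(parts, 96, 116))          # prints part_c_100_200
```

This also mirrors the test case that fails: updating a leaf partition directly is like passing a one-element `partitions` list, so a key that leaves the bounds has nowhere to go and raises the constraint error.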
#37Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Khandekar (#36)
Re: UPDATE of partition key

Hi Amit,

Thanks for the updated patches.

On 2017/03/28 19:12, Amit Khandekar wrote:

On 27 March 2017 at 13:05, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Also, there are a few places in the documentation mentioning that such updates cause error,
which will need to be updated. Perhaps also add some explanatory notes
about the mechanism (delete+insert), trigger behavior, caveats, etc.
There were some points discussed upthread that could be mentioned in the
documentation.
Yeah, I agree. Documentation needs some important changes. I am still
working on them.

Attached patch v5 has above required doc changes added.

In the section 5.11 "Partitioning" => subsection 5 "Caveats", I have
removed the caveat of not being able to update partition key. And it
is now replaced by the caveat where an update/delete operations can
silently miss a row when there is a concurrent UPDATE of partition-key
happening.

Hmm, how about just removing the "partition-changing updates are
disallowed" caveat from the list on the 5.11 Partitioning page and
explaining the concurrency-related caveats on the UPDATE reference page?

UPDATE row movement behaviour is described in :
Part VI "Reference => SQL Commands => UPDATE

On 4 March 2017 at 12:49, Robert Haas <robertmhaas@gmail.com> wrote:

How about running the BR update triggers for the old partition and the
AR update triggers for the new partition? It seems weird to run BR
update triggers but not AR update triggers. Another option would be
to run BR and AR delete triggers and then BR and AR insert triggers,
emphasizing the choice to treat this update as a delete + insert, but
(as Amit Kh. pointed out to me when we were in a room together this
week) that precludes using the BEFORE trigger to modify the row.

I checked the trigger behaviour in case of UPSERT. Here, when a
conflict is found, ExecOnConflictUpdate() is called, and then the
function returns immediately, which means the AR INSERT trigger will
not fire. And ExecOnConflictUpdate() calls ExecUpdate(), which means
the BR and AR UPDATE triggers will be fired. So in short, when an
INSERT becomes an UPDATE, the BR INSERT triggers do fire, but then the
BR UPDATE and AR UPDATE triggers also get fired. On the same lines, it
makes sense in the case of the UPDATE => DELETE+INSERT conversion to
fire a BR UPDATE trigger on the original table, and then the BR and AR
DELETE/INSERT triggers on the respective tables.

So the common policy can be:
Fire the BR trigger of the original statement. It can be an
INSERT/UPDATE/DELETE trigger depending upon what the statement is.
If there is a change in the operation, then all the triggers for the
operation it is converted to (UPDATE->DELETE+INSERT or INSERT->UPDATE)
are fired as well.

The current patch already has the behaviour as per the above policy. So I
have included a description of this trigger-related behaviour in the
"Overview of Trigger Behavior" section of the docs. This has been
derived from the way the trigger behaviour for UPSERT is described in
the preceding section.
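The resulting firing sequence for a row-movement UPDATE can be sketched as follows (a toy Python sketch, not PostgreSQL code; the partition names are taken from the regression test in the patch):

```python
# Toy sketch of the trigger-firing policy described above, for an UPDATE
# that is converted into DELETE + INSERT because the row moves partitions.

fired = []

def fire(when, event, rel):
    fired.append(f"{when} {event} ON {rel}")

def update_with_row_movement(src, dst):
    # The BR trigger of the original statement fires first ...
    fire("BR", "UPDATE", src)
    # ... then, since the operation is converted, the triggers of the
    # converted operations fire: ExecDelete() runs the DELETE triggers on
    # the source partition, then ExecInsert() runs the INSERT triggers on
    # the destination partition. No AR UPDATE trigger fires.
    fire("BR", "DELETE", src)
    fire("AR", "DELETE", src)
    fire("BR", "INSERT", dst)
    fire("AR", "INSERT", dst)

update_with_row_movement("part_c_1_100", "part_c_100_200")
print(fired)
```

Note the asymmetry this encodes: the BR UPDATE trigger runs (so it can still modify the row), but no AR UPDATE trigger runs, because by then the operation has become a DELETE plus an INSERT.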

I tested how various row-level triggers behave and it all seems to work as
you have described in your various messages, which the latest patch also
documents.

Some comments on the patch itself:

-      An <command>UPDATE</> that causes a row to move from one partition to
-      another fails, because the new value of the row fails to satisfy the
-      implicit partition constraint of the original partition.  This might
-      change in future releases.
+      An <command>UPDATE</> causes a row to move from one partition to another
+      if the new value of the row fails to satisfy the implicit partition
<snip>

As mentioned above, we can simply remove this item from the list of
caveats on ddl.sgml. The new text can be moved to the Notes portion of
the UPDATE reference page.

+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it is possible that all row-level
+    <literal>BEFORE</> <command>UPDATE</command> triggers and all row-level
+    <literal>BEFORE</> <command>DELETE</command>/<command>INSERT</command>
+    triggers are applied on the respective partitions in a way that is apparent
+    from the final state of the updated row.

How about dropping "it is possible that" from this sentence?

+ <command>UPDATE</command> is done by doing a <command>DELETE</command> on

How about: s/is done/is performed/g

+    triggers are not applied because the <command>UPDATE</command> is converted
+    to a <command>DELETE</command> and <command>UPDATE</command>.

I think you meant DELETE and INSERT.

+        if (resultRelInfo->ri_PartitionCheck &&
+            !ExecPartitionCheck(resultRelInfo, slot, estate))
+        {

How about a one-line comment what this block of code does?

-         * Check the constraints of the tuple.  Note that we pass the same
+         * Check the constraints of the tuple. Note that we pass the same

I think that this hunk is not necessary. (I've heard that two spaces
after a sentence-ending period is not a problem [1].)

+ * We have already run partition constraints above, so skip them below.

How about: s/run/checked the/g?

@@ -2159,6 +2289,7 @@ ExecEndModifyTable(ModifyTableState *node)
heap_close(pd->reldesc, NoLock);
ExecDropSingleTupleTableSlot(pd->tupslot);
}
+
for (i = 0; i < node->mt_num_partitions; i++)
{
ResultRelInfo *resultRelInfo = node->mt_partitions + i;

Needless hunk.

Overall, I think the patch looks good now. Thanks again for working on it.

Thanks,
Amit

[1]: https://www.python.org/dev/peps/pep-0008/#comments

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#38Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#1)
1 attachment(s)
Re: UPDATE of partition key

For some reason, my reply got sent only to Amit Langote instead of
reply-to-all. Below is the mail reply. Thanks Amit Langote for
bringing this to my notice.

On 31 March 2017 at 16:54, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 31 March 2017 at 14:04, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/03/28 19:12, Amit Khandekar wrote:

In the section 5.11 "Partitioning" => subsection 5 "Caveats", I have
removed the caveat of not being able to update partition key. And it
is now replaced by the caveat where an update/delete operations can
silently miss a row when there is a concurrent UPDATE of partition-key
happening.

Hmm, how about just removing the "partition-changing updates are
disallowed" caveat from the list on the 5.11 Partitioning page and explaining
the concurrency-related caveats on the UPDATE reference page?

IMHO this caveat is better placed in the Partitioning chapter, to emphasize
that it is a drawback specifically in the presence of partitioning.

+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it is possible that all row-level
+    <literal>BEFORE</> <command>UPDATE</command> triggers and all row-level
+    <literal>BEFORE</> <command>DELETE</command>/<command>INSERT</command>
+    triggers are applied on the respective partitions in a way that is apparent
+    from the final state of the updated row.

How about dropping "it is possible that" from this sentence?

What the statement means is: "It is true that all triggers are
applied on the respective partitions; but it is possible that they are
applied in a way that is apparent from the final state of the updated
row". So "possible" applies to "in a way that is apparent ...". It
means the user should be aware that all the triggers can change the
row, and so the final row will be affected by all those triggers.
Actually, we have a similar statement for UPSERT involving triggers in
the preceding section. I have taken the statement from there.

+ <command>UPDATE</command> is done by doing a <command>DELETE</command> on

How about: s/is done/is performed/g

Done.

+    triggers are not applied because the <command>UPDATE</command> is converted
+    to a <command>DELETE</command> and <command>UPDATE</command>.

I think you meant DELETE and INSERT.

Oops. Corrected.

+        if (resultRelInfo->ri_PartitionCheck &&
+            !ExecPartitionCheck(resultRelInfo, slot, estate))
+        {

How about a one-line comment what this block of code does?

Yes, this was needed. Added a comment.

-         * Check the constraints of the tuple.  Note that we pass the same
+         * Check the constraints of the tuple. Note that we pass the same

I think that this hunk is not necessary. (I've heard that two spaces
after a sentence-ending period is not a problem [1].)

Actually, I accidentally removed one space, thinking that it was one of
my own comments. Reverted this change, since it is a needless hunk.

+ * We have already run partition constraints above, so skip them below.

How about: s/run/checked the/g?

Done.

@@ -2159,6 +2289,7 @@ ExecEndModifyTable(ModifyTableState *node)
heap_close(pd->reldesc, NoLock);
ExecDropSingleTupleTableSlot(pd->tupslot);
}
+
for (i = 0; i < node->mt_num_partitions; i++)
{
ResultRelInfo *resultRelInfo = node->mt_partitions + i;

Needless hunk.

Right. Removed.

Overall, I think the patch looks good now. Thanks again for working on it.

Thanks Amit for your efforts in reviewing the patch. Attached is v6
patch that contains above points handled.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

update-partition-key_v6.patch (application/octet-stream)
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index d1e915c11..a3ee3fa 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3845,10 +3845,21 @@ ANALYZE measurement;
    <itemizedlist>
     <listitem>
      <para>
-      An <command>UPDATE</> that causes a row to move from one partition to
-      another fails, because the new value of the row fails to satisfy the
-      implicit partition constraint of the original partition.  This might
-      change in future releases.
+      An <command>UPDATE</> causes a row to move from one partition to another
+      if the new value of the row fails to satisfy the implicit partition
+      constraint of the original partition but there is another partition which
+      can fit this row. During such a row movement, suppose there is another
+      concurrent session for which this row is still visible, and it is about
+      to do an <command>UPDATE</> or <command>DELETE</> operation on the same
+      row. This DML operation can silently miss this row if the row now gets
+      deleted from the partition by the first session as part of its
+      <command>UPDATE</> row movement. In such case, the concurrent
+      <command>UPDATE</>/<command>DELETE</>, being unaware of the row movement,
+      interprets that the row has just been deleted so there is nothing to be
+      done for this row. Whereas, in the usual case where the table is not
+      partitioned, or where there is no row movement, the second session would
+      have identified the newly updated row and carried
+      <command>UPDATE</>/<command>DELETE</> on this new row version.
      </para>
     </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 8a1619f..feb1c3e 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,13 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</> and
+   <command>INSERT</> operation.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index 8f724c8..c343825 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -151,6 +151,33 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it is possible that all row-level
+    <literal>BEFORE</> <command>UPDATE</command> triggers and all row-level
+    <literal>BEFORE</> <command>DELETE</command>/<command>INSERT</command>
+    triggers are applied on the respective partitions in a way that is apparent
+    from the final state of the updated row. This is because the
+    <command>UPDATE</command> is performed by doing a <command>DELETE</command>
+    on the original partition and an <command>INSERT</command> on the partition
+    where the row is moved. So a <literal>BEFORE</> <command>UPDATE</command>
+    trigger followed by <literal>BEFORE</> <command>DELETE</command> trigger
+    are applied if defined for the original partition, followed by
+    <literal>BEFORE</> <command>INSERT</command> trigger if defined on the
+    destination partition. The possibility of surprising outcomes should be
+    considered when all these triggers affect the row being moved. As far as
+    <literal>AFTER ROW</> triggers are concerned, <literal>AFTER</>
+    <command>DELETE</command> and <literal>AFTER</> <command>INSERT</command>
+    triggers are applied; but <literal>AFTER</> <command>UPDATE</command>
+    triggers are not applied because the <command>UPDATE</command> is converted
+    to a <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, if row movement happens, there
+    would not be any <command>DELETE</command> or <command>INSERT</command>
+    triggers applied. Only the <command>UPDATE</command> triggers defined on
+    the main target table used in the <command>UPDATE</command> statement will
+    be applied.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 0158eda..3de4411 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2650,7 +2650,7 @@ CopyFrom(CopyState cstate)
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr ||
 					resultRelInfo->ri_PartitionCheck)
-					ExecConstraints(resultRelInfo, slot, oldslot, estate);
+					ExecConstraints(resultRelInfo, slot, oldslot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index f2995f2..2912054 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1778,7 +1778,7 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
  *
  * Note: This is called *iff* resultRelInfo is the main target table.
  */
-static bool
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
@@ -1815,8 +1815,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if requested,
+ * checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing,
@@ -1826,7 +1826,7 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
 				TupleTableSlot *slot, TupleTableSlot *orig_slot,
-				EState *estate)
+				EState *estate, bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -1913,33 +1913,51 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck &&
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
 		!ExecPartitionCheck(resultRelInfo, slot, estate))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+		ExecPartitionCheckEmitError(resultRelInfo, orig_slot, estate);
+}
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
-		{
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-		}
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ *
+ * 'orig_slot' contains the original tuple to be shown in the error message.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *orig_slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
 
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 orig_slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-		  errmsg("new row for relation \"%s\" violates partition constraint",
-				 RelationGetRelationName(orig_rel)),
-			val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
+
+	/* See the comments in ExecConstraints. */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 orig_slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+	  errmsg("new row for relation \"%s\" violates partition constraint",
+			 RelationGetRelationName(orig_rel)),
+		val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index f20d728..2f76140 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -389,7 +389,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, slot, estate);
+			ExecConstraints(resultRelInfo, slot, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -448,7 +448,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, slot, estate);
+			ExecConstraints(resultRelInfo, slot, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 0b524e0..388723b 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -62,7 +62,10 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 EState *estate,
 					 bool canSetTag,
 					 TupleTableSlot **returning);
-
+static void ExecInitPartitionWithCheckOptions(ModifyTableState *mtstate,
+											  Relation root_rel);
+static void ExecInitPartitionReturningProjection(ModifyTableState *mtstate,
+												 Relation root_rel);
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
  * target relation's rowtype
@@ -435,7 +438,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * Check the constraints of the tuple
 		 */
 		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, oldslot, estate);
+			ExecConstraints(resultRelInfo, slot, oldslot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -625,6 +628,8 @@ ExecDelete(ItemPointer tupleid,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool   *concurrently_deleted,
+		   bool returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -633,6 +638,9 @@ ExecDelete(ItemPointer tupleid,
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
 
+	if (concurrently_deleted)
+		*concurrently_deleted = false;
+
 	/*
 	 * get information on the (current) result relation
 	 */
@@ -776,6 +784,8 @@ ldelete:;
 					}
 				}
 				/* tuple already deleted; nothing to do */
+				if (concurrently_deleted)
+					*concurrently_deleted = true;
 				return NULL;
 
 			default:
@@ -799,8 +809,8 @@ ldelete:;
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -878,7 +888,8 @@ ldelete:;
  * ----------------------------------------------------------------
  */
 static TupleTableSlot *
-ExecUpdate(ItemPointer tupleid,
+ExecUpdate(ModifyTableState *mtstate,
+		   ItemPointer tupleid,
 		   HeapTuple oldtuple,
 		   TupleTableSlot *slot,
 		   TupleTableSlot *planSlot,
@@ -988,12 +999,90 @@ lreplace:;
 								 resultRelInfo, slot, estate);
 
 		/*
+		 * If a partition check fails, try to move the row into the right
+		 * partition.
+		 */
+		if (resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate))
+		{
+			bool	is_partitioned_table = true;
+
+			if (mtstate->mt_partition_dispatch_info == NULL)
+			{
+				ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+				Relation root_rel;
+
+				/*
+				 * If this is a partitioned table, we need to open the root
+				 * table RT index which is at the head of partitioned_rels
+				 */
+				if (node->partitioned_rels)
+				{
+					Index	root_rti;
+					Oid		root_oid;
+
+					root_rti = linitial_int(node->partitioned_rels);
+					root_oid = getrelid(root_rti, estate->es_range_table);
+					root_rel = heap_open(root_oid, NoLock);	/* locked by InitPlan */
+				}
+				else /* this may be a leaf partition */
+					root_rel = mtstate->resultRelInfo->ri_RelationDesc;
+
+				is_partitioned_table =
+					root_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;
+
+				if (is_partitioned_table)
+					ExecSetupPartitionTupleRouting(
+										root_rel,
+										&mtstate->mt_partition_dispatch_info,
+										&mtstate->mt_partitions,
+										&mtstate->mt_partition_tupconv_maps,
+										&mtstate->mt_partition_tuple_slot,
+										&mtstate->mt_num_dispatch,
+										&mtstate->mt_num_partitions);
+
+				/* Build WITH CHECK OPTION constraints for leaf partitions */
+				ExecInitPartitionWithCheckOptions(mtstate, root_rel);
+
+				/* Build a projection for each leaf partition rel. */
+				ExecInitPartitionReturningProjection(mtstate, root_rel);
+
+				/* Close the root partitioned rel if we opened it above. */
+				if (root_rel != mtstate->resultRelInfo->ri_RelationDesc)
+					heap_close(root_rel, NoLock);
+			}
+
+			if (is_partitioned_table)
+			{
+				bool	concurrently_deleted;
+
+				/*
+				 * Skip RETURNING processing for DELETE. We want to return rows
+				 * from INSERT.
+				 */
+				ExecDelete(tupleid, oldtuple, planSlot, epqstate, estate,
+						   &concurrently_deleted, false, false);
+
+				if (concurrently_deleted)
+					return NULL;
+
+				return ExecInsert(mtstate, slot, planSlot, NULL,
+									  ONCONFLICT_NONE, estate, canSetTag);
+			}
+
+			/* It's not a partitioned table after all; error out. */
+			ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+		}
+
+		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We have already checked partition constraints above, so skip them
+		 * below.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1313,7 +1402,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 	 */
 
 	/* Execute UPDATE with projection */
-	*returning = ExecUpdate(&tuple.t_self, NULL,
+	*returning = ExecUpdate(mtstate, &tuple.t_self, NULL,
 							mtstate->mt_conflproj, planSlot,
 							&mtstate->mt_epqstate, mtstate->ps.state,
 							canSetTag);
@@ -1583,12 +1672,13 @@ ExecModifyTable(ModifyTableState *node)
 								  estate, node->canSetTag);
 				break;
 			case CMD_UPDATE:
-				slot = ExecUpdate(tupleid, oldtuple, slot, planSlot,
+				slot = ExecUpdate(node, tupleid, oldtuple, slot, planSlot,
 								&node->mt_epqstate, estate, node->canSetTag);
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(tupleid, oldtuple, planSlot,
-								&node->mt_epqstate, estate, node->canSetTag);
+								&node->mt_epqstate, estate,
+								NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1790,44 +1880,10 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	}
 
 	/*
-	 * Build WITH CHECK OPTION constraints for each leaf partition rel.
-	 * Note that we didn't build the withCheckOptionList for each partition
-	 * within the planner, but simple translation of the varattnos for each
-	 * partition will suffice.  This only occurs for the INSERT case;
-	 * UPDATE/DELETE cases are handled above.
+	 * Build WITH CHECK OPTION constraints for each leaf partition rel. This
+	 * only occurs for INSERT case; UPDATE/DELETE are handled above.
 	 */
-	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
-	{
-		List		*wcoList;
-
-		Assert(operation == CMD_INSERT);
-		resultRelInfo = mtstate->mt_partitions;
-		wcoList = linitial(node->withCheckOptionLists);
-		for (i = 0; i < mtstate->mt_num_partitions; i++)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-			List	   *mapped_wcoList;
-			List	   *wcoExprs = NIL;
-			ListCell   *ll;
-
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel);
-			foreach(ll, mapped_wcoList)
-			{
-				WithCheckOption *wco = (WithCheckOption *) lfirst(ll);
-				ExprState  *wcoExpr = ExecInitQual((List *) wco->qual,
-												   mtstate->mt_plans[i]);
-
-				wcoExprs = lappend(wcoExprs, wcoExpr);
-			}
-
-			resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
-			resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
-			resultRelInfo++;
-		}
-	}
+	ExecInitPartitionWithCheckOptions(mtstate, rel);
 
 	/*
 	 * Initialize RETURNING projections if needed.
@@ -1836,7 +1892,6 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -1870,28 +1925,10 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		}
 
 		/*
-		 * Build a projection for each leaf partition rel.  Note that we
-		 * didn't build the returningList for each partition within the
-		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * Build a projection for each leaf partition rel. This only occurs for
+		 * the INSERT case; UPDATE/DELETE are handled above.
 		 */
-		resultRelInfo = mtstate->mt_partitions;
-		returningList = linitial(node->returningLists);
-		for (i = 0; i < mtstate->mt_num_partitions; i++)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-			List	   *rlist;
-
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel);
-			resultRelInfo->ri_projectReturning =
-				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
-									 resultRelInfo->ri_RelationDesc->rd_att);
-			resultRelInfo++;
-		}
+		ExecInitPartitionReturningProjection(mtstate, rel);
 	}
 	else
 	{
@@ -2118,6 +2155,104 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 }
 
 /* ----------------------------------------------------------------
+ *		ExecInitPartitionWithCheckOptions
+ *
+ * Build WITH CHECK OPTION constraints for each leaf partition rel.
+ * Note that we don't build the withCheckOptionList for each partition
+ * within the planner, but simple translation of the varattnos for each
+ * partition suffices. This only occurs for the INSERT case; UPDATE/DELETE
+ * cases are handled separately.
+ * ----------------------------------------------------------------
+ */
+
+static void
+ExecInitPartitionWithCheckOptions(ModifyTableState *mtstate, Relation root_rel)
+{
+	ResultRelInfo  *resultRelInfo = mtstate->mt_partitions;
+	ModifyTable	   *node = (ModifyTable *) mtstate->ps.plan;
+	List		*wcoList;
+	int			i;
+
+	if (node->withCheckOptionLists == NIL || mtstate->mt_num_partitions == 0)
+		return;
+
+	wcoList = linitial(node->withCheckOptionLists);
+	for (i = 0; i < mtstate->mt_num_partitions; i++)
+	{
+		Relation	partrel = resultRelInfo->ri_RelationDesc;
+		List	   *mapped_wcoList;
+		List	   *wcoExprs = NIL;
+		ListCell   *ll;
+
+		/* varno = node->nominalRelation */
+		mapped_wcoList = map_partition_varattnos(wcoList,
+												 node->nominalRelation,
+												 partrel, root_rel);
+		foreach(ll, mapped_wcoList)
+		{
+			WithCheckOption *wco = (WithCheckOption *) lfirst(ll);
+			ExprState  *wcoExpr = ExecInitQual((List *) wco->qual,
+										   mtstate->mt_plans[i]);
+
+			wcoExprs = lappend(wcoExprs, wcoExpr);
+		}
+
+		resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
+		resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
+		resultRelInfo++;
+	}
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitPartitionReturningProjection
+ *
+ * Initialize stuff required to handle RETURNING for leaf partitions.
+ * We don't build the returningList for each partition within the planner, but
+ * simple translation of the varattnos for each partition suffices.  This
+ * actually is helpful only for INSERT case; UPDATE/DELETE are handled
+ * differently.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecInitPartitionReturningProjection(ModifyTableState *mtstate, Relation root_rel)
+{
+	ResultRelInfo  *resultRelInfo = mtstate->mt_partitions;
+	ModifyTable	   *node = (ModifyTable *) mtstate->ps.plan;
+	TupleTableSlot *returning_slot = mtstate->ps.ps_ResultTupleSlot;
+	List		   *returningList;
+	int				i;
+
+	/*
+	 * If there is no returning clause, or if we have already initialized the
+	 * returning projection info, there is nothing to be done.
+	 */
+	if (node->returningLists == NIL ||
+		(resultRelInfo && resultRelInfo->ri_projectReturning != NULL) ||
+		mtstate->mt_num_partitions == 0)
+		return;
+
+	returningList = linitial(node->returningLists);
+	for (i = 0; i < mtstate->mt_num_partitions; i++)
+	{
+		Relation	partrel = resultRelInfo->ri_RelationDesc;
+		List	   *rlist;
+
+		/* varno = node->nominalRelation */
+		rlist = map_partition_varattnos(returningList,
+										node->nominalRelation,
+										partrel, root_rel);
+		resultRelInfo->ri_projectReturning =
+			ExecBuildProjectionInfo(rlist,
+									mtstate->ps.ps_ExprContext,
+									returning_slot,
+									&mtstate->ps,
+									resultRelInfo->ri_RelationDesc->rd_att);
+		resultRelInfo++;
+	}
+}
+
+
+/* ----------------------------------------------------------------
  *		ExecEndModifyTable
  *
  *		Shuts down the plan.
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index d3849b9..102fc97 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,7 +187,9 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
 				TupleTableSlot *slot, TupleTableSlot *orig_slot,
-				EState *estate);
+				EState *estate, bool check_partition_constraint);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *orig_slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
@@ -216,6 +218,9 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate);
 
 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
 extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 9366f04..a56afab 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,25 +198,121 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c int
 ) partition by range (a, b);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE c > 120 WITH CHECK OPTION;
 create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
 create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
+create table part_c_1_100 partition of part_b_10_b_20 for values from (1) to (100);
+create table part_c_100_200 partition of part_b_10_b_20 for values from (100) to (200);
 insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+insert into part_a_10_a_20 values ('a', 10, 200);
+insert into part_c_1_100 values ('b', 12, 96);
+insert into part_c_1_100 values ('b', 13, 97);
+insert into part_c_100_200 values ('b', 15, 105);
+insert into part_c_100_200 values ('b', 17, 105);
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96; -- No row found :
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (b, 12, 116).
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 ;
+select * from part_c_1_100 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select * from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 12 | 116
+ b | 13 | 117
+ b | 15 | 125
+ b | 17 | 125
+(4 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (b, 7, 117).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+select * from part_a_1_a_10 order by 1, 2, 3;
+ a | b |  c  
+---+---+-----
+ a | 1 |    
+ a | 4 | 200
+(2 rows)
+
+select * from part_a_10_a_20 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select * from part_b_1_b_10 order by 1, 2, 3;
+ a | b |  c  
+---+---+-----
+ b | 7 | 117
+ b | 9 | 125
+(2 rows)
+
+select * from part_c_1_100 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select * from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 11 | 125
+ b | 12 | 116
+(2 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120).
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+select * from part_a_1_a_10 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+ a | 1 |  
+(1 row)
+
+select * from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 11 | 125
+ b | 12 | 116
+ b | 15 | 199
+(3 rows)
+
 -- cleanup
+drop view upview;
 drop table range_parted;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 6637119..cda9906 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,23 +107,61 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c int
 ) partition by range (a, b);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE c > 120 WITH CHECK OPTION;
+
 create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
 create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
+create table part_c_1_100 partition of part_b_10_b_20 for values from (1) to (100);
+create table part_c_100_200 partition of part_b_10_b_20 for values from (100) to (200);
 insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
-
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
-
+insert into part_a_10_a_20 values ('a', 10, 200);
+insert into part_c_1_100 values ('b', 12, 96);
+insert into part_c_1_100 values ('b', 13, 97);
+insert into part_c_100_200 values ('b', 15, 105);
+insert into part_c_100_200 values ('b', 17, 105);
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96; -- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 ;
+select * from part_c_1_100 order by 1, 2, 3;
+select * from part_c_100_200 order by 1, 2, 3;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+select * from part_a_1_a_10 order by 1, 2, 3;
+select * from part_a_10_a_20 order by 1, 2, 3;
+select * from part_b_1_b_10 order by 1, 2, 3;
+select * from part_c_1_100 order by 1, 2, 3;
+select * from part_c_100_200 order by 1, 2, 3;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+select * from part_a_1_a_10 order by 1, 2, 3;
+select * from part_c_100_200 order by 1, 2, 3;
 -- cleanup
+drop view upview;
 drop table range_parted;
#39Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Khandekar (#38)
Re: UPDATE of partition key

Hi Amit,

Thanks for updating the patch. Since ddl.sgml got updated on Saturday,
patch needs a rebase.

On 31 March 2017 at 16:54, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 31 March 2017 at 14:04, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/03/28 19:12, Amit Khandekar wrote:

In the section 5.11 "Partitioning" => subsection 5 "Caveats", I have
removed the caveat of not being able to update partition key. And it
is now replaced by the caveat where an update/delete operations can
silently miss a row when there is a concurrent UPDATE of partition-key
happening.

Hmm, how about just removing the "partition-changing updates are
disallowed" caveat from the list on the 5.11 Partitioning page and explain
the concurrency-related caveats on the UPDATE reference page?

IMHO this caveat is better placed in the Partitioning chapter, to emphasize
that it is a drawback specifically in the presence of partitioning.

I mean we fixed things for declarative partitioning, so it's no longer a
caveat like it is for partitioning implemented using inheritance (in that
the former doesn't require user-defined triggers to implement
row movement). Looking at the first sentence, that is:

An <command>UPDATE</> causes a row to move from one partition to another
if the new value of the row fails to satisfy the implicit partition
constraint of the original partition but there is another partition which
can fit this row.

which clearly seems to suggest that row-movement, if required, is handled
by the system. So it's not clear why it's in this list. If we want to
describe the limitations of the current implementation, we'll need to
rephrase it a bit. How about something like:

For an <command>UPDATE</> that causes a row to move from one partition to
another due to the partition key being updated, the following caveats exist:
<a brief description of the possibility of surprising results in the
presence of concurrent manipulation of the row in question>

+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it is possible that all row-level
+    <literal>BEFORE</> <command>UPDATE</command> triggers and all row-level
+    <literal>BEFORE</> <command>DELETE</command>/<command>INSERT</command>
+    triggers are applied on the respective partitions in a way that is
apparent
+    from the final state of the updated row.

How about dropping "it is possible that" from this sentence?

What the statement means is: "It is true that all triggers are
applied on the respective partitions; but it is possible that they are
applied in a way that is apparent from the final state of the updated
row". So "possible" applies to "in a way that is apparent ...". It
means the user should be aware that all the triggers can change the
row, and so the final row will be affected by all those triggers.
Actually, we have a similar statement for UPSERT combined with
triggers in the preceding section; I have taken the statement from
there.

The place where it appears in that sentence made me think it could be
confusing to some. How about reordering the sentences in that paragraph so
that the whole paragraph reads as follows:

If an UPDATE on a partitioned table causes a row to move to another
partition, it will be performed as a DELETE from the original partition
followed by INSERT into the new partition. In this case, all row-level
BEFORE UPDATE triggers and all row-level BEFORE DELETE triggers are fired
on the original partition. Then all row-level BEFORE INSERT triggers are
fired on the destination partition. The possibility of surprising outcomes
should be considered when all these triggers affect the row being moved.
As far as AFTER ROW triggers are concerned, AFTER DELETE and AFTER INSERT
triggers are applied; but AFTER UPDATE triggers are not applied because
the UPDATE has been converted to a DELETE and INSERT. None of the DELETE
and INSERT statement-level triggers are fired, even if row movement
occurs; only the UPDATE triggers of the target table used in the UPDATE
statement will be fired.
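
The firing order described above can be illustrated with a small sketch
(table and trigger names here are hypothetical, and this of course assumes
the patched row-movement behavior):

```sql
-- Hypothetical illustration of the trigger firing order during row movement.
CREATE TABLE p (a int) PARTITION BY RANGE (a);
CREATE TABLE p1 PARTITION OF p FOR VALUES FROM (1) TO (10);
CREATE TABLE p2 PARTITION OF p FOR VALUES FROM (10) TO (20);

CREATE FUNCTION log_trig() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
  RAISE NOTICE '% % on %', TG_WHEN, TG_OP, TG_TABLE_NAME;
  IF TG_OP = 'DELETE' THEN RETURN OLD; END IF;
  RETURN NEW;
END $$;

CREATE TRIGGER p1_bru BEFORE UPDATE ON p1 FOR EACH ROW EXECUTE PROCEDURE log_trig();
CREATE TRIGGER p1_brd BEFORE DELETE ON p1 FOR EACH ROW EXECUTE PROCEDURE log_trig();
CREATE TRIGGER p2_bri BEFORE INSERT ON p2 FOR EACH ROW EXECUTE PROCEDURE log_trig();

INSERT INTO p VALUES (5);
UPDATE p SET a = 15;
-- Per the description above, this should emit:
--   NOTICE:  BEFORE UPDATE on p1
--   NOTICE:  BEFORE DELETE on p1
--   NOTICE:  BEFORE INSERT on p2
-- and an AFTER UPDATE trigger on p1, if one existed, would not fire.
```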

Finally, I forgot to mention during the last review that the new parameter
'returning' to ExecDelete() could be called 'process_returning'.

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#40Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Langote (#39)
1 attachment(s)
Re: UPDATE of partition key

On 3 April 2017 at 17:13, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

Hi Amit,

Thanks for updating the patch. Since ddl.sgml got updated on Saturday,
patch needs a rebase.

Rebased now.

On 31 March 2017 at 16:54, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 31 March 2017 at 14:04, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/03/28 19:12, Amit Khandekar wrote:

In the section 5.11 "Partitioning" => subsection 5 "Caveats", I have
removed the caveat of not being able to update partition key. And it
is now replaced by the caveat where an update/delete operations can
silently miss a row when there is a concurrent UPDATE of partition-key
happening.

Hmm, how about just removing the "partition-changing updates are
disallowed" caveat from the list on the 5.11 Partitioning page and explain
the concurrency-related caveats on the UPDATE reference page?

IMHO this caveat is better placed in Partitioning chapter to emphasize
that it is a drawback specifically in presence of partitioning.

I mean we fixed things for declarative partitioning, so it's no longer a
caveat like it is for partitioning implemented using inheritance (in that
the former doesn't require user-defined triggers to implement
row movement). Looking at the first sentence, that is:

An <command>UPDATE</> causes a row to move from one partition to another
if the new value of the row fails to satisfy the implicit partition
constraint of the original partition but there is another partition which
can fit this row.

which clearly seems to suggest that row-movement, if required, is handled
by the system. So it's not clear why it's in this list. If we want to
describe the limitations of the current implementation, we'll need to
rephrase it a bit.

Yes I agree.

How about something like:
For an <command>UPDATE</> that causes a row to move from one partition to
another due to the partition key being updated, the following caveats exist:
<a brief description of the possibility of surprising results in the
presence of concurrent manipulation of the row in question>

With the slightly changed doc structure for partitioning in the
latest master, I have added the following note at the end of section
"5.10.2. Declarative Partitioning":

---

"Updating the partition key of a row might cause it to be moved into a
different partition where this row satisfies its partition
constraint."

---

Then, in the Limitations section, I have replaced the earlier
can't-update-partition-key limitation with this new limitation,
as below:

"When an UPDATE causes a row to move from one partition to another,
there is a chance that another concurrent UPDATE or DELETE misses this
row. Suppose that, during the row movement, the row is still visible
to the concurrent session, which is about to perform an UPDATE or
DELETE operation on the same row. This DML operation can silently miss
the row if the row gets deleted from the partition by the first
session as part of its UPDATE row movement. In such a case, the
concurrent UPDATE/DELETE, being unaware of the row movement, concludes
that the row has just been deleted, so there is nothing to be done for
this row. In the usual case, where the table is not partitioned or
where there is no row movement, the second session would have
identified the newly updated row and carried out the UPDATE/DELETE on
this new row version."
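
For concreteness, the missed-row scenario could look like this hypothetical
two-session transcript (using the range_parted table from the regression
tests, and assuming the patched row-movement behavior):

```sql
-- Session A:
BEGIN;
UPDATE range_parted SET a = 'b', b = 15 WHERE a = 'a' AND b = 1;
-- The row is moved: DELETEd from part_a_1_a_10, INSERTed into a 'b' partition.

-- Session B, concurrently (READ COMMITTED):
UPDATE range_parted SET c = 1 WHERE a = 'a' AND b = 1;
-- Blocks on the row lock held by session A.

-- Session A:
COMMIT;

-- Session B now finds the old row version deleted. Being unaware of the
-- row movement, it reports "UPDATE 0" instead of following the update
-- chain to the row's new version in the other partition.
```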

---

Further, in the Notes section of update.sgml, I have added a link to
the above Limitations section, like this:

"In the case of a partitioned table, updating a row might cause it to
no longer satisfy the partition constraint of the containing
partition. In that case, if there is some other partition in the
partition tree for which this row satisfies its partition constraint,
then the row is moved to that partition. If there is no such
partition, an error will occur. The error will also occur when
updating a partition directly. Behind the scenes, the row movement is
actually a DELETE and INSERT operation. However, there is a
possibility that a concurrent UPDATE or DELETE on the same row may
miss this row. For details, see Section 5.10.2.3."

+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it is possible that all row-level
+    <literal>BEFORE</> <command>UPDATE</command> triggers and all row-level
+    <literal>BEFORE</> <command>DELETE</command>/<command>INSERT</command>
+    triggers are applied on the respective partitions in a way that is
apparent
+    from the final state of the updated row.

How about dropping "it is possible that" from this sentence?

What the statement means is: "It is true that all triggers are
applied on the respective partitions; but it is possible that they are
applied in a way that is apparent from the final state of the updated
row". So "possible" applies to "in a way that is apparent ...". It
means the user should be aware that all the triggers can change the
row, and so the final row will be affected by all those triggers.
Actually, we have a similar statement for UPSERT combined with
triggers in the preceding section; I have taken the statement from
there.

The place where it appears in that sentence made me think it could be
confusing to some. How about reordering the sentences in that paragraph so
that the whole paragraph reads as follows:

If an UPDATE on a partitioned table causes a row to move to another
partition, it will be performed as a DELETE from the original partition
followed by INSERT into the new partition. In this case, all row-level
BEFORE UPDATE triggers and all row-level BEFORE DELETE triggers are fired
on the original partition. Then all row-level BEFORE INSERT triggers are
fired on the destination partition. The possibility of surprising outcomes
should be considered when all these triggers affect the row being moved.
As far as AFTER ROW triggers are concerned, AFTER DELETE and AFTER INSERT
triggers are applied; but AFTER UPDATE triggers are not applied because
the UPDATE has been converted to a DELETE and INSERT. None of the DELETE
and INSERT statement-level triggers are fired, even if row movement
occurs; only the UPDATE triggers of the target table used in the UPDATE
statement will be fired.

Yeah, most of the above makes sense to me. I have kept the phrase "as
far as statement-level triggers are concerned".

Finally, I forgot to mention during the last review that the new parameter
'returning' to ExecDelete() could be called 'process_returning'.

Done, thanks.

Attached updated patch v7 has the above changes.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

update-partition-key_v7.patchapplication/octet-stream; name=update-partition-key_v7.patchDownload
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 340c961..43f5081 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -2986,6 +2986,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition where this row satisfies its partition constraint.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3278,9 +3283,20 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</> causes a row to move from one partition to
+       another, there is a chance that another concurrent <command>UPDATE</> or
+       <command>DELETE</> misses this row. Suppose, during the row movement,
+       the row is still visible for the concurrent session, and it is about to
+       do an <command>UPDATE</> or <command>DELETE</> operation on the same
+       row. This DML operation can silently miss this row if the row now gets
+       deleted from the partition by the first session as part of its
+       <command>UPDATE</> row movement. In such case, the concurrent
+       <command>UPDATE</>/<command>DELETE</>, being unaware of the row
+       movement, interprets that the row has just been deleted so there is
+       nothing to be done for this row. Whereas, in the usual case where the
+       table is not partitioned, or where there is no row movement, the second
+       session would have identified the newly updated row and carried
+       <command>UPDATE</>/<command>DELETE</> on this new row version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 8a1619f..28cfc1a 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</> and
+   <command>INSERT</> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</> or <command>DELETE</> on the same row may miss
+   this row. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index 8f724c8..b0ed167 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -151,6 +151,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it will be performed as a
+    <command>DELETE</command> from the original partition followed by
+    <command>INSERT</command> into the new partition. In this case, all
+    row-level <literal>BEFORE</> <command>UPDATE</command> triggers and all
+    row-level <literal>BEFORE</> <command>DELETE</command> triggers are fired
+    on the original partition. Then all row-level <literal>BEFORE</>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</>
+    triggers are concerned, <literal>AFTER</> <command>DELETE</command> and
+    <literal>AFTER</> <command>INSERT</command> triggers are applied; but
+    <literal>AFTER</> <command>UPDATE</command> triggers are not applied
+    because the <command>UPDATE</command> has been converted to a
+    <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 8c58808..c1ccdc5 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2651,7 +2651,7 @@ CopyFrom(CopyState cstate)
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr ||
 					resultRelInfo->ri_PartitionCheck)
-					ExecConstraints(resultRelInfo, slot, oldslot, estate);
+					ExecConstraints(resultRelInfo, slot, oldslot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 920b120..d4ba965 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1783,7 +1783,7 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
  *
  * Note: This is called *iff* resultRelInfo is the main target table.
  */
-static bool
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
@@ -1820,8 +1820,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if requested,
+ * checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing,
@@ -1831,7 +1831,7 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
 				TupleTableSlot *slot, TupleTableSlot *orig_slot,
-				EState *estate)
+				EState *estate, bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -1918,33 +1918,51 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck &&
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
 		!ExecPartitionCheck(resultRelInfo, slot, estate))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+		ExecPartitionCheckEmitError(resultRelInfo, orig_slot, estate);
+}
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
-		{
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-		}
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ *
+ * 'orig_slot' contains the original tuple to be shown in the error message.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *orig_slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
 
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 orig_slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-		  errmsg("new row for relation \"%s\" violates partition constraint",
-				 RelationGetRelationName(orig_rel)),
-			val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
+
+	/* See the comments in ExecConstraints. */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 orig_slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+	  errmsg("new row for relation \"%s\" violates partition constraint",
+			 RelationGetRelationName(orig_rel)),
+		val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index f20d728..2f76140 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -389,7 +389,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, slot, estate);
+			ExecConstraints(resultRelInfo, slot, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -448,7 +448,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, slot, estate);
+			ExecConstraints(resultRelInfo, slot, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 0b524e0..64e40fe 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -62,7 +62,10 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 EState *estate,
 					 bool canSetTag,
 					 TupleTableSlot **returning);
-
+static void ExecInitPartitionWithCheckOptions(ModifyTableState *mtstate,
+											  Relation root_rel);
+static void ExecInitPartitionReturningProjection(ModifyTableState *mtstate,
+												 Relation root_rel);
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
  * target relation's rowtype
@@ -435,7 +438,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * Check the constraints of the tuple
 		 */
 		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, oldslot, estate);
+			ExecConstraints(resultRelInfo, slot, oldslot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -625,6 +628,8 @@ ExecDelete(ItemPointer tupleid,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool   *concurrently_deleted,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -633,6 +638,9 @@ ExecDelete(ItemPointer tupleid,
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
 
+	if (concurrently_deleted)
+		*concurrently_deleted = false;
+
 	/*
 	 * get information on the (current) result relation
 	 */
@@ -776,6 +784,8 @@ ldelete:;
 					}
 				}
 				/* tuple already deleted; nothing to do */
+				if (concurrently_deleted)
+					*concurrently_deleted = true;
 				return NULL;
 
 			default:
@@ -799,8 +809,8 @@ ldelete:;
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -878,7 +888,8 @@ ldelete:;
  * ----------------------------------------------------------------
  */
 static TupleTableSlot *
-ExecUpdate(ItemPointer tupleid,
+ExecUpdate(ModifyTableState *mtstate,
+		   ItemPointer tupleid,
 		   HeapTuple oldtuple,
 		   TupleTableSlot *slot,
 		   TupleTableSlot *planSlot,
@@ -988,12 +999,90 @@ lreplace:;
 								 resultRelInfo, slot, estate);
 
 		/*
+		 * If a partition check fails, try to move the row into the right
+		 * partition.
+		 */
+		if (resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate))
+		{
+			bool	is_partitioned_table = true;
+
+			if (mtstate->mt_partition_dispatch_info == NULL)
+			{
+				ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+				Relation root_rel;
+
+				/*
+				 * If this is a partitioned table, we need to open the root
+				 * table RT index which is at the head of partitioned_rels
+				 */
+				if (node->partitioned_rels)
+				{
+					Index	root_rti;
+					Oid		root_oid;
+
+					root_rti = linitial_int(node->partitioned_rels);
+					root_oid = getrelid(root_rti, estate->es_range_table);
+					root_rel = heap_open(root_oid, NoLock);	/* locked by InitPlan */
+				}
+				else /* this may be a leaf partition */
+					root_rel = mtstate->resultRelInfo->ri_RelationDesc;
+
+				is_partitioned_table =
+					root_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;
+
+				if (is_partitioned_table)
+					ExecSetupPartitionTupleRouting(
+										root_rel,
+										&mtstate->mt_partition_dispatch_info,
+										&mtstate->mt_partitions,
+										&mtstate->mt_partition_tupconv_maps,
+										&mtstate->mt_partition_tuple_slot,
+										&mtstate->mt_num_dispatch,
+										&mtstate->mt_num_partitions);
+
+				/* Build WITH CHECK OPTION constraints for leaf partitions */
+				ExecInitPartitionWithCheckOptions(mtstate, root_rel);
+
+				/* Build a projection for each leaf partition rel. */
+				ExecInitPartitionReturningProjection(mtstate, root_rel);
+
+				/* Close the root partitioned rel if we opened it above. */
+				if (root_rel != mtstate->resultRelInfo->ri_RelationDesc)
+					heap_close(root_rel, NoLock);
+			}
+
+			if (is_partitioned_table)
+			{
+				bool	concurrently_deleted;
+
+				/*
+				 * Skip RETURNING processing for DELETE. We want to return rows
+				 * from INSERT.
+				 */
+				ExecDelete(tupleid, oldtuple, planSlot, epqstate, estate,
+						   &concurrently_deleted, false, false);
+
+				if (concurrently_deleted)
+					return NULL;
+
+				return ExecInsert(mtstate, slot, planSlot, NULL,
+									  ONCONFLICT_NONE, estate, canSetTag);
+			}
+
+			/* It's not a partitioned table after all; error out. */
+			ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+		}
+
+		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We have already checked partition constraints above, so skip them
+		 * below.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1313,7 +1402,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 	 */
 
 	/* Execute UPDATE with projection */
-	*returning = ExecUpdate(&tuple.t_self, NULL,
+	*returning = ExecUpdate(mtstate, &tuple.t_self, NULL,
 							mtstate->mt_conflproj, planSlot,
 							&mtstate->mt_epqstate, mtstate->ps.state,
 							canSetTag);
@@ -1583,12 +1672,13 @@ ExecModifyTable(ModifyTableState *node)
 								  estate, node->canSetTag);
 				break;
 			case CMD_UPDATE:
-				slot = ExecUpdate(tupleid, oldtuple, slot, planSlot,
+				slot = ExecUpdate(node, tupleid, oldtuple, slot, planSlot,
 								&node->mt_epqstate, estate, node->canSetTag);
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(tupleid, oldtuple, planSlot,
-								&node->mt_epqstate, estate, node->canSetTag);
+								&node->mt_epqstate, estate,
+								NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1790,44 +1880,10 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	}
 
 	/*
-	 * Build WITH CHECK OPTION constraints for each leaf partition rel.
-	 * Note that we didn't build the withCheckOptionList for each partition
-	 * within the planner, but simple translation of the varattnos for each
-	 * partition will suffice.  This only occurs for the INSERT case;
-	 * UPDATE/DELETE cases are handled above.
+	 * Build WITH CHECK OPTION constraints for each leaf partition rel. This
+	 * only occurs for INSERT case; UPDATE/DELETE are handled above.
 	 */
-	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
-	{
-		List		*wcoList;
-
-		Assert(operation == CMD_INSERT);
-		resultRelInfo = mtstate->mt_partitions;
-		wcoList = linitial(node->withCheckOptionLists);
-		for (i = 0; i < mtstate->mt_num_partitions; i++)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-			List	   *mapped_wcoList;
-			List	   *wcoExprs = NIL;
-			ListCell   *ll;
-
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel);
-			foreach(ll, mapped_wcoList)
-			{
-				WithCheckOption *wco = (WithCheckOption *) lfirst(ll);
-				ExprState  *wcoExpr = ExecInitQual((List *) wco->qual,
-												   mtstate->mt_plans[i]);
-
-				wcoExprs = lappend(wcoExprs, wcoExpr);
-			}
-
-			resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
-			resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
-			resultRelInfo++;
-		}
-	}
+	ExecInitPartitionWithCheckOptions(mtstate, rel);
 
 	/*
 	 * Initialize RETURNING projections if needed.
@@ -1836,7 +1892,6 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -1870,28 +1925,10 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		}
 
 		/*
-		 * Build a projection for each leaf partition rel.  Note that we
-		 * didn't build the returningList for each partition within the
-		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * Build a projection for each leaf partition rel. This only occurs for
+		 * the INSERT case; UPDATE/DELETE are handled above.
 		 */
-		resultRelInfo = mtstate->mt_partitions;
-		returningList = linitial(node->returningLists);
-		for (i = 0; i < mtstate->mt_num_partitions; i++)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-			List	   *rlist;
-
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel);
-			resultRelInfo->ri_projectReturning =
-				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
-									 resultRelInfo->ri_RelationDesc->rd_att);
-			resultRelInfo++;
-		}
+		ExecInitPartitionReturningProjection(mtstate, rel);
 	}
 	else
 	{
@@ -2118,6 +2155,104 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 }
 
 /* ----------------------------------------------------------------
+ *		ExecInitPartitionWithCheckOptions
+ *
+ * Build WITH CHECK OPTION constraints for each leaf partition rel.
+ * Note that we don't build the withCheckOptionList for each partition
+ * within the planner, but simple translation of the varattnos for each
+ * partition suffices. This only occurs for the INSERT case; UPDATE/DELETE
+ * cases are handled separately.
+ * ----------------------------------------------------------------
+ */
+
+static void
+ExecInitPartitionWithCheckOptions(ModifyTableState *mtstate, Relation root_rel)
+{
+	ResultRelInfo  *resultRelInfo = mtstate->mt_partitions;
+	ModifyTable	   *node = (ModifyTable *) mtstate->ps.plan;
+	List		*wcoList;
+	int			i;
+
+	if (node->withCheckOptionLists == NIL || mtstate->mt_num_partitions == 0)
+		return;
+
+	wcoList = linitial(node->withCheckOptionLists);
+	for (i = 0; i < mtstate->mt_num_partitions; i++)
+	{
+		Relation	partrel = resultRelInfo->ri_RelationDesc;
+		List	   *mapped_wcoList;
+		List	   *wcoExprs = NIL;
+		ListCell   *ll;
+
+		/* varno = node->nominalRelation */
+		mapped_wcoList = map_partition_varattnos(wcoList,
+												 node->nominalRelation,
+												 partrel, root_rel);
+		foreach(ll, mapped_wcoList)
+		{
+			WithCheckOption *wco = (WithCheckOption *) lfirst(ll);
+			ExprState  *wcoExpr = ExecInitQual((List *) wco->qual,
+										   mtstate->mt_plans[i]);
+
+			wcoExprs = lappend(wcoExprs, wcoExpr);
+		}
+
+		resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
+		resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
+		resultRelInfo++;
+	}
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitPartitionReturningProjection
+ *
+ * Initialize stuff required to handle RETURNING for leaf partitions.
+ * We don't build the returningList for each partition within the planner, but
+ * simple translation of the varattnos for each partition suffices.  This
+ * actually is helpful only for INSERT case; UPDATE/DELETE are handled
+ * differently.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecInitPartitionReturningProjection(ModifyTableState *mtstate, Relation root_rel)
+{
+	ResultRelInfo  *resultRelInfo = mtstate->mt_partitions;
+	ModifyTable	   *node = (ModifyTable *) mtstate->ps.plan;
+	TupleTableSlot *returning_slot = mtstate->ps.ps_ResultTupleSlot;
+	List		   *returningList;
+	int				i;
+
+	/*
+	 * If there is no returning clause, or if we have already initialized the
+	 * returning projection info, there is nothing to be done.
+	 */
+	if (node->returningLists == NIL ||
+		(resultRelInfo && resultRelInfo->ri_projectReturning != NULL) ||
+		mtstate->mt_num_partitions == 0)
+		return;
+
+	returningList = linitial(node->returningLists);
+	for (i = 0; i < mtstate->mt_num_partitions; i++)
+	{
+		Relation	partrel = resultRelInfo->ri_RelationDesc;
+		List	   *rlist;
+
+		/* varno = node->nominalRelation */
+		rlist = map_partition_varattnos(returningList,
+										node->nominalRelation,
+										partrel, root_rel);
+		resultRelInfo->ri_projectReturning =
+			ExecBuildProjectionInfo(rlist,
+									mtstate->ps.ps_ExprContext,
+									returning_slot,
+									&mtstate->ps,
+									resultRelInfo->ri_RelationDesc->rd_att);
+		resultRelInfo++;
+	}
+}
+
+
+/* ----------------------------------------------------------------
  *		ExecEndModifyTable
  *
  *		Shuts down the plan.
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index d3849b9..102fc97 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,7 +187,9 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
 				TupleTableSlot *slot, TupleTableSlot *orig_slot,
-				EState *estate);
+				EState *estate, bool check_partition_constraint);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *orig_slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
@@ -216,6 +218,9 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate);
 
 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
 extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 9366f04..a56afab 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,25 +198,121 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c int
 ) partition by range (a, b);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE c > 120 WITH CHECK OPTION;
 create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
 create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
+create table part_c_1_100 partition of part_b_10_b_20 for values from (1) to (100);
+create table part_c_100_200 partition of part_b_10_b_20 for values from (100) to (200);
 insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+insert into part_a_10_a_20 values ('a', 10, 200);
+insert into part_c_1_100 values ('b', 12, 96);
+insert into part_c_1_100 values ('b', 13, 97);
+insert into part_c_100_200 values ('b', 15, 105);
+insert into part_c_100_200 values ('b', 17, 105);
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96; -- No row found :
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (b, 12, 116).
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 ;
+select * from part_c_1_100 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select * from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 12 | 116
+ b | 13 | 117
+ b | 15 | 125
+ b | 17 | 125
+(4 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (b, 7, 117).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+select * from part_a_1_a_10 order by 1, 2, 3;
+ a | b |  c  
+---+---+-----
+ a | 1 |    
+ a | 4 | 200
+(2 rows)
+
+select * from part_a_10_a_20 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select * from part_b_1_b_10 order by 1, 2, 3;
+ a | b |  c  
+---+---+-----
+ b | 7 | 117
+ b | 9 | 125
+(2 rows)
+
+select * from part_c_1_100 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select * from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 11 | 125
+ b | 12 | 116
+(2 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120).
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+select * from part_a_1_a_10 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+ a | 1 |  
+(1 row)
+
+select * from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 11 | 125
+ b | 12 | 116
+ b | 15 | 199
+(3 rows)
+
 -- cleanup
+drop view upview;
 drop table range_parted;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 6637119..cda9906 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,23 +107,61 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c int
 ) partition by range (a, b);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE c > 120 WITH CHECK OPTION;
+
 create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
 create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
+create table part_c_1_100 partition of part_b_10_b_20 for values from (1) to (100);
+create table part_c_100_200 partition of part_b_10_b_20 for values from (100) to (200);
 insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
-
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
-
+insert into part_a_10_a_20 values ('a', 10, 200);
+insert into part_c_1_100 values ('b', 12, 96);
+insert into part_c_1_100 values ('b', 13, 97);
+insert into part_c_100_200 values ('b', 15, 105);
+insert into part_c_100_200 values ('b', 17, 105);
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96; -- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 ;
+select * from part_c_1_100 order by 1, 2, 3;
+select * from part_c_100_200 order by 1, 2, 3;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+select * from part_a_1_a_10 order by 1, 2, 3;
+select * from part_a_10_a_20 order by 1, 2, 3;
+select * from part_b_1_b_10 order by 1, 2, 3;
+select * from part_c_1_100 order by 1, 2, 3;
+select * from part_c_100_200 order by 1, 2, 3;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+select * from part_a_1_a_10 order by 1, 2, 3;
+select * from part_c_100_200 order by 1, 2, 3;
 -- cleanup
+drop view upview;
 drop table range_parted;
#41Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Khandekar (#40)
Re: UPDATE of partition key

Hi Amit,

On 2017/04/04 20:11, Amit Khandekar wrote:

On 3 April 2017 at 17:13, Amit Langote wrote:

On 31 March 2017 at 14:04, Amit Langote wrote:

How about something like:
For an <command>UPDATE</> that causes a row to move from one partition to
another due to the partition key being updated, the following caveats exist:
<a brief description of the possibility of surprising results in the
presence of concurrent manipulation of the row in question>

Now with the slightly changed doc structuring for partitioning in
latest master, I have described in the end of section "5.10.2.
Declarative Partitioning" this note :

---

"Updating the partition key of a row might cause it to be moved into a
different partition where this row satisfies its partition
constraint."

---

And then in the Limitations section, I have replaced the earlier
can't-update-partition-key limitation with this new limitation as
below :

"When an UPDATE causes a row to move from one partition to another,
there is a chance that another concurrent UPDATE or DELETE misses this
row. Suppose, during the row movement, the row is still visible for
the concurrent session, and it is about to do an UPDATE or DELETE
operation on the same row. This DML operation can silently miss this
row if the row now gets deleted from the partition by the first
session as part of its UPDATE row movement. In such case, the
concurrent UPDATE/DELETE, being unaware of the row movement,
interprets that the row has just been deleted so there is nothing to
be done for this row. Whereas, in the usual case where the table is
not partitioned, or where there is no row movement, the second session
would have identified the newly updated row and carried UPDATE/DELETE
on this new row version."

---

OK.
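
The scenario described in that limitation text can be written out as a two-session transcript; the session boundaries and row values below are made up for illustration, using the table names from the patch's regression test, and assume the patch is applied:

```sql
-- Session A: moves the row across partitions (internally a DELETE
-- from part_c_1_100 followed by an INSERT into part_c_100_200).
BEGIN;
UPDATE range_parted SET c = c + 50 WHERE a = 'b' AND b = 12;

-- Session B: tries to update the same row and blocks on A's row lock.
UPDATE range_parted SET b = b + 1 WHERE a = 'b' AND b = 12;

-- Session A:
COMMIT;

-- Session B now sees only the DELETE half of the move: it reports
-- "UPDATE 0" and silently skips the row, even though a newer version
-- of that row exists in part_c_100_200.
```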

Further, in the Notes section of update.sgml, I have kept a link to
the above limitations section like this :

"In the case of a partitioned table, updating a row might cause it to
no longer satisfy the partition constraint of the containing
partition. In that case, if there is some other partition in the
partition tree for which this row satisfies its partition constraint,
then the row is moved to that partition. If there isn't such a
partition, an error will occur. The error will also occur when
updating a partition directly. Behind the scenes, the row movement is
actually a DELETE and INSERT operation. However, there is a
possibility that a concurrent UPDATE or DELETE on the same row may
miss this row. For details, see Section 5.10.2.3."

OK, too. It seems to me that the details in 5.10.2.3 provide more or less
the same information as "concurrent UPDATE or DELETE looking at the moved
row will miss this row", but maybe that's fine.

If an UPDATE on a partitioned table causes a row to move to another
partition, it will be performed as a DELETE from the original partition
followed by INSERT into the new partition. In this case, all row-level
BEFORE UPDATE triggers and all row-level BEFORE DELETE triggers are fired
on the original partition. Then all row-level BEFORE INSERT triggers are
fired on the destination partition. The possibility of surprising outcomes
should be considered when all these triggers affect the row being moved.
As far as AFTER ROW triggers are concerned, AFTER DELETE and AFTER INSERT
triggers are applied; but AFTER UPDATE triggers are not applied because
the UPDATE has been converted to a DELETE and INSERT. None of the DELETE
and INSERT statement-level triggers are fired, even if row movement
occurs; only the UPDATE triggers of the target table used in the UPDATE
statement will be fired.

Yeah, most of the above makes sense to me. I have kept the phrase "as
far as statement-level triggers are concerned".

OK, sure.
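
The firing order described above could be verified with a sketch along these lines (the trigger and function names are made up, and this assumes the patch's DELETE-plus-INSERT row movement):

```sql
CREATE FUNCTION trig_notice() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
  RAISE NOTICE '% % on %', TG_WHEN, TG_OP, TG_TABLE_NAME;
  IF TG_OP = 'DELETE' THEN
    RETURN OLD;
  END IF;
  RETURN NEW;
END;
$$;

CREATE TRIGGER br_upd BEFORE UPDATE ON part_c_1_100
  FOR EACH ROW EXECUTE PROCEDURE trig_notice();
CREATE TRIGGER br_del BEFORE DELETE ON part_c_1_100
  FOR EACH ROW EXECUTE PROCEDURE trig_notice();
CREATE TRIGGER br_ins BEFORE INSERT ON part_c_100_200
  FOR EACH ROW EXECUTE PROCEDURE trig_notice();

-- An UPDATE that moves a row from part_c_1_100 to part_c_100_200
-- should then emit, in order:
--   NOTICE: BEFORE UPDATE on part_c_1_100
--   NOTICE: BEFORE DELETE on part_c_1_100
--   NOTICE: BEFORE INSERT on part_c_100_200
```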

Finally, I forgot to mention during the last review that the new parameter
'returning' to ExecDelete() could be called 'process_returning'.

Done, thanks.

Attached updated patch v7 has the above changes.

Marked as ready for committer.

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#42Robert Haas
robertmhaas@gmail.com
In reply to: Amit Langote (#41)
Re: UPDATE of partition key

On Wed, Apr 5, 2017 at 5:54 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

Marked as ready for committer.

Andres seems to have changed the status of this patch to "Needs
review" and then, 30 seconds later, to "Waiting on author", but
there's no actual email on the thread explaining what his concerns
were. I'm going to set this back to "Ready for Committer" and push it
out to the next CommitFest. I think this would be a great feature,
but I think it's not entirely clear that we have consensus on the
design, so let's revisit it for next release.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#43Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#42)
Re: UPDATE of partition key

On 2017-04-07 13:55:51 -0400, Robert Haas wrote:

On Wed, Apr 5, 2017 at 5:54 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

Marked as ready for committer.

Andres seems to have changed the status of this patch to "Needs
review" and then, 30 seconds later, to "Waiting on author"
there's no actual email on the thread explaining what his concerns
were. I'm going to set this back to "Ready for Committer" and push it
out to the next CommitFest. I think this would be a great feature,
but I think it's not entirely clear that we have consensus on the
design, so let's revisit it for next release.

I was kind of looking for the appropriate status of "not entirely clear
that we have consensus on the design" - which isn't really
ready-for-committer, but not waiting-on-author either...

Greetings,

Andres Freund


#44Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#40)
Re: UPDATE of partition key

On Tue, Apr 4, 2017 at 7:11 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Attached updated patch v7 has the above changes.

This no longer applies. Please rebase.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#45Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#44)
1 attachment(s)
Re: UPDATE of partition key

On 2 May 2017 at 18:17, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Apr 4, 2017 at 7:11 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Attached updated patch v7 has the above changes.

This no longer applies. Please rebase.

Thanks, Robert, for pointing this out.

My patch has a separate function for emitting an error message when a
partition constraint fails, and the recent commit c0a8ae7be3 changed
the way the tuple is formed for display in that error message. Hence
there were some code-level conflicts.

Attached is the rebased patch, which resolves the above conflicts.

Attachments:

update-partition-key_v7_rebased.patch (application/octet-stream)
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 84c4f20..b3b1816 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -2992,6 +2992,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition where this row satisfies its partition constraint.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3284,9 +3289,20 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</> causes a row to move from one partition to
+       another, there is a chance that another concurrent <command>UPDATE</> or
+       <command>DELETE</> misses this row. Suppose, during the row movement,
+       the row is still visible for the concurrent session, and it is about to
+       do an <command>UPDATE</> or <command>DELETE</> operation on the same
+       row. This DML operation can silently miss this row if the row now gets
+       deleted from the partition by the first session as part of its
+       <command>UPDATE</> row movement. In such case, the concurrent
+       <command>UPDATE</>/<command>DELETE</>, being unaware of the row
+       movement, interprets that the row has just been deleted so there is
+       nothing to be done for this row. Whereas, in the usual case where the
+       table is not partitioned, or where there is no row movement, the second
+       session would have identified the newly updated row and carried
+       <command>UPDATE</>/<command>DELETE</> on this new row version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 8a1619f..28cfc1a 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</> and
+   <command>INSERT</> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</> or <command>DELETE</> on the same row may miss
+   this row. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index 6f8416d..97f9317 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -152,6 +152,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it will be performed as a
+    <command>DELETE</command> from the original partition followed by
+    <command>INSERT</command> into the new partition. In this case, all
+    row-level <literal>BEFORE</> <command>UPDATE</command> triggers and all
+    row-level <literal>BEFORE</> <command>DELETE</command> triggers are fired
+    on the original partition. Then all row-level <literal>BEFORE</>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</>
+    triggers are concerned, <literal>AFTER</> <command>DELETE</command> and
+    <literal>AFTER</> <command>INSERT</command> triggers are applied; but
+    <literal>AFTER</> <command>UPDATE</command> triggers are not applied
+    because the <command>UPDATE</command> has been converted to a
+    <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index bcaa58c..46b1380 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2643,7 +2643,7 @@ CopyFrom(CopyState cstate)
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr ||
 					resultRelInfo->ri_PartitionCheck)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index cdb1a6a..9303404 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1828,7 +1828,7 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
  *
  * Note: This is called *iff* resultRelInfo is the main target table.
  */
-static bool
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
@@ -1865,8 +1865,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if requested,
+ * checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1874,7 +1874,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -1987,45 +1988,61 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck &&
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
 		!ExecPartitionCheck(resultRelInfo, slot, estate))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+}
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
+
+
+	/* See the comments in ExecConstraints. */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap	*map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
+					gettext_noop("could not convert row type"));
+		if (map != NULL)
 		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap	*map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
-						gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+			tuple = do_convert_tuple(tuple, map);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-		  errmsg("new row for relation \"%s\" violates partition constraint",
-				 RelationGetRelationName(orig_rel)),
-			val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+	  errmsg("new row for relation \"%s\" violates partition constraint",
+			 RelationGetRelationName(orig_rel)),
+		val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 327a0ba..7df74bf 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -389,7 +389,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -448,7 +448,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 652cd97..dd53377 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -62,7 +62,10 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 EState *estate,
 					 bool canSetTag,
 					 TupleTableSlot **returning);
-
+static void ExecInitPartitionWithCheckOptions(ModifyTableState *mtstate,
+											  Relation root_rel);
+static void ExecInitPartitionReturningProjection(ModifyTableState *mtstate,
+												 Relation root_rel);
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
  * target relation's rowtype
@@ -434,7 +437,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * Check the constraints of the tuple
 		 */
 		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -624,6 +627,8 @@ ExecDelete(ItemPointer tupleid,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool   *concurrently_deleted,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -632,6 +637,9 @@ ExecDelete(ItemPointer tupleid,
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
 
+	if (concurrently_deleted)
+		*concurrently_deleted = false;
+
 	/*
 	 * get information on the (current) result relation
 	 */
@@ -775,6 +783,8 @@ ldelete:;
 					}
 				}
 				/* tuple already deleted; nothing to do */
+				if (concurrently_deleted)
+					*concurrently_deleted = true;
 				return NULL;
 
 			default:
@@ -798,8 +808,8 @@ ldelete:;
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -877,7 +887,8 @@ ldelete:;
  * ----------------------------------------------------------------
  */
 static TupleTableSlot *
-ExecUpdate(ItemPointer tupleid,
+ExecUpdate(ModifyTableState *mtstate,
+		   ItemPointer tupleid,
 		   HeapTuple oldtuple,
 		   TupleTableSlot *slot,
 		   TupleTableSlot *planSlot,
@@ -987,12 +998,90 @@ lreplace:;
 								 resultRelInfo, slot, estate);
 
 		/*
+		 * If a partition check fails, try to move the row into the right
+		 * partition.
+		 */
+		if (resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate))
+		{
+			bool	is_partitioned_table = true;
+
+			if (mtstate->mt_partition_dispatch_info == NULL)
+			{
+				ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+				Relation root_rel;
+
+				/*
+				 * If this is a partitioned table, we need to open the root
+				 * table RT index which is at the head of partitioned_rels
+				 */
+				if (node->partitioned_rels)
+				{
+					Index	root_rti;
+					Oid		root_oid;
+
+					root_rti = linitial_int(node->partitioned_rels);
+					root_oid = getrelid(root_rti, estate->es_range_table);
+					root_rel = heap_open(root_oid, NoLock);	/* locked by InitPlan */
+				}
+				else /* this may be a leaf partition */
+					root_rel = mtstate->resultRelInfo->ri_RelationDesc;
+
+				is_partitioned_table =
+					root_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;
+
+				if (is_partitioned_table)
+					ExecSetupPartitionTupleRouting(
+										root_rel,
+										&mtstate->mt_partition_dispatch_info,
+										&mtstate->mt_partitions,
+										&mtstate->mt_partition_tupconv_maps,
+										&mtstate->mt_partition_tuple_slot,
+										&mtstate->mt_num_dispatch,
+										&mtstate->mt_num_partitions);
+
+				/* Build WITH CHECK OPTION constraints for leaf partitions */
+				ExecInitPartitionWithCheckOptions(mtstate, root_rel);
+
+				/* Build a projection for each leaf partition rel. */
+				ExecInitPartitionReturningProjection(mtstate, root_rel);
+
+				/* Close the root partitioned rel if we opened it above. */
+				if (root_rel != mtstate->resultRelInfo->ri_RelationDesc)
+					heap_close(root_rel, NoLock);
+			}
+
+			if (is_partitioned_table)
+			{
+				bool	concurrently_deleted;
+
+				/*
+				 * Skip RETURNING processing for DELETE. We want to return rows
+				 * from INSERT.
+				 */
+				ExecDelete(tupleid, oldtuple, planSlot, epqstate, estate,
+						   &concurrently_deleted, false, false);
+
+				if (concurrently_deleted)
+					return NULL;
+
+				return ExecInsert(mtstate, slot, planSlot, NULL,
+									  ONCONFLICT_NONE, estate, canSetTag);
+			}
+
+			/* It's not a partitioned table after all; error out. */
+			ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+		}
+
+		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We have already checked partition constraints above, so skip them
+		 * below.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1312,7 +1401,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 	 */
 
 	/* Execute UPDATE with projection */
-	*returning = ExecUpdate(&tuple.t_self, NULL,
+	*returning = ExecUpdate(mtstate, &tuple.t_self, NULL,
 							mtstate->mt_conflproj, planSlot,
 							&mtstate->mt_epqstate, mtstate->ps.state,
 							canSetTag);
@@ -1602,12 +1691,13 @@ ExecModifyTable(ModifyTableState *node)
 								  estate, node->canSetTag);
 				break;
 			case CMD_UPDATE:
-				slot = ExecUpdate(tupleid, oldtuple, slot, planSlot,
+				slot = ExecUpdate(node, tupleid, oldtuple, slot, planSlot,
 								&node->mt_epqstate, estate, node->canSetTag);
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(tupleid, oldtuple, planSlot,
-								&node->mt_epqstate, estate, node->canSetTag);
+								&node->mt_epqstate, estate,
+								NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1815,44 +1905,10 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	}
 
 	/*
-	 * Build WITH CHECK OPTION constraints for each leaf partition rel.
-	 * Note that we didn't build the withCheckOptionList for each partition
-	 * within the planner, but simple translation of the varattnos for each
-	 * partition will suffice.  This only occurs for the INSERT case;
-	 * UPDATE/DELETE cases are handled above.
+	 * Build WITH CHECK OPTION constraints for each leaf partition rel. This
+	 * only occurs for INSERT case; UPDATE/DELETE are handled above.
 	 */
-	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
-	{
-		List		*wcoList;
-
-		Assert(operation == CMD_INSERT);
-		resultRelInfo = mtstate->mt_partitions;
-		wcoList = linitial(node->withCheckOptionLists);
-		for (i = 0; i < mtstate->mt_num_partitions; i++)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-			List	   *mapped_wcoList;
-			List	   *wcoExprs = NIL;
-			ListCell   *ll;
-
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel);
-			foreach(ll, mapped_wcoList)
-			{
-				WithCheckOption *wco = (WithCheckOption *) lfirst(ll);
-				ExprState  *wcoExpr = ExecInitQual((List *) wco->qual,
-												   mtstate->mt_plans[i]);
-
-				wcoExprs = lappend(wcoExprs, wcoExpr);
-			}
-
-			resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
-			resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
-			resultRelInfo++;
-		}
-	}
+	ExecInitPartitionWithCheckOptions(mtstate, rel);
 
 	/*
 	 * Initialize RETURNING projections if needed.
@@ -1861,7 +1917,6 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -1895,28 +1950,10 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		}
 
 		/*
-		 * Build a projection for each leaf partition rel.  Note that we
-		 * didn't build the returningList for each partition within the
-		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * Build a projection for each leaf partition rel. This only occurs for
+		 * the INSERT case; UPDATE/DELETE are handled above.
 		 */
-		resultRelInfo = mtstate->mt_partitions;
-		returningList = linitial(node->returningLists);
-		for (i = 0; i < mtstate->mt_num_partitions; i++)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-			List	   *rlist;
-
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel);
-			resultRelInfo->ri_projectReturning =
-				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
-									 resultRelInfo->ri_RelationDesc->rd_att);
-			resultRelInfo++;
-		}
+		ExecInitPartitionReturningProjection(mtstate, rel);
 	}
 	else
 	{
@@ -2143,6 +2180,104 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 }
 
 /* ----------------------------------------------------------------
+ *		ExecInitPartitionWithCheckOptions
+ *
+ * Build WITH CHECK OPTION constraints for each leaf partition rel.
+ * Note that we don't build the withCheckOptionList for each partition
+ * within the planner, but simple translation of the varattnos for each
+ * partition suffices. This only occurs for the INSERT case; UPDATE/DELETE
+ * cases are handled separately.
+ * ----------------------------------------------------------------
+ */
+
+static void
+ExecInitPartitionWithCheckOptions(ModifyTableState *mtstate, Relation root_rel)
+{
+	ResultRelInfo  *resultRelInfo = mtstate->mt_partitions;
+	ModifyTable	   *node = (ModifyTable *) mtstate->ps.plan;
+	List		*wcoList;
+	int			i;
+
+	if (node->withCheckOptionLists == NIL || mtstate->mt_num_partitions == 0)
+		return;
+
+	wcoList = linitial(node->withCheckOptionLists);
+	for (i = 0; i < mtstate->mt_num_partitions; i++)
+	{
+		Relation	partrel = resultRelInfo->ri_RelationDesc;
+		List	   *mapped_wcoList;
+		List	   *wcoExprs = NIL;
+		ListCell   *ll;
+
+		/* varno = node->nominalRelation */
+		mapped_wcoList = map_partition_varattnos(wcoList,
+												 node->nominalRelation,
+												 partrel, root_rel);
+		foreach(ll, mapped_wcoList)
+		{
+			WithCheckOption *wco = (WithCheckOption *) lfirst(ll);
+			ExprState  *wcoExpr = ExecInitQual((List *) wco->qual,
+										   mtstate->mt_plans[i]);
+
+			wcoExprs = lappend(wcoExprs, wcoExpr);
+		}
+
+		resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
+		resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
+		resultRelInfo++;
+	}
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitPartitionReturningProjection
+ *
+ * Initialize stuff required to handle RETURNING for leaf partitions.
+ * We don't build the returningList for each partition within the planner, but
+ * simple translation of the varattnos for each partition suffices.  This
+ * actually is helpful only for INSERT case; UPDATE/DELETE are handled
+ * differently.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecInitPartitionReturningProjection(ModifyTableState *mtstate, Relation root_rel)
+{
+	ResultRelInfo  *resultRelInfo = mtstate->mt_partitions;
+	ModifyTable	   *node = (ModifyTable *) mtstate->ps.plan;
+	TupleTableSlot *returning_slot = mtstate->ps.ps_ResultTupleSlot;
+	List		   *returningList;
+	int				i;
+
+	/*
+	 * If there is no returning clause, or if we have already initialized the
+	 * returning projection info, there is nothing to be done.
+	 */
+	if (node->returningLists == NIL ||
+		(resultRelInfo && resultRelInfo->ri_projectReturning != NULL) ||
+		mtstate->mt_num_partitions == 0)
+		return;
+
+	returningList = linitial(node->returningLists);
+	for (i = 0; i < mtstate->mt_num_partitions; i++)
+	{
+		Relation	partrel = resultRelInfo->ri_RelationDesc;
+		List	   *rlist;
+
+		/* varno = node->nominalRelation */
+		rlist = map_partition_varattnos(returningList,
+										node->nominalRelation,
+										partrel, root_rel);
+		resultRelInfo->ri_projectReturning =
+			ExecBuildProjectionInfo(rlist,
+									mtstate->ps.ps_ExprContext,
+									returning_slot,
+									&mtstate->ps,
+									resultRelInfo->ri_RelationDesc->rd_att);
+		resultRelInfo++;
+	}
+}
+
+
+/* ----------------------------------------------------------------
  *		ExecEndModifyTable
  *
  *		Shuts down the plan.
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 3107cf5..cead8eb 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -186,7 +186,10 @@ extern void InitResultRelInfo(ResultRelInfo *resultRelInfo,
 extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
@@ -215,6 +218,9 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate);
 
 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
 extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 9366f04..a56afab 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,25 +198,121 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c int
 ) partition by range (a, b);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE c > 120 WITH CHECK OPTION;
 create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
 create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
+create table part_c_1_100 partition of part_b_10_b_20 for values from (1) to (100);
+create table part_c_100_200 partition of part_b_10_b_20 for values from (100) to (200);
 insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+insert into part_a_10_a_20 values ('a', 10, 200);
+insert into part_c_1_100 values ('b', 12, 96);
+insert into part_c_1_100 values ('b', 13, 97);
+insert into part_c_100_200 values ('b', 15, 105);
+insert into part_c_100_200 values ('b', 17, 105);
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96; -- No row found :
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (b, 12, 116).
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 ;
+select * from part_c_1_100 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select * from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 12 | 116
+ b | 13 | 117
+ b | 15 | 125
+ b | 17 | 125
+(4 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (b, 7, 117).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+select * from part_a_1_a_10 order by 1, 2, 3;
+ a | b |  c  
+---+---+-----
+ a | 1 |    
+ a | 4 | 200
+(2 rows)
+
+select * from part_a_10_a_20 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select * from part_b_1_b_10 order by 1, 2, 3;
+ a | b |  c  
+---+---+-----
+ b | 7 | 117
+ b | 9 | 125
+(2 rows)
+
+select * from part_c_1_100 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select * from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 11 | 125
+ b | 12 | 116
+(2 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120).
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+select * from part_a_1_a_10 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+ a | 1 |  
+(1 row)
+
+select * from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 11 | 125
+ b | 12 | 116
+ b | 15 | 199
+(3 rows)
+
 -- cleanup
+drop view upview;
 drop table range_parted;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 6637119..cda9906 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,23 +107,61 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c int
 ) partition by range (a, b);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE c > 120 WITH CHECK OPTION;
+
 create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
 create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
+create table part_c_1_100 partition of part_b_10_b_20 for values from (1) to (100);
+create table part_c_100_200 partition of part_b_10_b_20 for values from (100) to (200);
 insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
-
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
-
+insert into part_a_10_a_20 values ('a', 10, 200);
+insert into part_c_1_100 values ('b', 12, 96);
+insert into part_c_1_100 values ('b', 13, 97);
+insert into part_c_100_200 values ('b', 15, 105);
+insert into part_c_100_200 values ('b', 17, 105);
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96; -- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 ;
+select * from part_c_1_100 order by 1, 2, 3;
+select * from part_c_100_200 order by 1, 2, 3;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+select * from part_a_1_a_10 order by 1, 2, 3;
+select * from part_a_10_a_20 order by 1, 2, 3;
+select * from part_b_1_b_10 order by 1, 2, 3;
+select * from part_c_1_100 order by 1, 2, 3;
+select * from part_c_100_200 order by 1, 2, 3;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+select * from part_a_1_a_10 order by 1, 2, 3;
+select * from part_c_100_200 order by 1, 2, 3;
 -- cleanup
+drop view upview;
 drop table range_parted;
#46Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Khandekar (#31)
Re: UPDATE of partition key

On Fri, Mar 17, 2017 at 4:07 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 4 March 2017 at 12:49, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 2, 2017 at 11:53 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

I think it does not make sense to run after-row triggers in case of
row movement, since no update has actually happened on that leaf
partition. This reasoning could also apply to BR update triggers, but
the reasons for having a BR trigger and AR triggers are quite
different. Generally, a user needs to make some modifications to the
row before the final NEW row gets into the database, and hence defines
a BR trigger for that. And we can't just silently skip this step only
because the final row went into some other partition; in fact, the
row movement itself might depend on what the BR trigger did with the
row. AR triggers, on the other hand, are typically written to perform
some other operation once it is certain the row has actually been
updated. In case of row movement, it is not actually updated.

How about running the BR update triggers for the old partition and the
AR update triggers for the new partition? It seems weird to run BR
update triggers but not AR update triggers. Another option would be
to run BR and AR delete triggers and then BR and AR insert triggers,
emphasizing the choice to treat this update as a delete + insert, but
(as Amit Kh. pointed out to me when we were in a room together this
week) that precludes using the BEFORE trigger to modify the row.

I also find the current behavior with respect to triggers quite odd.
The two points that appear odd are: (a) executing both before-row
update and delete triggers on the original partition sounds quite odd;
(b) it seems inconsistent to consider the behavior of row and
statement triggers differently.

I checked the trigger behaviour in case of UPSERT. Here, when there is
conflict found, ExecOnConflictUpdate() is called, and then the
function returns immediately, which means AR INSERT trigger will not
fire. And ExecOnConflictUpdate() calls ExecUpdate(), which means BR
and AR UPDATE triggers will be fired. So in short, when an INSERT
becomes an UPDATE, BR INSERT triggers do fire, but then the BR UPDATE
and AR UPDATE also get fired. On the same lines, it makes sense in
case of UPDATE=>DELETE+INSERT operation to fire a BR UPDATE trigger on
the original table, and then the BR and AR DELETE/INSERT triggers on
the respective tables.

I am not sure if it is a good idea to compare it with "Insert On
Conflict Do Update", but even if we go that way, I think Insert On
Conflict is consistent in statement-level triggers, which means it
will fire both Insert and Update statement-level triggers (as per the
note in the docs below), whereas the documentation in the patch
indicates that this patch will only fire Update statement-level
triggers, which is odd.

Note in the docs about Insert On Conflict:
"Note that with an INSERT with an ON CONFLICT DO UPDATE clause, both
INSERT and UPDATE statement level triggers will be fired."

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#47Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Khandekar (#45)
Re: UPDATE of partition key

On Wed, May 3, 2017 at 11:22 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 2 May 2017 at 18:17, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Apr 4, 2017 at 7:11 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Attached updated patch v7 has the above changes.

Attached is the rebased patch, which resolves the above conflicts.

Few comments:
1.
Operating directly on a partition doesn't allow the update to move the row.
Refer to the example below:
create table t1(c1 int) partition by range(c1);
create table t1_part_1 partition of t1 for values from (1) to (100);
create table t1_part_2 partition of t1 for values from (100) to (200);
insert into t1 values(generate_series(1,11));
insert into t1 values(generate_series(110,120));

postgres=# update t1_part_1 set c1=122 where c1=11;
ERROR: new row for relation "t1_part_1" violates partition constraint
DETAIL: Failing row contains (122).

2.
-
+static void ExecInitPartitionWithCheckOptions(ModifyTableState *mtstate,
+  Relation root_rel);

Spurious line delete.

3.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur.

Doesn't this error case indicate that this needs to be integrated
with Rahila's default partition patch, or that that patch needs to
take care of this error case? Basically, if there is no matching
partition, then move the row to the default partition.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#48Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#47)
Re: UPDATE of partition key

On Thu, May 11, 2017 at 7:54 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Few comments:
1.
Operating directly on a partition doesn't allow the update to move the row.
Refer to the example below:
create table t1(c1 int) partition by range(c1);
create table t1_part_1 partition of t1 for values from (1) to (100);
create table t1_part_2 partition of t1 for values from (100) to (200);
insert into t1 values(generate_series(1,11));
insert into t1 values(generate_series(110,120));

postgres=# update t1_part_1 set c1=122 where c1=11;
ERROR: new row for relation "t1_part_1" violates partition constraint
DETAIL: Failing row contains (122).

I think that's correct behavior.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#49Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Kapila (#46)
Re: UPDATE of partition key

On 11 May 2017 at 17:23, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Mar 17, 2017 at 4:07 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 4 March 2017 at 12:49, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 2, 2017 at 11:53 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

I think it does not make sense to run after row triggers in case of
row movement, since no update has actually happened on that leaf
partition. This reasoning could also apply to BR update triggers, but
the reasons for having BR and AR triggers are quite different.
Generally, a user needs to make some modifications to the row before
the final NEW row gets into the database, and hence [s]he defines a BR
trigger for that. And we can't just silently skip this step only
because the final row went into some other partition; in fact, the
row movement itself might depend on what the BR trigger did with the
row. Whereas AR triggers are typically written to do some other
operation once it is certain that the row has actually been updated.
In case of row movement, it is not actually updated.

How about running the BR update triggers for the old partition and the
AR update triggers for the new partition? It seems weird to run BR
update triggers but not AR update triggers. Another option would be
to run BR and AR delete triggers and then BR and AR insert triggers,
emphasizing the choice to treat this update as a delete + insert, but
(as Amit Kh. pointed out to me when we were in a room together this
week) that precludes using the BEFORE trigger to modify the row.

I also find the current behavior with respect to triggers quite odd.
The two points that appear odd are: (a) executing both before-row
update and delete triggers on the original partition sounds quite odd.

Note that a *before* trigger gets fired *before* the update happens.
The actual update may not even happen, depending upon what the trigger
does. And in our case, the update does not happen; more than that, it
is transformed into a delete-insert. So then we should fire the
before-delete trigger.

(b) It seems inconsistent to consider the behavior of row and
statement triggers differently.

I am not sure whether we should compare row and statement triggers.
Statement triggers are anyway fired only per-statement, depending upon
whether it is update or insert or delete. It has nothing to do with
how the rows are modified.

I checked the trigger behaviour in case of UPSERT. Here, when there is
conflict found, ExecOnConflictUpdate() is called, and then the
function returns immediately, which means AR INSERT trigger will not
fire. And ExecOnConflictUpdate() calls ExecUpdate(), which means BR
and AR UPDATE triggers will be fired. So in short, when an INSERT
becomes an UPDATE, BR INSERT triggers do fire, but then the BR UPDATE
and AR UPDATE also get fired. On the same lines, it makes sense in
case of UPDATE=>DELETE+INSERT operation to fire a BR UPDATE trigger on
the original table, and then the BR and AR DELETE/INSERT triggers on
the respective tables.

I am not sure if it is a good idea to compare it with "Insert On
Conflict Do Update", but even if we go that way, I think Insert On
Conflict is consistent in statement-level triggers, which means it
will fire both Insert and Update statement-level triggers (as per the
note in the docs below), whereas the documentation in the patch
indicates that this patch will only fire Update statement-level
triggers, which is odd.

Note in the docs about Insert On Conflict:
"Note that with an INSERT with an ON CONFLICT DO UPDATE clause, both
INSERT and UPDATE statement level triggers will be fired."

I guess the reason this behaviour is kept for UPSERT is that the
statement itself suggests insert/update.
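For reference, the UPSERT trigger-firing behaviour described above can be reproduced with a small sketch (table and trigger names here are invented for illustration; this assumes BR/AR INSERT and BR/AR UPDATE row triggers have been installed on t):

```sql
-- Hypothetical illustration of trigger firing for INSERT ... ON CONFLICT DO UPDATE.
create table t (k int primary key, v text);
insert into t values (1, 'old');
insert into t values (1, 'new')
    on conflict (k) do update set v = excluded.v;
-- For the conflicting row:
--   the BR INSERT trigger fires (before the conflict is detected),
--   then ExecOnConflictUpdate() calls ExecUpdate(),
--   so the BR UPDATE and AR UPDATE triggers fire;
--   the AR INSERT trigger does not fire.
```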

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

#50Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Kapila (#47)
Re: UPDATE of partition key

On 11 May 2017 at 17:24, Amit Kapila <amit.kapila16@gmail.com> wrote:

Few comments:
1.
Operating directly on a partition doesn't allow the update to move the row.
Refer to the example below:
create table t1(c1 int) partition by range(c1);
create table t1_part_1 partition of t1 for values from (1) to (100);
create table t1_part_2 partition of t1 for values from (100) to (200);
insert into t1 values(generate_series(1,11));
insert into t1 values(generate_series(110,120));

postgres=# update t1_part_1 set c1=122 where c1=11;
ERROR: new row for relation "t1_part_1" violates partition constraint
DETAIL: Failing row contains (122).

Yes, as Robert said, this is expected behaviour. We move the row only
within the partition subtree that has the table being updated as its
root. In this case, that is the leaf partition itself.
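With the patch applied, the same row movement does work when the UPDATE is addressed to the root of the subtree. A sketch based on the t1 example above (behaviour as described in this thread, not verified output):

```sql
-- UPDATE addressed to the leaf still errors out, as above:
update t1_part_1 set c1 = 122 where c1 = 11;
-- ERROR:  new row for relation "t1_part_1" violates partition constraint

-- UPDATE addressed to the root can move the row between leaves:
update t1 set c1 = 122 where c1 = 11;  -- row moves from t1_part_1 to t1_part_2
```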

3.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur.

Doesn't this error case indicate that this needs to be integrated
with Rahila's default partition patch, or that that patch needs to
take care of this error case? Basically, if there is no matching
partition, then move the row to the default partition.

Will have a look on this. Thanks for pointing this out.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

#51Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Khandekar (#49)
Re: UPDATE of partition key

On Thu, May 11, 2017 at 5:41 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 11 May 2017 at 17:23, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Mar 17, 2017 at 4:07 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 4 March 2017 at 12:49, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 2, 2017 at 11:53 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

I think it does not make sense to run after row triggers in case of
row movement, since no update has actually happened on that leaf
partition. This reasoning could also apply to BR update triggers, but
the reasons for having BR and AR triggers are quite different.
Generally, a user needs to make some modifications to the row before
the final NEW row gets into the database, and hence [s]he defines a BR
trigger for that. And we can't just silently skip this step only
because the final row went into some other partition; in fact, the
row movement itself might depend on what the BR trigger did with the
row. Whereas AR triggers are typically written to do some other
operation once it is certain that the row has actually been updated.
In case of row movement, it is not actually updated.

How about running the BR update triggers for the old partition and the
AR update triggers for the new partition? It seems weird to run BR
update triggers but not AR update triggers. Another option would be
to run BR and AR delete triggers and then BR and AR insert triggers,
emphasizing the choice to treat this update as a delete + insert, but
(as Amit Kh. pointed out to me when we were in a room together this
week) that precludes using the BEFORE trigger to modify the row.

I also find the current behavior with respect to triggers quite odd.
The two points that appear odd are: (a) executing both before-row
update and delete triggers on the original partition sounds quite odd.

Note that a *before* trigger gets fired *before* the update happens.
The actual update may not even happen, depending upon what the trigger
does. And in our case, the update does not happen; more than that, it
is transformed into a delete-insert. So then we should fire the
before-delete trigger.

Sure, I am aware of that point, but it doesn't seem obvious that both
update and delete BR triggers get fired for the original partition. As
a developer, it might be obvious to you that, since you have used the
delete and insert interfaces, it is okay for the corresponding BR/AR
triggers to get fired; however, it is not so obvious to others, rather
it appears quite odd. If we try to compare it with a non-partitioned
update, there it is also internally a delete and insert operation, but
we don't fire triggers for those.

(b) It seems inconsistent to consider the behavior of row and
statement triggers differently.

I am not sure whether we should compare row and statement triggers.
Statement triggers are anyway fired only per-statement, depending upon
whether it is update or insert or delete. It has nothing to do with
how the rows are modified.

Okay. The broader point I was trying to convey was that the way this
patch defines the behavior of triggers doesn't sound good to me. It
appears to me that multiple people in this thread have raised points
around trigger behavior, and you should try to consider those. Apart
from the options Robert has suggested, another option could be that we
allow firing BR-AR update triggers for the original partition and
BR-AR insert triggers for the new partition. In this case, one can
argue that we have not actually updated the row in the original
partition, so there is no need to fire AR update triggers, but I feel
that is what we do for a non-partitioned table update and it should be
okay here as well.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#52Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Khandekar (#50)
Re: UPDATE of partition key

On Thu, May 11, 2017 at 5:45 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 11 May 2017 at 17:24, Amit Kapila <amit.kapila16@gmail.com> wrote:

Few comments:
1.
Operating directly on a partition doesn't allow the update to move the row.
Refer to the example below:
create table t1(c1 int) partition by range(c1);
create table t1_part_1 partition of t1 for values from (1) to (100);
create table t1_part_2 partition of t1 for values from (100) to (200);
insert into t1 values(generate_series(1,11));
insert into t1 values(generate_series(110,120));

postgres=# update t1_part_1 set c1=122 where c1=11;
ERROR: new row for relation "t1_part_1" violates partition constraint
DETAIL: Failing row contains (122).

Yes, as Robert said, this is expected behaviour. We move the row only
within the partition subtree that has the table being updated as its
root. In this case, that is the leaf partition itself.

Okay, but what is the technical reason behind it? Is it because the
current design doesn't support it, or is it because of something very
fundamental to partitions? Is it because we can't find the root
partition from a leaf partition?

+ is_partitioned_table =
+ root_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;
+
+ if (is_partitioned_table)
+ ExecSetupPartitionTupleRouting(
+ root_rel,
+ /* Build WITH CHECK OPTION constraints for leaf partitions */
+ ExecInitPartitionWithCheckOptions(mtstate, root_rel);
+ /* Build a projection for each leaf partition rel. */
+ ExecInitPartitionReturningProjection(mtstate, root_rel);
..
+ /* It's not a partitioned table after all; error out. */
+ ExecPartitionCheckEmitError(resultRelInfo, slot, estate);

When we are anyway going to give an error if the table is not a
partitioned table, isn't it better to give it early, when we first
identify that?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#53Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#52)
Re: UPDATE of partition key

On Fri, May 12, 2017 at 9:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, May 11, 2017 at 5:45 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 11 May 2017 at 17:24, Amit Kapila <amit.kapila16@gmail.com> wrote:

Few comments:
1.
Operating directly on a partition doesn't allow the update to move the row.
Refer to the example below:
create table t1(c1 int) partition by range(c1);
create table t1_part_1 partition of t1 for values from (1) to (100);
create table t1_part_2 partition of t1 for values from (100) to (200);
insert into t1 values(generate_series(1,11));
insert into t1 values(generate_series(110,120));

postgres=# update t1_part_1 set c1=122 where c1=11;
ERROR: new row for relation "t1_part_1" violates partition constraint
DETAIL: Failing row contains (122).

Yes, as Robert said, this is expected behaviour. We move the row only
within the partition subtree that has the table being updated as its
root. In this case, that is the leaf partition itself.

Okay, but what is the technical reason behind it? Is it because the
current design doesn't support it, or is it because of something very
fundamental to partitions?

One plausible theory is that as SELECTs on a partition just return
the rows of that partition, the update should also behave in the same
way.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#54Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Kapila (#51)
Re: UPDATE of partition key

On 12 May 2017 at 08:30, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, May 11, 2017 at 5:41 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 11 May 2017 at 17:23, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Mar 17, 2017 at 4:07 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 4 March 2017 at 12:49, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 2, 2017 at 11:53 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

I think it does not make sense to run after row triggers in case of
row movement, since no update has actually happened on that leaf
partition. This reasoning could also apply to BR update triggers, but
the reasons for having BR and AR triggers are quite different.
Generally, a user needs to make some modifications to the row before
the final NEW row gets into the database, and hence [s]he defines a BR
trigger for that. And we can't just silently skip this step only
because the final row went into some other partition; in fact, the
row movement itself might depend on what the BR trigger did with the
row. Whereas AR triggers are typically written to do some other
operation once it is certain that the row has actually been updated.
In case of row movement, it is not actually updated.

How about running the BR update triggers for the old partition and the
AR update triggers for the new partition? It seems weird to run BR
update triggers but not AR update triggers. Another option would be
to run BR and AR delete triggers and then BR and AR insert triggers,
emphasizing the choice to treat this update as a delete + insert, but
(as Amit Kh. pointed out to me when we were in a room together this
week) that precludes using the BEFORE trigger to modify the row.

I also find the current behavior with respect to triggers quite odd.
The two points that appear odd are: (a) executing both before-row
update and delete triggers on the original partition sounds quite odd.

Note that a *before* trigger gets fired *before* the update happens.
The actual update may not even happen, depending upon what the trigger
does. And in our case, the update does not happen; more than that, it
is transformed into a delete-insert. So then we should fire the
before-delete trigger.

Sure, I am aware of that point, but it doesn't seem obvious that both
update and delete BR triggers get fired for the original partition. As
a developer, it might be obvious to you that, since you have used the
delete and insert interfaces, it is okay for the corresponding BR/AR
triggers to get fired; however, it is not so obvious to others, rather
it appears quite odd.

I agree that it seems a bit odd that we are firing both update and
delete triggers on the same partition. But if you look at it from the
perspective that the update=>delete+insert is a user-aware operation,
it does not seem that odd.

If we try to compare it with a non-partitioned update, there it is
also internally a delete and insert operation, but we don't fire
triggers for those.

For a non-partitioned table, the delete+insert is internal, whereas
for a partitioned table, it is completely visible to the user.

(b) It seems inconsistent to consider the behavior of row and
statement triggers differently.

I am not sure whether we should compare row and statement triggers.
Statement triggers are anyway fired only per-statement, depending upon
whether it is update or insert or delete. It has nothing to do with
how the rows are modified.

Okay. The broader point I was trying to convey was that the way this
patch defines the behavior of triggers doesn't sound good to me. It
appears to me that multiple people in this thread have raised points
around trigger behavior, and you should try to consider those.

I understand that there is no single solution which will provide
completely intuitive trigger behaviour. Skipping the BR delete trigger
should be fine. But then, for consistency, we should skip the BR
insert trigger as well, the theory being that the delete+insert is not
fired by the user and so we should not fire them. But I feel both
should be fired, to avoid any consequences unexpected to the user who
has installed those triggers.

The only specific concern of yours is that of firing *both* update and
insert triggers on the same table, right? My explanation for this was:
we have done this before for UPSERT, and we have documented the same.
We can do it here also.

Apart from the options Robert has suggested, another option could be
that we allow firing BR-AR update triggers for the original partition
and BR-AR insert triggers for the new partition. In this case, one can
argue that we have not actually updated the row in the original
partition, so there is no need to fire AR update triggers,

Yes, that's what I think. If no update has happened, then the AR
update trigger should not be executed. AR triggers are only for
scenarios where it is guaranteed that the DML operation has happened
when the trigger is executed.

but I feel that is what we do for a non-partitioned table update and
it should be okay here as well.

I don't think so. E.g. if a BR trigger returns NULL, the update does
not happen, and then the AR trigger does not fire either. Do you see
any other scenarios for non-partitioned tables where AR triggers do
fire when the update does not happen?
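The BR-trigger-returns-NULL case mentioned above can be shown on a plain (non-partitioned) table; the table, function, and trigger names here are invented for the sketch:

```sql
create table plain (a int);

-- A BR UPDATE trigger that returns NULL suppresses the update entirely.
create function skip_update() returns trigger language plpgsql as
$$ begin return NULL; end $$;

create trigger br_upd before update on plain
    for each row execute procedure skip_update();

insert into plain values (1);
update plain set a = 2;  -- reports UPDATE 0: the row stays untouched,
                         -- so an AR UPDATE trigger on "plain" would not fire
```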

Overall, I am also open to skipping both the insert and delete BR
triggers, but I am trying to argue above that this might not be as odd
as it sounds, especially if we clearly document why we have done it.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

#55Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Kapila (#53)
Re: UPDATE of partition key

On 12 May 2017 at 10:01, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, May 12, 2017 at 9:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, May 11, 2017 at 5:45 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 11 May 2017 at 17:24, Amit Kapila <amit.kapila16@gmail.com> wrote:

Few comments:
1.
Operating directly on a partition doesn't allow the update to move the row.
Refer to the example below:
create table t1(c1 int) partition by range(c1);
create table t1_part_1 partition of t1 for values from (1) to (100);
create table t1_part_2 partition of t1 for values from (100) to (200);
insert into t1 values(generate_series(1,11));
insert into t1 values(generate_series(110,120));

postgres=# update t1_part_1 set c1=122 where c1=11;
ERROR: new row for relation "t1_part_1" violates partition constraint
DETAIL: Failing row contains (122).

Yes, as Robert said, this is expected behaviour. We move the row only
within the partition subtree that has the table being updated as its
root. In this case, that is the leaf partition itself.

Okay, but what is the technical reason behind it? Is it because the
current design doesn't support it, or is it because of something very
fundamental to partitions?

No, we can do that if we decide to update some table outside the
partition subtree. The reason is more one of semantics. I think the
user who is running UPDATE on a partitioned table should not
necessarily be aware of the structure of the complete partition tree
outside of the current subtree. It is always safer to return an error
than to silently move the data outside of the subtree.

One plausible theory is that as SELECTs on a partition just return
the rows of that partition, the update should also behave in the same
way.

Yes, right. Even inserts fail if we try to insert data that does not
fit into the current subtree.
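Using the earlier t1 example, the analogous insert behaviour looks like this (sketch; the error text follows the pattern shown earlier in the thread):

```sql
-- A direct insert into a leaf must satisfy that leaf's constraint,
-- just like an update run directly on the leaf:
insert into t1_part_1 values (150);
-- ERROR:  new row for relation "t1_part_1" violates partition constraint
```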

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

#56Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#50)
Re: UPDATE of partition key

3.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur.

Doesn't this error case indicate that this needs to be integrated
with Rahila's default partition patch, or that that patch needs to
take care of this error case? Basically, if there is no matching
partition, then move the row to the default partition.

Will have a look on this. Thanks for pointing this out.

I tried update row movement with both my patch and the
default-partition patch applied, and it looks like it works as
expected:

1. When an update changes the partition key such that the row does not
fit into any of the non-default partitions, the row is moved to the
default partition.
2. If the row does fit into a non-default partition, the row moves
into that partition.
3. If a row from the default partition is updated such that it fits
into one of the non-default partitions, it moves into that partition.
We can debate whether the row should stay in the default partition or
move; I think it should be moved, since the row now has a suitable
partition.
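The three cases above can be sketched as follows, assuming the default-partition syntax from Rahila's patch (the syntax shown here is the one that eventually landed in PostgreSQL 11; table names are invented):

```sql
create table p (c1 int) partition by range (c1);
create table p1   partition of p for values from (1) to (100);
create table pdef partition of p default;

insert into p values (50);
update p set c1 = 500 where c1 = 50;   -- case 1: no non-default partition fits,
                                       --         so the row moves into pdef
update p set c1 = 60 where c1 = 500;   -- case 3: row moves back from pdef into p1
```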

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

#57Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#22)
Re: UPDATE of partition key

On Fri, Feb 24, 2017 at 3:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Feb 24, 2017 at 3:24 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

It is of course very good that we have something ready for this
release and can make a choice of what to do.

Thoughts

1. Reuse the tuple state HEAP_MOVED_OFF, which IIRC represents almost
exactly the same thing. An UPDATE which gets to a HEAP_MOVED_OFF tuple
will know to re-find the tuple via the partition metadata, or I might
be persuaded that in this release it is acceptable to fail when this
occurs with an ERROR and a retryable SQLCODE, since the UPDATE will
succeed on the next execution.

I've got my doubts about whether we can make that bit work that way,
considering that we still support pg_upgrade (possibly in multiple
steps) from old releases that had VACUUM FULL. We really ought to put
some work into reclaiming those old bits, but there's probably no time
for that in v10.

I agree with you that it might not be straightforward to make it
work, but now that the earliest this can go in is v11, do we want to
try doing something other than just documenting it? What I could read
from this e-mail thread is that you are leaning towards just
documenting it for the first cut of this feature. However, both Greg
and Simon are of the opinion that we should do something about this,
and even the patch author (Amit Khandekar) has shown some inclination
to do something about this point (return an error to the user in some
way), so I think we can't ignore this point.

I think now that we have some more time, it is better to try something
based on a couple of ideas floating in this thread to address this
point, and see if we can come up with something doable without a big
architecture change.

What is your take on this point now?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#58Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Khandekar (#54)
Re: UPDATE of partition key

On Fri, May 12, 2017 at 10:49 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 12 May 2017 at 08:30, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, May 11, 2017 at 5:41 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

If we try to compare it with a non-partitioned update, there it is
also internally a delete and insert operation, but we don't fire
triggers for those.

For a non-partitioned table, the delete+insert is internal, whereas
for a partitioned table, it is completely visible to the user.

If the user has executed an update on the root table, then it is
transparent. I think we can consider it user-visible only if there is
some user-visible syntax like "Update ... Move Row If Constraint Not
Satisfied".

(b) It seems inconsistent to consider the behavior of row and
statement triggers differently.

I am not sure whether we should compare row and statement triggers.
Statement triggers are anyway fired only per-statement, depending upon
whether it is update or insert or delete. It has nothing to do with
how the rows are modified.

Okay. The broader point I was trying to convey was that the way this
patch defines the behavior of triggers doesn't sound good to me. It
appears to me that multiple people in this thread have raised points
around trigger behavior, and you should try to consider those.

I understand that there is no single solution which will provide
completely intuitive trigger behaviour. Skipping the BR delete trigger
should be fine. But then, for consistency, we should skip the BR
insert trigger as well, the theory being that the delete+insert is not
fired by the user and so we should not fire them. But I feel both
should be fired, to avoid any consequences unexpected to the user who
has installed those triggers.

The only specific concern of yours is that of firing *both* update and
insert triggers on the same table, right? My explanation for this was:
we have done this before for UPSERT, and we have documented the same.
We can do it here also.

Apart from the options Robert has suggested, another option could be
that we allow firing BR-AR update triggers for the original partition
and BR-AR insert triggers for the new partition. In this case, one can
argue that we have not actually updated the row in the original
partition, so there is no need to fire AR update triggers,

Yes, that's what I think. If no update has happened, then the AR
update trigger should not be executed. AR triggers are only for
scenarios where it is guaranteed that the DML operation has happened
by the time the trigger is executed.

but I feel that is what we do for non-partitioned table update and it should be okay here
as well.

I don't think so. For e.g. if a BR trigger returns NULL, the update
does not happen, and then the AR trigger does not fire as well. Do you
see any other scenarios for non-partitioned tables, where AR triggers
do fire when the update does not happen ?

No, but here also it can be considered as an update for original partition.

Overall, I am also open to skipping both insert+delete BR trigger,

I think it might be better to summarize all the options discussed
including what the patch has and see what most people consider as
sensible.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#59Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Kapila (#58)
Re: UPDATE of partition key

On 12 May 2017 at 14:56, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think it might be better to summarize all the options discussed
including what the patch has and see what most people consider as
sensible.

Yes, makes sense. Here are the options that were discussed so far for
ROW triggers :

Option 1 : (the patch follows this option)
----------
BR Update trigger for source partition.
BR,AR Delete trigger for source partition.
BR,AR Insert trigger for destination partition.
No AR Update trigger.

Rationale :

BR update trigger should be fired because that trigger can modify the
row, and that can result in a partition key update even though the
UPDATE statement itself is not updating the partition key.

Also, fire the delete/insert triggers on the respective partitions,
since rows are about to be deleted/inserted. The AR update trigger
should not be fired because that requires an actual update to have
happened.

Option 2
----------
BR Update trigger for source partition.
AR Update trigger on destination partition.
No insert/delete triggers.

Rationale :

Since it's an UPDATE statement, only update triggers should be fired.
The update ends up moving the row into another partition, so AR Update
trigger should be fired on this partition rather than the original
partition.

Option 3
--------

BR, AR delete triggers on source partition
BR, AR insert triggers on destination partition.

Rationale :
Since the update is converted to delete+insert, just skip the update
triggers completely.

Option 4
--------

BR-AR update triggers for source partition
BR-AR insert triggers for destination partition

Rationale :
Since it is an update statement, both BR and AR UPDATE triggers should
be fired on the original partition.
Since the update is converted to delete+insert, the corresponding
triggers should be fired; but since we are already firing UPDATE
triggers on the original partition, skip the delete triggers, otherwise
both UPDATE and DELETE triggers would get fired on the same partition.

----------------

For statement triggers, I think it should be based on the
documentation recently checked in for partitions in general.

+    A statement that targets a parent table in a inheritance or partitioning
+    hierarchy does not cause the statement-level triggers of affected child
+    tables to be fired; only the parent table's statement-level triggers are
+    fired.  However, row-level triggers of any affected child tables will be
+    fired.

Based on that, for row movement as well, the statement trigger should
be fired only for the table referred to in the UPDATE statement, and not
for any child tables, or for any partitions to which rows were moved. The
doc in this row-movement patch also matches this behaviour.


#60Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#57)
Re: UPDATE of partition key

On Fri, May 12, 2017 at 3:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I agree with you that it might not be straightforward to make it work,
but now that earliest it can go is v11, do we want to try doing
something other than just documenting it. What I could read from this
e-mail thread is that you are intending towards just documenting it
for the first cut of this feature. However, both Greg and Simon are of
opinion that we should do something about this and even patch Author
(Amit Khandekar) has shown some inclination to do something about this
point (return error to the user in some way), so I think we can't
ignore this point.

I think now that we have some more time, it is better to try something
based on a couple of ideas floating in this thread to address this
point and see if we can come up with something doable without a big
architecture change.

What is your take on this point now?

I still don't think it's worth spending a bit on this, especially not
with WARM probably gobbling up multiple bits. Reclaiming the bits
seems like a good idea, but spending one on this still seems to me
like it's probably not the best use of our increasingly-limited supply
of infomask bits. Now, Simon and Greg may still feel otherwise, of
course.

I could get behind providing an option to turn this behavior on and
off at the level of the partitioned table. That would use a reloption
rather than an infomask bit, so no scarce resource is being consumed.
I suspect that most people don't update the partition keys at all (so
they don't care either way) and the ones who do are probably either
depending on EPQ (in which case they most likely want to just disallow
all UPDATE-row-movement) or not (in which case they again don't care).
If I understand correctly, the only people who will benefit from
consuming an infomask bit are the people who update their partition
keys AND depend on EPQ BUT only for non-key updates AND need the
system to make sure that they don't accidentally rely on it for the
case of an EPQ update. That seems (to me, anyway) like it's got to be
a really small percentage of actual users, but I just work here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#61Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Khandekar (#59)
Re: UPDATE of partition key

On Fri, May 12, 2017 at 4:17 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Option 3
--------

BR, AR delete triggers on source partition
BR, AR insert triggers on destination partition.

Rationale :
Since the update is converted to delete+insert, just skip the update
triggers completely.

+1 to option3
Generally, BR triggers are used for updating the ROW value and AR
triggers to VALIDATE the row or to modify some other tables. So it
seems we can fire the triggers according to what operation is actually
happening at the partition level.

For the source partition, it's only a delete operation (no update
happened), so we fire delete triggers; and for the destination, only an
insert operation, so we fire only insert triggers. That will keep
things simple, and it will also be in sync with the actual
partition-level delete/insert operations.

One may argue that the user might have declared only update triggers,
and since he has executed an update operation he may expect those
triggers to get fired. But I think this behaviour can be documented
with the rationale that if the user is updating the partition key, then
he must also be ready with delete/insert triggers; he cannot rely only
upon update-level triggers.

Earlier I thought that option 1 was better, but later realised that it
can complicate the situation, since we fire first the BR update and
then the BR delete trigger, which can change the row multiple times,
and defining such behaviour can be complicated.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


#62Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#61)
Re: UPDATE of partition key

On Wed, May 17, 2017 at 12:06 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, May 12, 2017 at 4:17 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Option 3
--------

BR, AR delete triggers on source partition
BR, AR insert triggers on destination partition.

Rationale :
Since the update is converted to delete+insert, just skip the update
triggers completely.

+1 to option3

..

Earlier I thought that option 1 was better, but later realised that it
can complicate the situation, since we fire first the BR update and
then the BR delete trigger, which can change the row multiple times,
and defining such behaviour can be complicated.

If we have to go by this theory, then the option you have preferred
will still execute BR triggers for both delete and insert, so input
row can still be changed twice.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#63Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#60)
Re: UPDATE of partition key

On Mon, May 15, 2017 at 5:28 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, May 12, 2017 at 3:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I agree with you that it might not be straightforward to make it work,
but now that earliest it can go is v11, do we want to try doing
something other than just documenting it. What I could read from this
e-mail thread is that you are intending towards just documenting it
for the first cut of this feature. However, both Greg and Simon are of
opinion that we should do something about this and even patch Author
(Amit Khandekar) has shown some inclination to do something about this
point (return error to the user in some way), so I think we can't
ignore this point.

I think now that we have some more time, it is better to try something
based on a couple of ideas floating in this thread to address this
point and see if we can come up with something doable without a big
architecture change.

What is your take on this point now?

I still don't think it's worth spending a bit on this, especially not
with WARM probably gobbling up multiple bits. Reclaiming the bits
seems like a good idea, but spending one on this still seems to me
like it's probably not the best use of our increasingly-limited supply
of infomask bits.

I think we can do this even without using an additional infomask bit.
As suggested by Greg up thread, we can set InvalidBlockId in ctid to
indicate such an update.

Now, Simon and Greg may still feel otherwise, of
course.

I could get behind providing an option to turn this behavior on and
off at the level of the partitioned table.

Sure that sounds like a viable option and we can set the default value
as false. However, it might be better if we can detect the same
internally without big changes.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#64Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#62)
Re: UPDATE of partition key

On Wed, May 17, 2017 at 3:15 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Earlier I thought that option 1 was better, but later realised that it
can complicate the situation, since we fire first the BR update and
then the BR delete trigger, which can change the row multiple times,
and defining such behaviour can be complicated.

If we have to go by this theory, then the option you have preferred
will still execute BR triggers for both delete and insert, so input
row can still be changed twice.

Yeah, right; as per my theory above, option 3 has the same problem.

But after putting in some more thought, I realised that the row can be
changed only by a "Before Update" or "Before Insert" trigger. Correct
me if I am assuming something wrong.

So now, again, option 3 makes more sense.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


#65Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#63)
Re: UPDATE of partition key

On Wed, May 17, 2017 at 6:29 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think we can do this even without using an additional infomask bit.
As suggested by Greg up thread, we can set InvalidBlockId in ctid to
indicate such an update.

Hmm. How would that work?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#66Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Dilip Kumar (#61)
Re: UPDATE of partition key

On Wed, May 17, 2017 at 12:06 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, May 12, 2017 at 4:17 PM, Amit Khandekar <amitdkhan.pg@gmail.com>
wrote:

Option 3
--------

BR, AR delete triggers on source partition
BR, AR insert triggers on destination partition.

Rationale :
Since the update is converted to delete+insert, just skip the update
triggers completely.

+1 to option3
Generally, BR triggers are used for updating the ROW value and AR
triggers to VALIDATE the row or to modify some other tables. So it
seems we can fire the triggers according to what operation is actually
happening at the partition level.

For the source partition, it's only a delete operation (no update
happened), so we fire delete triggers; and for the destination, only an
insert operation, so we fire only insert triggers. That will keep
things simple, and it will also be in sync with the actual
partition-level delete/insert operations.

One may argue that the user might have declared only update triggers,
and since he has executed an update operation he may expect those
triggers to get fired. But I think this behaviour can be documented
with the rationale that if the user is updating the partition key, then
he must also be ready with delete/insert triggers; he cannot rely only
upon update-level triggers.

Right, that is my concern as well. The user might have declared only
update triggers, and when executing an UPDATE would expect them to get
called - but with option 3 that's not happening.

In terms of consistency, option 1 looks better. It does the same as
what has been implemented for UPSERT - so the user might already be
aware of the trigger behaviour. Plus, if we document the behaviour,
then it sounds correct -

- The original command was UPDATE, so fire the BR update trigger.
- Later it is found to be a row movement - so BR delete followed by AR delete.
- Then the insert into the new partition - so BR insert followed by AR insert.

But again, I am not quite sure how good it is to compare the partition
behaviour with UPSERT.

Earlier I thought that option 1 was better, but later realised that it
can complicate the situation, since we fire first the BR update and
then the BR delete trigger, which can change the row multiple times,
and defining such behaviour can be complicated.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


--
Rushabh Lathia

#67Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#65)
Re: UPDATE of partition key

On Wed, May 17, 2017 at 5:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, May 17, 2017 at 6:29 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think we can do this even without using an additional infomask bit.
As suggested by Greg up thread, we can set InvalidBlockId in ctid to
indicate such an update.

Hmm. How would that work?

We can pass a flag, say row_moved (or require_row_movement), to
heap_delete, which will in turn set InvalidBlockId in the ctid instead
of setting it to self. Then ExecUpdate needs to check for the same and
return an error when heap_update is not successful (result !=
HeapTupleMayBeUpdated). Can you explain what difficulty you are
envisioning?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#68Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Rushabh Lathia (#66)
Re: UPDATE of partition key

On 17 May 2017 at 17:29, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:

On Wed, May 17, 2017 at 12:06 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, May 12, 2017 at 4:17 PM, Amit Khandekar <amitdkhan.pg@gmail.com>
wrote:

Option 3
--------

BR, AR delete triggers on source partition
BR, AR insert triggers on destination partition.

Rationale :
Since the update is converted to delete+insert, just skip the update
triggers completely.

+1 to option3
Generally, BR triggers are used for updating the ROW value and AR
triggers to VALIDATE the row or to modify some other tables. So it
seems we can fire the triggers according to what operation is actually
happening at the partition level.

For the source partition, it's only a delete operation (no update
happened), so we fire delete triggers; and for the destination, only an
insert operation, so we fire only insert triggers. That will keep
things simple, and it will also be in sync with the actual
partition-level delete/insert operations.

One may argue that the user might have declared only update triggers,
and since he has executed an update operation he may expect those
triggers to get fired. But I think this behaviour can be documented
with the rationale that if the user is updating the partition key, then
he must also be ready with delete/insert triggers; he cannot rely only
upon update-level triggers.

Right, that is my concern as well. The user might have declared only
update triggers, and when executing an UPDATE would expect them to get
called - but with option 3 that's not happening.

Yes, that's the issue with option 3. A user wants to make sure the
update triggers run, and here we are skipping the BEFORE update
triggers, which might even modify the row.

Now regarding the AR update triggers: the user might be more concerned
with the non-partition-key columns, and an UPDATE of the partition key
would typically update only the partition key and not the other
columns. So for the typical case, it makes sense to skip the AR update
trigger. But if the UPDATE contains both partition-key as well as other
column updates, it makes sense to fire the AR update trigger. One thing
we can do is restrict an UPDATE from having both partition-key and
non-partition-key column updates. That way we can always skip the AR
update triggers for row-movement updates, unless maybe we fire AR
update triggers *only* if they are created using
"AFTER UPDATE OF <column_name>" and the column is the partition key.

Between skipping delete-insert triggers versus skipping update
triggers, I would go for skipping delete-insert triggers. I think we
cannot skip BR update triggers because that would be a correctness
issue.

From user-perspective, I think the user would like to install a
trigger that would fire if any of the child tables get modified. But
because there is no provision to install a common trigger, the user
has to install the same trigger on every child table. In that sense,
it might not matter whether we fire AR UPDATE trigger on old partition
or new partition.

In terms of consistency, option 1 looks better. It does the same as
what has been implemented for UPSERT - so the user might already be
aware of the trigger behaviour. Plus, if we document the behaviour,
then it sounds correct -

- The original command was UPDATE, so fire the BR update trigger.
- Later it is found to be a row movement - so BR delete followed by AR delete.
- Then the insert into the new partition - so BR insert followed by AR insert.

But again, I am not quite sure how good it is to compare the partition
behaviour with UPSERT.

Earlier I thought that option 1 was better, but later realised that it
can complicate the situation, since we fire first the BR update and
then the BR delete trigger, which can change the row multiple times,
and defining such behaviour can be complicated.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


--
Rushabh Lathia

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


#69Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#64)
Re: UPDATE of partition key

On Wed, May 17, 2017 at 4:05 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, May 17, 2017 at 3:15 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Earlier I thought that option 1 was better, but later realised that it
can complicate the situation, since we fire first the BR update and
then the BR delete trigger, which can change the row multiple times,
and defining such behaviour can be complicated.

If we have to go by this theory, then the option you have preferred
will still execute BR triggers for both delete and insert, so input
row can still be changed twice.

Yeah, right; as per my theory above, option 3 has the same problem.

But after putting in some more thought, I realised that the row can be
changed only by a "Before Update" or "Before Insert" trigger.

Before Row Delete triggers can suppress the delete operation itself
which is kind of unintended in this case. I think without the user
being aware it doesn't seem advisable to execute multiple BR triggers.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#70Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Kapila (#69)
Re: UPDATE of partition key

On 18 May 2017 at 16:52, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, May 17, 2017 at 4:05 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, May 17, 2017 at 3:15 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Earlier I thought that option 1 was better, but later realised that it
can complicate the situation, since we fire first the BR update and
then the BR delete trigger, which can change the row multiple times,
and defining such behaviour can be complicated.

If we have to go by this theory, then the option you have preferred
will still execute BR triggers for both delete and insert, so input
row can still be changed twice.

Yeah, right; as per my theory above, option 3 has the same problem.

But after putting in some more thought, I realised that the row can be
changed only by a "Before Update" or "Before Insert" trigger.

Before Row Delete triggers can suppress the delete operation itself
which is kind of unintended in this case. I think without the user
being aware it doesn't seem advisable to execute multiple BR triggers.

By now, the majority of opinions do not favour two kinds of triggers
getting fired on a single update. Amit, do you consider option 2 a
valid option? That is, fire only UPDATE triggers: BR on the source
partition, and AR on the destination partition. Do you agree that
firing the BR update trigger is essential, since it can modify the row
and even prevent the update from happening?

Also, since a user does not have a provision to install a common
UPDATE row trigger, (s)he installs it on each of the leaf partitions.
Then, when an update causes row movement, option 3 would end up not
firing update triggers on any of the partitions. So I prefer option 2
over option 3, i.e. make sure to fire the BR and AR update triggers.
Actually, option 2 is what Robert had proposed in the beginning.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


#71Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Kapila (#52)
1 attachment(s)
Re: UPDATE of partition key

On 12 May 2017 at 09:27, Amit Kapila <amit.kapila16@gmail.com> wrote:

+ is_partitioned_table =
+ root_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;
+
+ if (is_partitioned_table)
+ ExecSetupPartitionTupleRouting(
+ root_rel,
+ /* Build WITH CHECK OPTION constraints for leaf partitions */
+ ExecInitPartitionWithCheckOptions(mtstate, root_rel);
+ /* Build a projection for each leaf partition rel. */
+ ExecInitPartitionReturningProjection(mtstate, root_rel);
..
+ /* It's not a partitioned table after all; error out. */
+ ExecPartitionCheckEmitError(resultRelInfo, slot, estate);

When we are anyway going to give an error if the table is not a
partitioned table, then isn't it better to give it early, when we first
identify that?

Yeah, that's right; fixed. I moved the partitioned-table check earlier.
This also showed that there is no need for the is_partitioned_table
variable, so the code is adjusted accordingly.

-
+static void ExecInitPartitionWithCheckOptions(ModifyTableState *mtstate,
+  Relation root_rel);
Spurious line delete.

Done.

Also rebased the patch over latest code.

Attached v8 patch.

Attachments:

update-partition-key_v8.patchapplication/octet-stream; name=update-partition-key_v8.patchDownload
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 84c4f20..b3b1816 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -2992,6 +2992,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition where this row satisfies its partition constraint.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3284,9 +3289,20 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</> causes a row to move from one partition to
+       another, there is a chance that another concurrent <command>UPDATE</> or
+       <command>DELETE</> misses this row. Suppose, during the row movement,
+       the row is still visible for the concurrent session, and it is about to
+       do an <command>UPDATE</> or <command>DELETE</> operation on the same
+       row. This DML operation can silently miss this row if the row now gets
+       deleted from the partition by the first session as part of its
+       <command>UPDATE</> row movement. In such case, the concurrent
+       <command>UPDATE</>/<command>DELETE</>, being unaware of the row
+       movement, interprets that the row has just been deleted so there is
+       nothing to be done for this row. Whereas, in the usual case where the
+       table is not partitioned, or where there is no row movement, the second
+       session would have identified the newly updated row and carried
+       <command>UPDATE</>/<command>DELETE</> on this new row version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 8a1619f..28cfc1a 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</> and
+   <command>INSERT</> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</> or <command>DELETE</> on the same row may miss
+   this row. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index 950245d..72300a0 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -160,6 +160,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it will be performed as a
+    <command>DELETE</command> from the original partition followed by
+    <command>INSERT</command> into the new partition. In this case, all
+    row-level <literal>BEFORE</> <command>UPDATE</command> triggers and all
+    row-level <literal>BEFORE</> <command>DELETE</command> triggers are fired
+    on the original partition. Then all row-level <literal>BEFORE</>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</>
+    triggers are concerned, <literal>AFTER</> <command>DELETE</command> and
+    <literal>AFTER</> <command>INSERT</command> triggers are applied; but
+    <literal>AFTER</> <command>UPDATE</command> triggers are not applied
+    because the <command>UPDATE</command> has been converted to a
+    <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 84b1a54..c1d8d0b 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2643,7 +2643,7 @@ CopyFrom(CopyState cstate)
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr ||
 					resultRelInfo->ri_PartitionCheck)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 4a899f1..d5e6779 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1838,7 +1838,7 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
  *
  * Note: This is called *iff* resultRelInfo is the main target table.
  */
-static bool
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
@@ -1875,8 +1875,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if requested,
+ * checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1884,7 +1884,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -1997,45 +1998,61 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck &&
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
 		!ExecPartitionCheck(resultRelInfo, slot, estate))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+}
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
-		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
+
+
+	/* See the comments in ExecConstraints. */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
 								 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+		if (map != NULL)
+		{
+			tuple = do_convert_tuple(tuple, map);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-		  errmsg("new row for relation \"%s\" violates partition constraint",
-				 RelationGetRelationName(orig_rel)),
-			val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+	  errmsg("new row for relation \"%s\" violates partition constraint",
+			 RelationGetRelationName(orig_rel)),
+		val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index c6a66b6..7e82482 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -389,7 +389,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -448,7 +448,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index cf555fe..82a61b2 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -62,7 +62,10 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 EState *estate,
 					 bool canSetTag,
 					 TupleTableSlot **returning);
-
+static void ExecInitPartitionWithCheckOptions(ModifyTableState *mtstate,
+											  Relation root_rel);
+static void ExecInitPartitionReturningProjection(ModifyTableState *mtstate,
+												 Relation root_rel);
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
  * target relation's rowtype
@@ -434,7 +437,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * Check the constraints of the tuple
 		 */
 		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -624,6 +627,8 @@ ExecDelete(ItemPointer tupleid,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool   *concurrently_deleted,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -632,6 +637,9 @@ ExecDelete(ItemPointer tupleid,
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
 
+	if (concurrently_deleted)
+		*concurrently_deleted = false;
+
 	/*
 	 * get information on the (current) result relation
 	 */
@@ -775,6 +783,8 @@ ldelete:;
 					}
 				}
 				/* tuple already deleted; nothing to do */
+				if (concurrently_deleted)
+					*concurrently_deleted = true;
 				return NULL;
 
 			default:
@@ -798,8 +808,8 @@ ldelete:;
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -877,7 +887,8 @@ ldelete:;
  * ----------------------------------------------------------------
  */
 static TupleTableSlot *
-ExecUpdate(ItemPointer tupleid,
+ExecUpdate(ModifyTableState *mtstate,
+		   ItemPointer tupleid,
 		   HeapTuple oldtuple,
 		   TupleTableSlot *slot,
 		   TupleTableSlot *planSlot,
@@ -987,12 +998,82 @@ lreplace:;
 								 resultRelInfo, slot, estate);
 
 		/*
+		 * If a partition check fails, try to move the row into the right
+		 * partition.
+		 */
+		if (resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate))
+		{
+			bool	concurrently_deleted;
+
+			if (mtstate->mt_partition_dispatch_info == NULL)
+			{
+				ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+				Relation root_rel;
+
+				/*
+				 * If this is a partitioned table, we need to open the root
+				 * table RT index which is at the head of partitioned_rels
+				 */
+				if (node->partitioned_rels)
+				{
+					Index	root_rti;
+					Oid		root_oid;
+
+					root_rti = linitial_int(node->partitioned_rels);
+					root_oid = getrelid(root_rti, estate->es_range_table);
+					root_rel = heap_open(root_oid, NoLock);	/* locked by InitPlan */
+				}
+				else /* this may be a leaf partition */
+					root_rel = mtstate->resultRelInfo->ri_RelationDesc;
+
+				/* If it's not a partitioned table after all, bail out. */
+				if (root_rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+					ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+				ExecSetupPartitionTupleRouting(
+									root_rel,
+									&mtstate->mt_partition_dispatch_info,
+									&mtstate->mt_partitions,
+									&mtstate->mt_partition_tupconv_maps,
+									&mtstate->mt_partition_tuple_slot,
+									&mtstate->mt_num_dispatch,
+									&mtstate->mt_num_partitions);
+
+				/* Build WITH CHECK OPTION constraints for leaf partitions */
+				ExecInitPartitionWithCheckOptions(mtstate, root_rel);
+
+				/* Build a projection for each leaf partition rel. */
+				ExecInitPartitionReturningProjection(mtstate, root_rel);
+
+				/* Close the root partitioned rel if we opened it above. */
+				if (root_rel != mtstate->resultRelInfo->ri_RelationDesc)
+					heap_close(root_rel, NoLock);
+			}
+
+			/*
+			 * Skip RETURNING processing for DELETE. We want to return rows
+			 * from INSERT.
+			 */
+			ExecDelete(tupleid, oldtuple, planSlot, epqstate, estate,
+					   &concurrently_deleted, false, false);
+
+			if (concurrently_deleted)
+				return NULL;
+
+			return ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+		}
+
+		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We have already checked partition constraints above, so skip them
+		 * below.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1312,7 +1393,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 	 */
 
 	/* Execute UPDATE with projection */
-	*returning = ExecUpdate(&tuple.t_self, NULL,
+	*returning = ExecUpdate(mtstate, &tuple.t_self, NULL,
 							mtstate->mt_conflproj, planSlot,
 							&mtstate->mt_epqstate, mtstate->ps.state,
 							canSetTag);
@@ -1602,12 +1683,13 @@ ExecModifyTable(ModifyTableState *node)
 								  estate, node->canSetTag);
 				break;
 			case CMD_UPDATE:
-				slot = ExecUpdate(tupleid, oldtuple, slot, planSlot,
+				slot = ExecUpdate(node, tupleid, oldtuple, slot, planSlot,
 								&node->mt_epqstate, estate, node->canSetTag);
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(tupleid, oldtuple, planSlot,
-								&node->mt_epqstate, estate, node->canSetTag);
+								&node->mt_epqstate, estate,
+								NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1815,44 +1897,10 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	}
 
 	/*
-	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
-	 * that we didn't build the withCheckOptionList for each partition within
-	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * Build WITH CHECK OPTION constraints for each leaf partition rel. This
+	 * only occurs for INSERT case; UPDATE/DELETE are handled above.
 	 */
-	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
-	{
-		List	   *wcoList;
-
-		Assert(operation == CMD_INSERT);
-		resultRelInfo = mtstate->mt_partitions;
-		wcoList = linitial(node->withCheckOptionLists);
-		for (i = 0; i < mtstate->mt_num_partitions; i++)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-			List	   *mapped_wcoList;
-			List	   *wcoExprs = NIL;
-			ListCell   *ll;
-
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel);
-			foreach(ll, mapped_wcoList)
-			{
-				WithCheckOption *wco = (WithCheckOption *) lfirst(ll);
-				ExprState  *wcoExpr = ExecInitQual((List *) wco->qual,
-												   mtstate->mt_plans[i]);
-
-				wcoExprs = lappend(wcoExprs, wcoExpr);
-			}
-
-			resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
-			resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
-			resultRelInfo++;
-		}
-	}
+	ExecInitPartitionWithCheckOptions(mtstate, rel);
 
 	/*
 	 * Initialize RETURNING projections if needed.
@@ -1861,7 +1909,6 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -1895,28 +1942,10 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		}
 
 		/*
-		 * Build a projection for each leaf partition rel.  Note that we
-		 * didn't build the returningList for each partition within the
-		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * Build a projection for each leaf partition rel. This only occurs for
+		 * the INSERT case; UPDATE/DELETE are handled above.
 		 */
-		resultRelInfo = mtstate->mt_partitions;
-		returningList = linitial(node->returningLists);
-		for (i = 0; i < mtstate->mt_num_partitions; i++)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-			List	   *rlist;
-
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel);
-			resultRelInfo->ri_projectReturning =
-				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
-									 resultRelInfo->ri_RelationDesc->rd_att);
-			resultRelInfo++;
-		}
+		ExecInitPartitionReturningProjection(mtstate, rel);
 	}
 	else
 	{
@@ -2143,6 +2172,103 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 }
 
 /* ----------------------------------------------------------------
+ *		ExecInitPartitionWithCheckOptions
+ *
+ * Build WITH CHECK OPTION constraints for each leaf partition rel. Note that
+ * we don't build the withCheckOptionList for each partition within the
+ * planner, but simple translation of the varattnos for each partition
+ * suffices. This only occurs for the INSERT case; UPDATE/DELETE cases are
+ * handled separately.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecInitPartitionWithCheckOptions(ModifyTableState *mtstate, Relation root_rel)
+{
+	ResultRelInfo  *resultRelInfo = mtstate->mt_partitions;
+	ModifyTable	   *node = (ModifyTable *) mtstate->ps.plan;
+	List	   *wcoList;
+	int			i;
+
+	if (node->withCheckOptionLists == NIL || mtstate->mt_num_partitions == 0)
+		return;
+
+	wcoList = linitial(node->withCheckOptionLists);
+	for (i = 0; i < mtstate->mt_num_partitions; i++)
+	{
+		Relation	partrel = resultRelInfo->ri_RelationDesc;
+		List	   *mapped_wcoList;
+		List	   *wcoExprs = NIL;
+		ListCell   *ll;
+
+		/* varno = node->nominalRelation */
+		mapped_wcoList = map_partition_varattnos(wcoList,
+												 node->nominalRelation,
+												 partrel, root_rel);
+		foreach(ll, mapped_wcoList)
+		{
+			WithCheckOption *wco = (WithCheckOption *) lfirst(ll);
+			ExprState  *wcoExpr = ExecInitQual((List *) wco->qual,
+										   mtstate->mt_plans[i]);
+
+			wcoExprs = lappend(wcoExprs, wcoExpr);
+		}
+
+		resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
+		resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
+		resultRelInfo++;
+	}
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitPartitionReturningProjection
+ *
+ * Initialize stuff required to handle RETURNING for leaf partitions.
+ * We don't build the returningList for each partition within the planner, but
+ * simple translation of the varattnos for each partition suffices.  This
+ * actually is helpful only for INSERT case; UPDATE/DELETE are handled
+ * differently.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecInitPartitionReturningProjection(ModifyTableState *mtstate, Relation root_rel)
+{
+	ResultRelInfo  *resultRelInfo = mtstate->mt_partitions;
+	ModifyTable	   *node = (ModifyTable *) mtstate->ps.plan;
+	TupleTableSlot *returning_slot = mtstate->ps.ps_ResultTupleSlot;
+	List		   *returningList;
+	int				i;
+
+	/*
+	 * If there is no returning clause, or if we have already initialized the
+	 * returning projection info, there is nothing to be done.
+	 */
+	if (node->returningLists == NIL ||
+		(resultRelInfo && resultRelInfo->ri_projectReturning != NULL) ||
+		mtstate->mt_num_partitions == 0)
+		return;
+
+	returningList = linitial(node->returningLists);
+	for (i = 0; i < mtstate->mt_num_partitions; i++)
+	{
+		Relation	partrel = resultRelInfo->ri_RelationDesc;
+		List	   *rlist;
+
+		/* varno = node->nominalRelation */
+		rlist = map_partition_varattnos(returningList,
+										node->nominalRelation,
+										partrel, root_rel);
+		resultRelInfo->ri_projectReturning =
+			ExecBuildProjectionInfo(rlist,
+									mtstate->ps.ps_ExprContext,
+									returning_slot,
+									&mtstate->ps,
+									resultRelInfo->ri_RelationDesc->rd_att);
+		resultRelInfo++;
+	}
+}
+
+
+/* ----------------------------------------------------------------
  *		ExecEndModifyTable
  *
  *		Shuts down the plan.
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 8cc5f3a..9dd67c9 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,7 +187,10 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
@@ -216,6 +219,9 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate);
 
 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
 extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 9366f04..a56afab 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,25 +198,121 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c int
 ) partition by range (a, b);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE c > 120 WITH CHECK OPTION;
 create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
 create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
+create table part_c_1_100 partition of part_b_10_b_20 for values from (1) to (100);
+create table part_c_100_200 partition of part_b_10_b_20 for values from (100) to (200);
 insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+insert into part_a_10_a_20 values ('a', 10, 200);
+insert into part_c_1_100 values ('b', 12, 96);
+insert into part_c_1_100 values ('b', 13, 97);
+insert into part_c_100_200 values ('b', 15, 105);
+insert into part_c_100_200 values ('b', 17, 105);
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96; -- No row found :
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (b, 12, 116).
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 ;
+select * from part_c_1_100 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select * from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 12 | 116
+ b | 13 | 117
+ b | 15 | 125
+ b | 17 | 125
+(4 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (b, 7, 117).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+select * from part_a_1_a_10 order by 1, 2, 3;
+ a | b |  c  
+---+---+-----
+ a | 1 |    
+ a | 4 | 200
+(2 rows)
+
+select * from part_a_10_a_20 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select * from part_b_1_b_10 order by 1, 2, 3;
+ a | b |  c  
+---+---+-----
+ b | 7 | 117
+ b | 9 | 125
+(2 rows)
+
+select * from part_c_1_100 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select * from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 11 | 125
+ b | 12 | 116
+(2 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120).
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+select * from part_a_1_a_10 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+ a | 1 |  
+(1 row)
+
+select * from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 11 | 125
+ b | 12 | 116
+ b | 15 | 199
+(3 rows)
+
 -- cleanup
+drop view upview;
 drop table range_parted;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 6637119..cda9906 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,23 +107,61 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c int
 ) partition by range (a, b);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE c > 120 WITH CHECK OPTION;
+
 create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
 create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
+create table part_c_1_100 partition of part_b_10_b_20 for values from (1) to (100);
+create table part_c_100_200 partition of part_b_10_b_20 for values from (100) to (200);
 insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
-
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
-
+insert into part_a_10_a_20 values ('a', 10, 200);
+insert into part_c_1_100 values ('b', 12, 96);
+insert into part_c_1_100 values ('b', 13, 97);
+insert into part_c_100_200 values ('b', 15, 105);
+insert into part_c_100_200 values ('b', 17, 105);
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96; -- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 ;
+select * from part_c_1_100 order by 1, 2, 3;
+select * from part_c_100_200 order by 1, 2, 3;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+select * from part_a_1_a_10 order by 1, 2, 3;
+select * from part_a_10_a_20 order by 1, 2, 3;
+select * from part_b_1_b_10 order by 1, 2, 3;
+select * from part_c_1_100 order by 1, 2, 3;
+select * from part_c_100_200 order by 1, 2, 3;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+select * from part_a_1_a_10 order by 1, 2, 3;
+select * from part_c_100_200 order by 1, 2, 3;
 -- cleanup
+drop view upview;
 drop table range_parted;
#72Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Khandekar (#70)
Re: UPDATE of partition key

On Wed, May 24, 2017 at 2:47 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 18 May 2017 at 16:52, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, May 17, 2017 at 4:05 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, May 17, 2017 at 3:15 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Earlier I thought that option 1 was better, but now I think it can
complicate the situation: we would fire the BR update trigger first and
then the BR delete trigger, which can change the row multiple times,
and defining such behaviour can be complicated.

If we have to go by this theory, then the option you have preferred
will still execute BR triggers for both delete and insert, so input
row can still be changed twice.

Yeah, right as per my theory above Option3 have the same problem.

But after putting some more thought into it, I realised that the row
can be changed only by a "Before Update" or "Before Insert" trigger.

Before Row Delete triggers can suppress the delete operation itself,
which is unintended in this case. I don't think it is advisable to
execute multiple BR triggers without the user being aware of it.

By now, the majority of opinions are against two triggers getting
fired for a single update. Amit, do you consider option 2 a valid
option?

Sounds sensible to me.

That is, fire only UPDATE triggers: BR on the source partition, and AR
on the destination partition. Do you agree that firing the BR update
trigger is essential, since it can modify the row and even prevent the
update from happening?

Agreed.
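
For what it's worth, which row-level triggers actually run during a row-moving UPDATE can be observed directly with a logging trigger on the patch's regression-test partitions (the function and trigger names below are hypothetical):

```sql
-- Hypothetical logging trigger to observe row-level trigger firing
-- during a row-moving UPDATE, using the patch's regression-test tables.
CREATE FUNCTION log_trigger() RETURNS trigger AS $$
BEGIN
    RAISE NOTICE '% % ROW on %', TG_WHEN, TG_OP, TG_TABLE_NAME;
    IF TG_OP = 'DELETE' THEN
        RETURN OLD;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER log_src
    BEFORE INSERT OR UPDATE OR DELETE ON part_c_1_100
    FOR EACH ROW EXECUTE PROCEDURE log_trigger();
CREATE TRIGGER log_dst
    BEFORE INSERT OR UPDATE OR DELETE ON part_c_100_200
    FOR EACH ROW EXECUTE PROCEDURE log_trigger();

-- Moves rows from part_c_1_100 into part_c_100_200; the notices show
-- which BR triggers fire under a given version of the patch.
UPDATE part_b_10_b_20 SET c = c + 20;
```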

Apart from the above, there is one open issue [1] related to
generating an error for a concurrent delete of a row, for which I have
mentioned a possible approach. Do you want to try that option and see
if you face any issues in making progress along those lines?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#73Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#72)
Re: UPDATE of partition key

On Wed, May 24, 2017 at 8:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, May 24, 2017 at 2:47 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

By now, the majority of opinions are against two triggers getting
fired for a single update. Amit, do you consider option 2 a valid
option?

Sounds sensible to me.

That is, fire only UPDATE triggers: BR on the source partition, and AR
on the destination partition. Do you agree that firing the BR update
trigger is essential, since it can modify the row and even prevent the
update from happening?

Agreed.

Apart from above, there is one open issue [1]

Forgot to mention the link; doing it now.

[1]: /messages/by-id/CAA4eK1KEZQ+CyXbBzfn1jFHoEfa_OemDLhLyy7xfD1QUZLo1DQ@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#74Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Kapila (#73)
Re: UPDATE of partition key

On 24 May 2017 at 20:16, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, May 24, 2017 at 8:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Apart from above, there is one open issue [1]

Forgot to mention the link; doing it now.

[1] - /messages/by-id/CAA4eK1KEZQ+CyXbBzfn1jFHoEfa_OemDLhLyy7xfD1QUZLo1DQ@mail.gmail.com

I am not sure right now whether setting the t_ctid of such tuples to
invalid would be the right option, especially because I think an
invalid t_ctid may already carry some other meaning. But maybe we can
check this further.

If we decide to error out in some way, I would be inclined towards
re-using some combination of infomask bits (like HEAP_MOVED_OFF, as
suggested upthread) rather than using an invalid t_ctid value.

But I think we can also take a step-by-step approach even for v11. If
we agree that it is ok to silently do the updates as long as we
document the behaviour, we can go ahead and do this, and then, as a
second step, implement error handling as a separate patch. If that
patch does not materialize, we at least have the current behaviour
documented.

Ideally, I think we would have liked it if we were somehow able to make
the row-movement UPDATE itself abort when it finds any normal
updates waiting for it to finish, rather than making the normal
updates fail because a row movement occurred. But I think we will
have to live with it.


#75Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Khandekar (#74)
Re: UPDATE of partition key

On Mon, May 29, 2017 at 11:20 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 24 May 2017 at 20:16, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, May 24, 2017 at 8:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Apart from above, there is one open issue [1]

Forgot to mention the link, doing it now.

[1] - /messages/by-id/CAA4eK1KEZQ+CyXbBzfn1jFHoEfa_OemDLhLyy7xfD1QUZLo1DQ@mail.gmail.com

I am not sure right now whether making the t_ctid of such tuples
Invalid would be the right option, especially because I think there
may already be some other meaning when t_ctid is not valid.

AFAIK, this is used to point to current tuple itself or newer version
of a tuple or is used in speculative inserts (refer comments above
HeapTupleHeaderData in htup_details.h). Can you mention what other
meaning are you referring here for InvalidBlockId in t_ctid?

But maybe we
can check this more.

If we decide to error out in some way, I would be inclined towards
considering re-using some combination of infomask bits (like
HEAP_MOVED_OFF, as suggested upthread) rather than using an invalid t_ctid
value.

But I think we can also take a step-by-step approach even for v11. If
we agree that it is ok to silently do the updates as long as we
document the behaviour, we can go ahead and do this, and then as a
second step, implement error handling as a separate patch. If that
patch does not materialize, we at least have the current behaviour
documented.

I think that is a sensible approach if we find the second step involves
big or complicated changes.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#76Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#75)
Re: UPDATE of partition key

On Mon, May 29, 2017 at 5:26 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

But I think we can also take a step-by-step approach even for v11. If
we agree that it is ok to silently do the updates as long as we
document the behaviour, we can go ahead and do this, and then as a
second step, implement error handling as a separate patch. If that
patch does not materialize, we at least have the current behaviour
documented.

I think that is a sensible approach if we find the second step involves
big or complicated changes.

I think it is definitely a good idea to separate the two patches.
UPDATE tuple routing without any special handling for the EPQ issue is
just a partitioning feature. The proposed handling for the EPQ issue
is an *on-disk format change*. That turns a patch which is subject
only to routine bugs into one which can eat your data permanently --
so having the "can eat your data permanently" separated out for both
review and commit seems only prudent. For me, it's not a matter of
which patch is big or complicated, but rather a matter of one of them
being a whole lot riskier than the other. Even UPDATE tuple routing
could mess things up pretty seriously if we end up with tuples in the
wrong partition, of course, but the other thing is still worse.

In terms of a development plan, I think we would need to have both
patches before either could be committed. I believe that everyone
other than me who has expressed an opinion on this issue has said that
it's unacceptable to just ignore the issue, so it doesn't sound like
there will be much appetite for having #1 go into the tree without #2.
I'm still really concerned about that approach because we do not have
very much bit space left and WARM wants to use quite a bit of it. I
think it's quite possible that we'll be sad in the future if we find
that we can't implement feature XYZ because of the bit-space consumed
by this feature. However, I don't have the only vote here and I'm not
going to try to shove this into the tree over multiple objections
(unless there are a lot more votes the other way, but so far there's
no sign of that).

Greg/Amit's idea of using the CTID field rather than an infomask bit
seems like a possibly promising approach. Not everything that needs
bit-space can use the CTID field, so using it is a little less likely
to conflict with something else we want to do in the future than using
a precious infomask bit. However, I'm worried about this:

/* Make sure there is no forward chain link in t_ctid */
tp.t_data->t_ctid = tp.t_self;

The comment does not say *why* we need to make sure that there is no
forward chain link, but it implies that some code somewhere in the
system does or at one time did depend on no forward link existing.
Any such code that still exists will need to be updated. Anybody know
what code that might be, exactly?

The other potential issue I see here is that I know the WARM code also
tries to use the bit-space in the CTID field; in particular, it uses
the CTID field of the last tuple in a HOT chain to point back to the
root of the chain. That seems like it could conflict with the usage
proposed here, but I'm not totally sure. Has anyone investigated this
issue?

Regarding the trigger issue, I can't claim to have a terribly strong
opinion on this. I think that practically anything we do here might
upset somebody, but probably any halfway-reasonable thing we choose to
do will be OK for most people. However, there seems to be a
discrepancy between the approach that got the most votes and the one
that is implemented by the v8 patch, so that seems like something to
fix.

For what it's worth, in the future, I imagine that we might allow
adding a trigger to a partitioned table and having that cascade down
to all descendant tables. In that world, firing the BR UPDATE trigger
for the old partition and the AR UPDATE trigger for the new partition
will look a lot like the behavior the user would expect on an
unpartitioned table, which could be viewed as a good thing. On the
other hand, it's still going to be a DELETE+INSERT under the hood for
the foreseeable future, so firing the delete triggers and then the
insert triggers is also defensible. Is there any big difference
between these approaches in terms of how much code is required to make
this work?

In terms of the approach taken by the patch itself, it seems
surprising to me that the patch only calls
ExecSetupPartitionTupleRouting when an update fails the partition
constraint. Note that in the insert case, we call that function at
the start of execution; calling it in the middle seems to involve
additional hazards; for example, is it really safe to add additional
ResultRelInfos midway through the operation? Is it safe to take more
locks midway through the operation? It seems like it might be a lot
safer to decide at the beginning of the operation whether this is
needed -- we can skip it if none of the columns involved in the
partition key (or partition key expressions) are mentioned in the
update. (There's also the issue of triggers, but I'm not sure that
it's sensible to allow a trigger on an individual partition to reroute
an update to another partition; what if we get an infinite loop?)

+ if (concurrently_deleted)
+ return NULL;

I don't understand the motivation for this change, and there are no
comments explaining it that I can see.

Perhaps the concurrency-related (i.e. EPQ) behavior here could be
tested via the isolation tester.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#77Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#76)
Re: UPDATE of partition key

On 1 June 2017 at 03:25, Robert Haas <robertmhaas@gmail.com> wrote:

Greg/Amit's idea of using the CTID field rather than an infomask bit
seems like a possibly promising approach. Not everything that needs
bit-space can use the CTID field, so using it is a little less likely
to conflict with something else we want to do in the future than using
a precious infomask bit. However, I'm worried about this:

/* Make sure there is no forward chain link in t_ctid */
tp.t_data->t_ctid = tp.t_self;

The comment does not say *why* we need to make sure that there is no
forward chain link, but it implies that some code somewhere in the
system does or at one time did depend on no forward link existing.
Any such code that still exists will need to be updated. Anybody know
what code that might be, exactly?

I am going to have an overall look at this approach, and at code
elsewhere that might be assuming that t_ctid cannot be Invalid.

Regarding the trigger issue, I can't claim to have a terribly strong
opinion on this. I think that practically anything we do here might
upset somebody, but probably any halfway-reasonable thing we choose to
do will be OK for most people. However, there seems to be a
discrepancy between the approach that got the most votes and the one
that is implemented by the v8 patch, so that seems like something to
fix.

Yes, I have started working on updating the patch to use that approach
(BR and AR update triggers on source and destination partition
respectively, instead of delete+insert). The approach taken by the
patch (BR update + delete+insert triggers) didn't require any changes
in the way ExecDelete() and ExecInsert() were called. Now we would
need to skip the delete/insert triggers, so some flags need to be
passed to these functions, or else have stripped down versions of
ExecDelete() and ExecInsert() which don't do other things like
RETURNING handling and firing triggers.

For what it's worth, in the future, I imagine that we might allow
adding a trigger to a partitioned table and having that cascade down
to all descendant tables. In that world, firing the BR UPDATE trigger
for the old partition and the AR UPDATE trigger for the new partition
will look a lot like the behavior the user would expect on an
unpartitioned table, which could be viewed as a good thing. On the
other hand, it's still going to be a DELETE+INSERT under the hood for
the foreseeable future, so firing the delete triggers and then the
insert triggers is also defensible.

Ok, I was assuming that there were no plans to support triggers
on a partitioned table, but yes, I had imagined how the behaviour
would be in that world. Currently, users who want triggers on
a table that happens to be a partitioned table have to install the
same trigger on each of the leaf partitions, since there is no other
choice. But we would never know whether a trigger on a leaf partition
was actually meant to be specifically on that individual partition or
whether it was actually meant to be a trigger on the root partitioned table.
Hence the difficulty of deciding the right behaviour in the case
of triggers with row movement.

If we have an AR UPDATE trigger on the root table, then during row
movement it does not matter whether we fire the trigger on the source or
the destination, because it is the same single trigger cascaded onto both
partitions. If there is a trigger installed specifically on a leaf
partition, then we know that it should not be fired on other
partitions, since it is specifically made for this one. And the same
applies to delete and insert triggers: if installed on the parent, don't
involve them in row movement; only fire them if installed on leaf
partitions (regardless of whether it was an internally generated
delete+insert due to row movement). Similarly we can think about BR
triggers.

Of course, DBAs should be aware of triggers already
installed on the table's ancestors before installing a new one on a
child table.

Overall, it becomes much clearer what to do if we decide to allow
triggers on partitioned tables.

Is there any big difference between these approaches in terms
of how much code is required to make this work?

You mean if we allow triggers on partitioned tables? I think we would
have to keep some flag in the trigger data (or somewhere else) indicating
that the trigger actually belongs to an upper partitioned table, and so for
delete+insert, not fire such a trigger. Other than that, we don't have
to decide in any unique way which trigger to fire on which table.

In terms of the approach taken by the patch itself, it seems
surprising to me that the patch only calls
ExecSetupPartitionTupleRouting when an update fails the partition
constraint. Note that in the insert case, we call that function at
the start of execution;

calling it in the middle seems to involve additional hazards;
for example, is it really safe to add additional
ResultRelInfos midway through the operation?

I thought since the additional ResultRelInfos go into
mtstate->mt_partitions which is independent of
estate->es_result_relations, that should be safe.

Is it safe to take more locks midway through the operation?

I can imagine some rows already updated when other tasks like ALTER
TABLE or CREATE INDEX happen on other partitions that are still
unlocked, and then for row movement we try to lock these other
partitions and wait for the DDL tasks to complete. But I didn't see
any particular issues with that; correct me if you suspect a
possible issue. One issue could arise if we were able to modify the table
attributes, but I believe we cannot do that for inherited columns.

It seems like it might be a lot
safer to decide at the beginning of the operation whether this is
needed -- we can skip it if none of the columns involved in the
partition key (or partition key expressions) are mentioned in the
update.
(There's also the issue of triggers,

The reason I thought it cannot be done at the start of execution
is that even if we know the update is not modifying a partition
key column, we are not certain that the final NEW row has its
partition key column unchanged, because of triggers. I understand it
might be weird for a user to want to modify a partition key value that way,
but if a user does that, it will result in a crash, because we won't have
the partition routing set up, having assumed that there is no partition key
column in the UPDATE.

And we also cannot unconditionally setup the partition routing on all
updates, for performance reasons.

I'm not sure that it's sensible to allow a trigger on an
individual partition to reroute an update to another partition
what if we get an infinite loop?)

You mean, if the other table has another trigger that will again route
the row to the original partition? But couldn't this infinite loop problem
occur even for two normal tables?

+ if (concurrently_deleted)
+ return NULL;

I don't understand the motivation for this change, and there are no
comments explaining it that I can see.

Yeah, comments, I think, are missing. I thought they were there in
ExecDelete(), but they are not.
If a concurrent delete has already deleted the row, we should not bother
about moving the row; hence the above code.

Perhaps the concurrency-related (i.e. EPQ) behavior here could be
tested via the isolation tester.

Will check.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


#78Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#77)
Re: UPDATE of partition key

On Thu, Jun 1, 2017 at 7:41 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Regarding the trigger issue, I can't claim to have a terribly strong
opinion on this. I think that practically anything we do here might
upset somebody, but probably any halfway-reasonable thing we choose to
do will be OK for most people. However, there seems to be a
discrepancy between the approach that got the most votes and the one
that is implemented by the v8 patch, so that seems like something to
fix.

Yes, I have started working on updating the patch to use that approach
(BR and AR update triggers on source and destination partition
respectively, instead of delete+insert). The approach taken by the
patch (BR update + delete+insert triggers) didn't require any changes
in the way ExecDelete() and ExecInsert() were called. Now we would
need to skip the delete/insert triggers, so some flags need to be
passed to these functions, or else have stripped down versions of
ExecDelete() and ExecInsert() which don't do other things like
RETURNING handling and firing triggers.

See, that strikes me as a pretty good argument for firing the
DELETE+INSERT triggers...

I'm not wedded to that approach, but "what makes the code simplest?"
is not a bad tiebreak, other things being equal.

In terms of the approach taken by the patch itself, it seems
surprising to me that the patch only calls
ExecSetupPartitionTupleRouting when an update fails the partition
constraint. Note that in the insert case, we call that function at
the start of execution;

calling it in the middle seems to involve additional hazards;
for example, is it really safe to add additional
ResultRelInfos midway through the operation?

I thought since the additional ResultRelInfos go into
mtstate->mt_partitions which is independent of
estate->es_result_relations, that should be safe.

I don't know. That sounds scary to me, but it might be OK. Probably
needs more study.

Is it safe to take more locks midway through the operation?

I can imagine some rows already updated, when other tasks like ALTER
TABLE or CREATE INDEX happen on other partitions which are still
unlocked, and then for row movement we try to lock these other
partitions and wait for the DDL tasks to complete. But I didn't see
any particular issues with that. But correct me if you suspect a
possible issue. One issue can be if we were able to modify the table
attributes, but I believe we cannot do that for inherited columns.

It's just that it's very unlike what we do anywhere else. I don't
have a real specific idea in mind about what might totally break, but
at a minimum it could certainly cause behavior that can't happen
today. Today, if you run a query on some tables, it will block
waiting for any locks at the beginning of the query, and the query
won't begin executing until it has all of the locks. With this
approach, you might block midway through; you might even deadlock
midway through. Maybe that's not overtly broken, but it's at least
got the possibility of being surprising.

Now, I'd actually kind of like to have behavior like this for other
cases, too. If we're inserting one row, can't we just lock the one
partition into which it needs to get inserted, rather than all of
them? But I'm wary of introducing such behavior incidentally in a
patch whose main goal is to allow UPDATE row movement. Figuring out
what could go wrong and fixing it seems like a substantial project all
of its own.

The reason I thought it cannot be done at the start of the execution,
is because even if we know that update is not modifying the partition
key column, we are not certain that the final NEW row has its
partition key column unchanged, because of triggers. I understand it
might be weird for a user requiring to modify a partition key value,
but if a user does that, it will result in crash because we won't have
the partition routing setup, thinking that there is no partition key
column in the UPDATE.

I think we could avoid that issue. Suppose we select the target
partition based only on the original NEW tuple. If a trigger on that
partition subsequently modifies the tuple so that it no longer
satisfies the partition constraint for that partition, just let it
ERROR out normally. Actually, it seems like that's probably the
*easiest* behavior to implement. Otherwise, you might fire triggers,
discover that you need to re-route the tuple, and then ... fire
triggers again on the new partition, which might reroute it again?

I'm not sure that it's sensible to allow a trigger on an
individual partition to reroute an update to another partition
what if we get an infinite loop?)

You mean, if the other table has another trigger that will again route
to the original partition ? But this infinite loop problem could occur
even for 2 normal tables ?

How? For a normal trigger, nothing it does can change which table is targeted.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#79Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#78)
Re: UPDATE of partition key

On 2 June 2017 at 01:17, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jun 1, 2017 at 7:41 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Regarding the trigger issue, I can't claim to have a terribly strong
opinion on this. I think that practically anything we do here might
upset somebody, but probably any halfway-reasonable thing we choose to
do will be OK for most people. However, there seems to be a
discrepancy between the approach that got the most votes and the one
that is implemented by the v8 patch, so that seems like something to
fix.

Yes, I have started working on updating the patch to use that approach
(BR and AR update triggers on source and destination partition
respectively, instead of delete+insert). The approach taken by the
patch (BR update + delete+insert triggers) didn't require any changes
in the way ExecDelete() and ExecInsert() were called. Now we would
need to skip the delete/insert triggers, so some flags need to be
passed to these functions, or else have stripped down versions of
ExecDelete() and ExecInsert() which don't do other things like
RETURNING handling and firing triggers.

See, that strikes me as a pretty good argument for firing the
DELETE+INSERT triggers...

I'm not wedded to that approach, but "what makes the code simplest?"
is not a bad tiebreak, other things being equal.

Yes, that sounds good to me. But I think we should wait for others'
opinions, because it is quite understandable that two triggers firing on
the same partition sounds odd.

In terms of the approach taken by the patch itself, it seems
surprising to me that the patch only calls
ExecSetupPartitionTupleRouting when an update fails the partition
constraint. Note that in the insert case, we call that function at
the start of execution;

calling it in the middle seems to involve additional hazards;
for example, is it really safe to add additional
ResultRelInfos midway through the operation?

I thought since the additional ResultRelInfos go into
mtstate->mt_partitions which is independent of
estate->es_result_relations, that should be safe.

I don't know. That sounds scary to me, but it might be OK. Probably
needs more study.

Is it safe to take more locks midway through the operation?

I can imagine some rows already updated, when other tasks like ALTER
TABLE or CREATE INDEX happen on other partitions which are still
unlocked, and then for row movement we try to lock these other
partitions and wait for the DDL tasks to complete. But I didn't see
any particular issues with that. But correct me if you suspect a
possible issue. One issue can be if we were able to modify the table
attributes, but I believe we cannot do that for inherited columns.

It's just that it's very unlike what we do anywhere else. I don't
have a real specific idea in mind about what might totally break, but
at a minimum it could certainly cause behavior that can't happen
today. Today, if you run a query on some tables, it will block
waiting for any locks at the beginning of the query, and the query
won't begin executing until it has all of the locks. With this
approach, you might block midway through; you might even deadlock
midway through. Maybe that's not overtly broken, but it's at least
got the possibility of being surprising.

Now, I'd actually kind of like to have behavior like this for other
cases, too. If we're inserting one row, can't we just lock the one
partition into which it needs to get inserted, rather than all of
them? But I'm wary of introducing such behavior incidentally in a
patch whose main goal is to allow UPDATE row movement. Figuring out
what could go wrong and fixing it seems like a substantial project all
of its own.

Yes, I agree it makes sense to try to avoid introducing something we
haven't tried before in this patch, as far as possible.

The reason I thought it cannot be done at the start of the execution,
is because even if we know that update is not modifying the partition
key column, we are not certain that the final NEW row has its
partition key column unchanged, because of triggers. I understand it
might be weird for a user requiring to modify a partition key value,
but if a user does that, it will result in crash because we won't have
the partition routing setup, thinking that there is no partition key
column in the UPDATE.

I think we could avoid that issue. Suppose we select the target
partition based only on the original NEW tuple. If a trigger on that
partition subsequently modifies the tuple so that it no longer
satisfies the partition constraint for that partition, just let it
ERROR out normally.

Ok, so you are saying, don't allow a partition trigger to initiate
row movement. I think we should keep this as a documented restriction.
Actually, it would be unfortunate to have to keep this
restriction only because of an implementation issue.

So, according to that, below would be the logic :

Run the partition constraint check on the original NEW row.
If it succeeds:
{
    Fire the BR UPDATE trigger on the original partition.
    Run the partition constraint check again with the modified NEW row
    (maybe do this only if the trigger modified the partition key).
    If it fails,
        abort.
    Else
        proceed with the usual local update.
}
else
{
    Fire the BR UPDATE trigger on the original partition.
    Find the right partition for the modified NEW row.
    If it is the same partition,
        proceed with the usual local update.
    else
        do the row movement.
}

Actually, it seems like that's probably the
*easiest* behavior to implement. Otherwise, you might fire triggers,
discover that you need to re-route the tuple, and then ... fire
triggers again on the new partition, which might reroute it again?

Why would a BR update trigger fire on the new partition? On the new
partition, only the BR INSERT trigger would fire, if at all we decide to
fire delete+insert triggers. And the insert trigger would not again cause
the tuple to be re-routed, because it's an insert.

I'm not sure that it's sensible to allow a trigger on an
individual partition to reroute an update to another partition
what if we get an infinite loop?)

You mean, if the other table has another trigger that will again route
to the original partition ? But this infinite loop problem could occur
even for 2 normal tables ?

How? For a normal trigger, nothing it does can change which table is targeted.

I thought you were considering the possibility that on the new
partition, the trigger function itself runs another update statement,
which is also possible for normal tables.

But now I think you are saying that the row being inserted into
the new partition might again get modified by the INSERT trigger on
the new partition, which might in turn cause it to fail the new
partition constraint. But in that case, it will not cause another row
movement, because in the new partition it's an INSERT, not an UPDATE,
so the operation would end there, aborted.

But correct me if you were thinking of a different scenario that can
cause an infinite loop.

-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


#80Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Khandekar (#79)
Re: UPDATE of partition key

On Fri, Jun 2, 2017 at 4:37 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 2 June 2017 at 01:17, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jun 1, 2017 at 7:41 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Regarding the trigger issue, I can't claim to have a terribly strong
opinion on this. I think that practically anything we do here might
upset somebody, but probably any halfway-reasonable thing we choose to
do will be OK for most people. However, there seems to be a
discrepancy between the approach that got the most votes and the one
that is implemented by the v8 patch, so that seems like something to
fix.

Yes, I have started working on updating the patch to use that approach
(BR and AR update triggers on source and destination partition
respectively, instead of delete+insert). The approach taken by the
patch (BR update + delete+insert triggers) didn't require any changes
in the way ExecDelete() and ExecInsert() were called. Now we would
need to skip the delete/insert triggers, so some flags need to be
passed to these functions,

I thought you already need to pass an additional flag for special
handling of the ctid in the Delete case. For Insert, a new flag needs to be
passed, and a check for it is needed in a few places.

or else have stripped down versions of

ExecDelete() and ExecInsert() which don't do other things like
RETURNING handling and firing triggers.

See, that strikes me as a pretty good argument for firing the
DELETE+INSERT triggers...

I'm not wedded to that approach, but "what makes the code simplest?"
is not a bad tiebreak, other things being equal.

Yes, that sounds good to me.

I am okay if we want to go ahead with firing BR UPDATE + DELETE +
INSERT triggers for an UPDATE statement (when row movement happens) on
the grounds of code simplicity, but it sounds like slightly odd behavior.

But I think we should wait for others'
opinions, because it is quite understandable that two triggers firing on
the same partition sounds odd.

Yeah, but I think we will have to rely on the docs in this case, as the
behavior is not intuitive.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#81Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#76)
Re: UPDATE of partition key

On Thu, Jun 1, 2017 at 3:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, May 29, 2017 at 5:26 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

But I think we can also take a step-by-step approach even for v11. If
we agree that it is ok to silently do the updates as long as we
document the behaviour, we can go ahead and do this, and then as a
second step, implement error handling as a separate patch. If that
patch does not materialize, we at least have the current behaviour
documented.

I think that is a sensible approach if we find the second step involves
big or complicated changes.

I think it is definitely a good idea to separate the two patches.
UPDATE tuple routing without any special handling for the EPQ issue is
just a partitioning feature. The proposed handling for the EPQ issue
is an *on-disk format change*. That turns a patch which is subject
only to routine bugs into one which can eat your data permanently --
so having the "can eat your data permanently" separated out for both
review and commit seems only prudent. For me, it's not a matter of
which patch is big or complicated, but rather a matter of one of them
being a whole lot riskier than the other. Even UPDATE tuple routing
could mess things up pretty seriously if we end up with tuples in the
wrong partition, of course, but the other thing is still worse.

In terms of a development plan, I think we would need to have both
patches before either could be committed. I believe that everyone
other than me who has expressed an opinion on this issue has said that
it's unacceptable to just ignore the issue, so it doesn't sound like
there will be much appetite for having #1 go into the tree without #2.
I'm still really concerned about that approach because we do not have
very much bit space left and WARM wants to use quite a bit of it. I
think it's quite possible that we'll be sad in the future if we find
that we can't implement feature XYZ because of the bit-space consumed
by this feature. However, I don't have the only vote here and I'm not
going to try to shove this into the tree over multiple objections
(unless there are a lot more votes the other way, but so far there's
no sign of that).

Greg/Amit's idea of using the CTID field rather than an infomask bit
seems like a possibly promising approach. Not everything that needs
bit-space can use the CTID field, so using it is a little less likely
to conflict with something else we want to do in the future than using
a precious infomask bit. However, I'm worried about this:

/* Make sure there is no forward chain link in t_ctid */
tp.t_data->t_ctid = tp.t_self;

The comment does not say *why* we need to make sure that there is no
forward chain link, but it implies that some code somewhere in the
system does or at one time did depend on no forward link existing.

I think it is to ensure that the EvalPlanQual mechanism gets invoked in
the right case. The visibility routine will return HeapTupleUpdated
both when the tuple is deleted and when it is updated (i.e. it has a
newer version), so we use ctid to decide whether we need to follow the
tuple chain to a newer version of the tuple.

Any such code that still exists will need to be updated.

Yeah.

The other potential issue I see here is that I know the WARM code also
tries to use the bit-space in the CTID field; in particular, it uses
the CTID field of the last tuple in a HOT chain to point back to the
root of the chain. That seems like it could conflict with the usage
proposed here, but I'm not totally sure.

The proposed change in WARM tuple patch uses ip_posid field of CTID
and we are planning to use ip_blkid field. Here is the relevant text
and code from WARM tuple patch:

"Store the root line pointer of the WARM chain in the t_ctid.ip_posid
field of the last tuple in the chain and mark the tuple header with
HEAP_TUPLE_LATEST flag to record that fact."

+#define HeapTupleHeaderSetHeapLatest(tup, offnum) \
+do { \
+ AssertMacro(OffsetNumberIsValid(offnum)); \
+ (tup)->t_infomask2 |= HEAP_LATEST_TUPLE; \
+ ItemPointerSetOffsetNumber(&(tup)->t_ctid, (offnum)); \
+} while (0)

For further details, refer patch 0001-Track-root-line-pointer-v23_v26
in the below e-mail:
/messages/by-id/CABOikdOTstHK2y0rDk+Y3Wx9HRe+bZtj3zuYGU=VngneiHo5KQ@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#82Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Kapila (#80)
Re: UPDATE of partition key

On 5 June 2017 at 11:27, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jun 2, 2017 at 4:37 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 2 June 2017 at 01:17, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jun 1, 2017 at 7:41 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Regarding the trigger issue, I can't claim to have a terribly strong
opinion on this. I think that practically anything we do here might
upset somebody, but probably any halfway-reasonable thing we choose to
do will be OK for most people. However, there seems to be a
discrepancy between the approach that got the most votes and the one
that is implemented by the v8 patch, so that seems like something to
fix.

Yes, I have started working on updating the patch to use that approach
(BR and AR update triggers on source and destination partition
respectively, instead of delete+insert). The approach taken by the
patch (BR update + delete+insert triggers) didn't require any changes
in the way ExecDelete() and ExecInsert() were called. Now we would
need to skip the delete/insert triggers, so some flags need to be
passed to these functions,

I thought you already need to pass an additional flag for special
handling of ctid in Delete case.

Yeah that was unavoidable.

For Insert, a new flag needs to be
passed, and we need to have a check for that in a few places.

For skipping the delete and insert triggers, we need to include still
another flag, plus checks in both ExecDelete() and ExecInsert() for
skipping both the BR and AR triggers, and then in ExecUpdate(), again a
call to ExecARUpdateTriggers() before quitting.

or else have stripped down versions of

ExecDelete() and ExecInsert() which don't do other things like
RETURNING handling and firing triggers.

See, that strikes me as a pretty good argument for firing the
DELETE+INSERT triggers...

I'm not wedded to that approach, but "what makes the code simplest?"
is not a bad tiebreak, other things being equal.

Yes, that sounds good to me.

I am okay if we want to go ahead with firing BR UPDATE + DELETE +
INSERT triggers for an Update statement (when row movement happens) on
the argument of code simplicity, but it sounds slightly odd behavior.

Ok. Will keep this behaviour that is already present in the patch. I
myself also feel that code simplicity can be used as a tie-breaker if
we cannot agree on a single behaviour that completely satisfies all
aspects.

But I think we want to wait for other's
opinion because it is quite understandable that two triggers firing on
the same partition sounds odd.

Yeah, but I think we have to rely on docs in this case as behavior is
not intuitive.

Agreed. The doc changes in the patch already explain this behaviour
in detail.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


#83Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#79)
Re: UPDATE of partition key

On Fri, Jun 2, 2017 at 7:07 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

So, according to that, below would be the logic :

Run partition constraint check on the original NEW row.
If it succeeds :
{
Fire BR UPDATE trigger on the original partition.
Run partition constraint check again with the modified NEW row
(may be do this only if the trigger modified the partition key)
If it fails,
abort.
Else
proceed with the usual local update.
}
else
{
Fire BR UPDATE trigger on original partition.
Find the right partition for the modified NEW row.
If it is the same partition,
proceed with the usual local update.
else
do the row movement.
}

Sure, that sounds about right, although the "Fire BR UPDATE trigger on
the original partition." is the same in both branches, so I'm not
quite sure why you have that in the "if" block.

Actually, it seems like that's probably the
*easiest* behavior to implement. Otherwise, you might fire triggers,
discover that you need to re-route the tuple, and then ... fire
triggers again on the new partition, which might reroute it again?

Why would a BR UPDATE trigger fire on the new partition? On the new
partition, only a BR INSERT trigger would fire, if at all we decide to
fire delete+insert triggers. And the insert trigger would not again cause
the tuple to be re-routed, because it's an insert.

OK, sure, that makes sense. I guess it's really the insert case that
I was worried about -- if we have a BEFORE ROW INSERT trigger and it
changes the tuple and we reroute it, I think we'd have to fire the
BEFORE ROW INSERT on the new partition, which might change the tuple
again and cause yet another reroute, and in the worst case this is an
infinite loop. But it sounds like we're going to fix that problem --
I think correctly -- by only ever allowing the tuple to be routed
once. If some trigger tries to change the tuple after that
such that re-routing is required, they get an error. And what you are
describing here seems like it will be fine.

But now I think you are saying, the row that is being inserted into
the new partition might get again modified by the INSERT trigger on
the new partition, which might in turn cause it to fail the new
partition constraint. But in that case, it will not cause another row
movement, because in the new partition, it's an INSERT, not an UPDATE,
so the operation would end there, aborted.

Yeah, that's what I was worried about. I didn't want a row movement
to be able to trigger another row movement and so on ad infinitum.
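[Editor's note] The "route at most once" rule agreed above can be sketched as follows. This is a minimal illustration, not PostgreSQL code; the enum and function names are inventions of this sketch.

```c
#include <stdbool.h>

/*
 * Sketch of the "route at most once" rule: an UPDATE either stays local,
 * moves the tuple to exactly one new partition, or errors out. A trigger
 * on the destination can never cause a second re-route, which is what
 * rules out the infinite loop discussed above.
 */
enum route_result { ROUTE_LOCAL, ROUTE_MOVED, ROUTE_ERROR };

static enum route_result
route_at_most_once(bool fits_source, bool fits_destination_after_triggers)
{
    if (fits_source)
        return ROUTE_LOCAL;     /* ordinary in-place update */
    if (fits_destination_after_triggers)
        return ROUTE_MOVED;     /* tuple moved exactly once */
    return ROUTE_ERROR;         /* no second re-route: raise an error */
}
```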

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#84Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#81)
Re: UPDATE of partition key

On Mon, Jun 5, 2017 at 2:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Greg/Amit's idea of using the CTID field rather than an infomask bit
seems like a possibly promising approach. Not everything that needs
bit-space can use the CTID field, so using it is a little less likely
to conflict with something else we want to do in the future than using
a precious infomask bit. However, I'm worried about this:

/* Make sure there is no forward chain link in t_ctid */
tp.t_data->t_ctid = tp.t_self;

The comment does not say *why* we need to make sure that there is no
forward chain link, but it implies that some code somewhere in the
system does or at one time did depend on no forward link existing.

I think it is to ensure that the EvalPlanQual mechanism gets invoked in
the right case. The visibility routine will return HeapTupleUpdated
both when the tuple is deleted and when it is updated (i.e. it has a
newer version), so we use ctid to decide whether we need to follow the
tuple chain to a newer version of the tuple.

That would explain why we need to make sure that there *is* a forward
chain link in t_ctid for an update, but it doesn't explain why we need
to make sure that there *isn't* a forward link for a delete.

The proposed change in WARM tuple patch uses ip_posid field of CTID
and we are planning to use ip_blkid field. Here is the relevant text
and code from WARM tuple patch:

"Store the root line pointer of the WARM chain in the t_ctid.ip_posid
field of the last tuple in the chain and mark the tuple header with
HEAP_TUPLE_LATEST flag to record that fact."

+#define HeapTupleHeaderSetHeapLatest(tup, offnum) \
+do { \
+ AssertMacro(OffsetNumberIsValid(offnum)); \
+ (tup)->t_infomask2 |= HEAP_LATEST_TUPLE; \
+ ItemPointerSetOffsetNumber(&(tup)->t_ctid, (offnum)); \
+} while (0)

For further details, refer patch 0001-Track-root-line-pointer-v23_v26
in the below e-mail:
/messages/by-id/CABOikdOTstHK2y0rDk+Y3Wx9HRe+bZtj3zuYGU=VngneiHo5KQ@mail.gmail.com

OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#85Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#84)
Re: UPDATE of partition key

On Tue, Jun 6, 2017 at 11:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jun 5, 2017 at 2:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Greg/Amit's idea of using the CTID field rather than an infomask bit
seems like a possibly promising approach. Not everything that needs
bit-space can use the CTID field, so using it is a little less likely
to conflict with something else we want to do in the future than using
a precious infomask bit. However, I'm worried about this:

/* Make sure there is no forward chain link in t_ctid */
tp.t_data->t_ctid = tp.t_self;

The comment does not say *why* we need to make sure that there is no
forward chain link, but it implies that some code somewhere in the
system does or at one time did depend on no forward link existing.

I think it is to ensure that the EvalPlanQual mechanism gets invoked in
the right case. The visibility routine will return HeapTupleUpdated
both when the tuple is deleted and when it is updated (i.e. it has a
newer version), so we use ctid to decide whether we need to follow the
tuple chain to a newer version of the tuple.

That would explain why we need to make sure that there *is* a forward
chain link in t_ctid for an update, but it doesn't explain why we need
to make sure that there *isn't* a forward link for a delete.

As far as I understand, it is to ensure that for deleted rows, nothing
more needs to be done. For example, see the below check in
ExecUpdate/ExecDelete.
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
..
}
..

Also a similar check in ExecLockRows. Now for deleted rows, if the
t_ctid wouldn't point to itself, then in the mentioned functions, we
were not in a position to conclude that the row is deleted.
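[Editor's note] The self-pointing t_ctid convention being discussed can be sketched with simplified stand-in types. The names mimic PostgreSQL's, but the layout here is an assumption of this sketch, not the actual on-disk definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified stand-ins for PostgreSQL's ItemPointerData / t_ctid; the
 * field names mirror the real ones but this is a sketch, not the real
 * definition.
 */
typedef struct ItemPointerData
{
    uint32_t    ip_blkid;       /* heap block number */
    uint16_t    ip_posid;       /* line pointer offset within the block */
} ItemPointerData;

static bool
item_pointer_equals(ItemPointerData a, ItemPointerData b)
{
    return a.ip_blkid == b.ip_blkid && a.ip_posid == b.ip_posid;
}

/*
 * DELETE sets t_ctid = t_self, so the update chain ends at the tuple
 * itself; UPDATE sets t_ctid to the TID of the newer version.  This is
 * the property the ItemPointerEquals() checks in ExecUpdate/ExecDelete
 * rely on to conclude that a row was deleted rather than updated.
 */
static bool
tuple_was_deleted(ItemPointerData t_self, ItemPointerData t_ctid)
{
    return item_pointer_equals(t_self, t_ctid);
}
```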

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#86Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#83)
1 attachment(s)
Re: UPDATE of partition key

On 6 June 2017 at 23:52, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jun 2, 2017 at 7:07 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

So, according to that, below would be the logic :

Run partition constraint check on the original NEW row.
If it succeeds :
{
Fire BR UPDATE trigger on the original partition.
Run partition constraint check again with the modified NEW row
(may be do this only if the trigger modified the partition key)
If it fails,
abort.
Else
proceed with the usual local update.
}
else
{
Fire BR UPDATE trigger on original partition.
Find the right partition for the modified NEW row.
If it is the same partition,
proceed with the usual local update.
else
do the row movement.
}

Sure, that sounds about right, although the "Fire BR UPDATE trigger on
the original partition." is the same in both branches, so I'm not
quite sure why you have that in the "if" block.

Actually after coding this logic, it looks a bit different. See
ExecUpdate() in the attached file trigger_related_changes.patch

----

Now that we are making sure a trigger won't change the partition of the
tuple, the next thing we need to do is make sure the tuple routing setup
is done *only* if the UPDATE modifies partition keys. Otherwise,
this will degrade normal update performance.

Below is the logic I am implementing for determining whether the
UPDATE is modifying partition keys.

In ExecInitModifyTable() ...
Call GetUpdatedColumns(mtstate->rootResultRelInfo, estate) to get
updated_columns.
For each of the updated_columns :
{
Check if the column is part of partition key quals of any of
the relations in mtstate->resultRelInfo[] array.
/*
* mtstate->resultRelInfo[] contains exactly those leaf partitions
* which qualify the update quals.
*/

If (it is part of partition key quals of at least one of the relations)
{
Do ExecSetupPartitionTupleRouting() for the root partition.
break;
}
}

Few things need to be considered :

Use Relation->rd_partcheck to get partition check quals of each of the
relations in mtstate->resultRelInfo[].

The Relation->rd_partcheck of the leaf partitions would include the
ancestors' partition quals as well. So we are good: we don't have to
explicitly get the upper partition constraints. Note that an UPDATE
can modify a column which is not used in the partition constraint
expressions of any of the partitions or partitioned tables in the
subtree, but that column may have been used in the partition constraint
of a partitioned table belonging to an upper subtree.

All of the relations in mtstate->resultRelInfo are already open. So we
don't need to re-open any more relations to get the partition quals.

The column bitmap set returned by GetUpdatedColumns() refers to
attribute numbers w.r.t. the root partition. And the
mtstate->resultRelInfo[] have attnos w.r.t. the leaf partitions. So
we need to do something similar to map_partition_varattnos() to change
the updated columns' attnos to the leaf partitions and walk down the
partition constraint expressions to find if the attnos are present
there.
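[Editor's note] The planned check can be sketched with a plain 64-bit mask standing in for PostgreSQL's Bitmapset; the struct and function names here are inventions of this sketch, not the patch's code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the planned ExecInitModifyTable() check: tuple routing setup
 * is needed only if some updated column is referenced by some result
 * relation's partition constraint quals; otherwise the UPDATE cannot
 * move rows and the routing setup can be skipped entirely.
 */
typedef struct ResultRelSketch
{
    uint64_t    partcheck_attrs;    /* attnos used in partition constraint */
} ResultRelSketch;

static bool
update_may_move_rows(uint64_t updated_cols,
                     const ResultRelSketch *rels, int nrels)
{
    for (int i = 0; i < nrels; i++)
    {
        if (updated_cols & rels[i].partcheck_attrs)
            return true;            /* a partition key column is updated */
    }
    return false;                   /* plain update; no routing needed */
}
```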

Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

trigger_related_changes.patchapplication/octet-stream; name=trigger_related_changes.patchDownload
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index cf555fe..be57d3e 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -434,7 +434,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * Check the constraints of the tuple
 		 */
 		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -624,6 +624,8 @@ ExecDelete(ItemPointer tupleid,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool   *concurrently_deleted,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -632,6 +634,9 @@ ExecDelete(ItemPointer tupleid,
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
 
+	if (concurrently_deleted)
+		*concurrently_deleted = false;
+
 	/*
 	 * get information on the (current) result relation
 	 */
@@ -775,6 +780,8 @@ ldelete:;
 					}
 				}
 				/* tuple already deleted; nothing to do */
+				if (concurrently_deleted)
+					*concurrently_deleted = true;
 				return NULL;
 
 			default:
@@ -798,8 +805,8 @@ ldelete:;
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -877,7 +884,8 @@ ldelete:;
  * ----------------------------------------------------------------
  */
 static TupleTableSlot *
-ExecUpdate(ItemPointer tupleid,
+ExecUpdate(ModifyTableState *mtstate,
+		   ItemPointer tupleid,
 		   HeapTuple oldtuple,
 		   TupleTableSlot *slot,
 		   TupleTableSlot *planSlot,
@@ -891,6 +899,8 @@ ExecUpdate(ItemPointer tupleid,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	bool		partition_check_passed = true;
+	bool		has_br_trigger;
 
 	/*
 	 * abort the operation if not running transactions
@@ -911,16 +921,56 @@ ExecUpdate(ItemPointer tupleid,
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
 
 	/* BEFORE ROW UPDATE Triggers */
-	if (resultRelInfo->ri_TrigDesc &&
-		resultRelInfo->ri_TrigDesc->trig_update_before_row)
+	has_br_trigger = (resultRelInfo->ri_TrigDesc &&
+					  resultRelInfo->ri_TrigDesc->trig_update_before_row);
+
+	if (has_br_trigger)
 	{
-		slot = ExecBRUpdateTriggers(estate, epqstate, resultRelInfo,
-									tupleid, oldtuple, slot);
+		TupleTableSlot *trig_slot;
 
-		if (slot == NULL)		/* "do nothing" */
+		trig_slot = ExecBRUpdateTriggers(estate, epqstate, resultRelInfo,
+										 tupleid, oldtuple, slot);
+
+		if (trig_slot == NULL)		/* "do nothing" */
 			return NULL;
 
+		if (resultRelInfo->ri_PartitionCheck)
+		{
+			bool		partition_check_passed_with_trig_tuple;
+
+			partition_check_passed =
+				(resultRelInfo->ri_PartitionCheck &&
+				 ExecPartitionCheck(resultRelInfo, slot, estate));
+
+			partition_check_passed_with_trig_tuple =
+				(resultRelInfo->ri_PartitionCheck &&
+				 ExecPartitionCheck(resultRelInfo, trig_slot, estate));
+
+			if (partition_check_passed)
+			{
+				/*
+				 * If it's the trigger that is causing partition constraint
+				 * violation, abort. We don't want a trigger to cause tuple
+				 * routing.
+				 */
+				if (!partition_check_passed_with_trig_tuple)
+					ExecPartitionCheckEmitError(resultRelInfo,
+												trig_slot, estate);
+			}
+			else
+			{
+				/*
+				 * Partition constraint failed with original NEW tuple. But the
+				 * trigger might even have modifed the tuple such that it fits
+				 * back into the partition. So partition constraint check
+				 * should be based on *final* NEW tuple.
+				 */
+				partition_check_passed = partition_check_passed_with_trig_tuple;
+			}
+		}
+
 		/* trigger might have changed tuple */
+		slot = trig_slot;
 		tuple = ExecMaterializeSlot(slot);
 	}
 
@@ -987,12 +1037,48 @@ lreplace:;
 								 resultRelInfo, slot, estate);
 
 		/*
+		 * If a partition check fails, try to move the row into the right
+		 * partition. With a BR trigger, the tuple has already gone through EPQ
+		 * and has been locked; so it won't change again. So, avoid an extra
+		 * partition check if we already did it above in the presence of BR
+		 * triggers.
+		 */
+		if (!has_br_trigger)
+		{
+			partition_check_passed =
+				(!resultRelInfo->ri_PartitionCheck ||
+				ExecPartitionCheck(resultRelInfo, slot, estate));
+		}
+
+		if (!partition_check_passed)
+		{
+			bool	concurrently_deleted;
+
+			Assert(mtstate->mt_partition_dispatch_info != NULL);
+
+			/*
+			 * Skip RETURNING processing for DELETE. We want to return rows
+			 * from INSERT.
+			 */
+			ExecDelete(tupleid, oldtuple, planSlot, epqstate, estate,
+					   &concurrently_deleted, false, false);
+
+			if (concurrently_deleted)
+				return NULL;
+
+			return ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+		}
+
+		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We have already checked partition constraints above, so skip them
+		 * below.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1312,7 +1398,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 	 */
 
 	/* Execute UPDATE with projection */
-	*returning = ExecUpdate(&tuple.t_self, NULL,
+	*returning = ExecUpdate(mtstate, &tuple.t_self, NULL,
 							mtstate->mt_conflproj, planSlot,
 							&mtstate->mt_epqstate, mtstate->ps.state,
 							canSetTag);
@@ -1602,12 +1688,13 @@ ExecModifyTable(ModifyTableState *node)
 								  estate, node->canSetTag);
 				break;
 			case CMD_UPDATE:
-				slot = ExecUpdate(tupleid, oldtuple, slot, planSlot,
+				slot = ExecUpdate(node, tupleid, oldtuple, slot, planSlot,
 								&node->mt_epqstate, estate, node->canSetTag);
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(tupleid, oldtuple, planSlot,
-								&node->mt_epqstate, estate, node->canSetTag);
+								&node->mt_epqstate, estate,
+								NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
#87Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#86)
Re: UPDATE of partition key

On 7 June 2017 at 16:42, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

The column bitmap set returned by GetUpdatedColumns() refers to
attribute numbers w.r.t. the root partition. And the
mtstate->resultRelInfo[] have attnos w.r.t. the leaf partitions. So
we need to do something similar to map_partition_varattnos() to change
the updated columns' attnos to the leaf partitions

I was wrong about this. Each of the mtstate->resultRelInfo[] has its
own corresponding RangeTblEntry, with its own updatedCols having attnos
adjusted accordingly to refer to its own table's attributes. So we don't
have to do the mapping; we need to get modifiedCols separately for each
of the ResultRelInfo, rather than from the root relinfo.

and walk down the
partition constraint expressions to find if the attnos are present
there.

But this we will need to do.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


#88Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#85)
Re: UPDATE of partition key

On Wed, Jun 7, 2017 at 5:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

As far as I understand, it is to ensure that for deleted rows, nothing
more needs to be done. For example, see the below check in
ExecUpdate/ExecDelete.
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
..
}
..

Also a similar check in ExecLockRows. Now for deleted rows, if the
t_ctid wouldn't point to itself, then in the mentioned functions, we
were not in a position to conclude that the row is deleted.

Right, so we would have to find all such checks and change them to use
some other method to conclude that the row is deleted. What method
would we use?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#89Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#88)
Re: UPDATE of partition key

On Thu, Jun 8, 2017 at 1:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jun 7, 2017 at 5:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

As far as I understand, it is to ensure that for deleted rows, nothing
more needs to be done. For example, see the below check in
ExecUpdate/ExecDelete.
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
..
}
..

Also a similar check in ExecLockRows. Now for deleted rows, if the
t_ctid wouldn't point to itself, then in the mentioned functions, we
were not in a position to conclude that the row is deleted.

Right, so we would have to find all such checks and change them to use
some other method to conclude that the row is deleted. What method
would we use?

I think before doing the above check we can simply check whether
ctid.ip_blkid contains InvalidBlockNumber, and if so return an error.
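[Editor's note] A minimal sketch of the proposal, with a simplified stand-in for ItemPointerData (the sentinel value matches PostgreSQL's InvalidBlockNumber, but the types and function name are assumptions of this sketch):

```c
#include <stdbool.h>
#include <stdint.h>

#define InvalidBlockNumber  0xFFFFFFFFu /* same sentinel value PostgreSQL uses */

/* Simplified stand-in for ItemPointerData; the layout is an assumption. */
typedef struct ItemPointerData
{
    uint32_t    ip_blkid;
    uint16_t    ip_posid;
} ItemPointerData;

/*
 * Sketch of the proposal: a tuple deleted as part of cross-partition row
 * movement would carry InvalidBlockNumber in t_ctid.ip_blkid, and callers
 * such as ExecUpdate/ExecDelete/ExecLockRows would test for it (and raise
 * an error) before comparing tupleid against hufd.ctid.
 */
static bool
tuple_moved_to_another_partition(ItemPointerData t_ctid)
{
    return t_ctid.ip_blkid == InvalidBlockNumber;
}
```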

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#90Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#89)
Re: UPDATE of partition key

On Thu, Jun 8, 2017 at 7:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jun 8, 2017 at 1:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jun 7, 2017 at 5:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

As far as I understand, it is to ensure that for deleted rows, nothing
more needs to be done. For example, see the below check in
ExecUpdate/ExecDelete.
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
..
}
..

Also a similar check in ExecLockRows. Now for deleted rows, if the
t_ctid wouldn't point to itself, then in the mentioned functions, we
were not in a position to conclude that the row is deleted.

Right, so we would have to find all such checks and change them to use
some other method to conclude that the row is deleted. What method
would we use?

I think before doing the above check we can simply check whether
ctid.ip_blkid contains InvalidBlockNumber, and if so return an error.

Hmm, OK. That case never happens today?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#91Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#87)
1 attachment(s)
Re: UPDATE of partition key

On 7 June 2017 at 20:19, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 7 June 2017 at 16:42, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

The column bitmap set returned by GetUpdatedColumns() refers to
attribute numbers w.r.t. the root partition. And the
mtstate->resultRelInfo[] have attnos w.r.t. the leaf partitions. So
we need to do something similar to map_partition_varattnos() to change
the updated columns' attnos to the leaf partitions

I was wrong about this. Each of the mtstate->resultRelInfo[] has its
own corresponding RangeTblEntry, with its own updatedCols having attnos
adjusted accordingly to refer to its own table's attributes. So we don't
have to do the mapping; we need to get modifiedCols separately for each
of the ResultRelInfo, rather than from the root relinfo.

and walk down the
partition constraint expressions to find if the attnos are present
there.

But this we will need to do.

Attached is v9 patch. This covers the two parts discussed upthread :
1. Prevent triggers from causing the row movement.
2. Setup the tuple routing in ExecInitModifyTable(), but only if a
partition key is modified. Check new function IsPartitionKeyUpdate().

Have rebased the patch to consider changes done in commit
15ce775faa428dc9 to prevent triggers from violating partition
constraints. There, for the call to ExecFindPartition() in ExecInsert,
we need to fetch the mtstate->rootResultRelInfo in case the operation
is part of update row movement. This is because the root partition is
not available in the resultRelInfo for UPDATE.

Added many more test scenarios in update.sql that cover the above.

I am yet to test the concurrency part using isolation tester.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

update-partition-key_v9.patchapplication/octet-stream; name=update-partition-key_v9.patchDownload
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index ec015e9..9a46c1e 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -2992,6 +2992,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition where this row satisfies its partition constraint.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3284,9 +3289,20 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</> causes a row to move from one partition to
+       another, there is a chance that another concurrent <command>UPDATE</> or
+       <command>DELETE</> misses this row. Suppose, during the row movement,
+       the row is still visible for the concurrent session, and it is about to
+       do an <command>UPDATE</> or <command>DELETE</> operation on the same
+       row. This DML operation can silently miss this row if the row now gets
+       deleted from the partition by the first session as part of its
+       <command>UPDATE</> row movement. In such case, the concurrent
+       <command>UPDATE</>/<command>DELETE</>, being unaware of the row
+       movement, interprets that the row has just been deleted so there is
+       nothing to be done for this row. Whereas, in the usual case where the
+       table is not partitioned, or where there is no row movement, the second
+       session would have identified the newly updated row and carried
+       <command>UPDATE</>/<command>DELETE</> on this new row version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 8a1619f..28cfc1a 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In
+   that case, if there is some other partition in the partition tree for
+   which this row satisfies its partition constraint, the row is moved to
+   that partition. If there is no such partition, an error will occur. The
+   error will also occur when updating a partition directly. Behind the
+   scenes, the row movement is actually a <command>DELETE</> and
+   <command>INSERT</> operation. However, a concurrent
+   <command>UPDATE</> or <command>DELETE</> on the same row may miss this
+   row. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index 950245d..72300a0 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -160,6 +160,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it will be performed as a
+    <command>DELETE</command> from the original partition followed by
+    <command>INSERT</command> into the new partition. In this case, all
+    row-level <literal>BEFORE</> <command>UPDATE</command> triggers and all
+    row-level <literal>BEFORE</> <command>DELETE</command> triggers are fired
+    on the original partition. Then all row-level <literal>BEFORE</>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</>
+    triggers are concerned, <literal>AFTER</> <command>DELETE</command> and
+    <literal>AFTER</> <command>INSERT</command> triggers are applied; but
+    <literal>AFTER</> <command>UPDATE</command> triggers are not applied
+    because the <command>UPDATE</command> has been converted to a
+    <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 0a33c40..a2d84ed 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2658,7 +2658,7 @@ CopyFrom(CopyState cstate)
 
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr || check_partition_constr)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 3caeeac..b481e67 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -103,8 +103,6 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
 									 int maxfieldlen);
 static void EvalPlanQualStart(EPQState *epqstate, EState *parentestate,
 				  Plan *planTree);
-static void ExecPartitionCheck(ResultRelInfo *resultRelInfo,
-				  TupleTableSlot *slot, EState *estate);
 
 /*
  * Note that GetUpdatedColumns() also exists in commands/trigger.c.  There does
@@ -1823,15 +1821,10 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
 /*
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  */
-static void
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
-	Relation	rel = resultRelInfo->ri_RelationDesc;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	Bitmapset  *modifiedCols;
-	Bitmapset  *insertedCols;
-	Bitmapset  *updatedCols;
 	ExprContext *econtext;
 
 	/*
@@ -1859,51 +1852,65 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
 	 */
-	if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+	return ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext);
+}
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
-		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
+
+	/* See the comments in ExecConstraints. */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
 								 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+		if (map != NULL)
+		{
+			tuple = do_convert_tuple(tuple, map);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-		  errmsg("new row for relation \"%s\" violates partition constraint",
-				 RelationGetRelationName(orig_rel)),
-			val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+	  errmsg("new row for relation \"%s\" violates partition constraint",
+			 RelationGetRelationName(orig_rel)),
+		val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if requested,
+ * checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1911,7 +1918,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2024,8 +2032,9 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 }
 
 
@@ -3312,8 +3321,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple it if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 	result = get_partition_for_tuple(pd, slot, estate,
 									 &failed_at, &failed_slot);
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index c6a66b6..7e82482 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -389,7 +389,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -448,7 +448,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index bf26488..f3995f5 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -45,6 +45,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -53,6 +54,8 @@
 #include "utils/rel.h"
 #include "utils/tqual.h"
 
+#define GetUpdatedColumns(relinfo, estate) \
+	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)
 
 static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 ResultRelInfo *resultRelInfo,
@@ -281,6 +284,14 @@ ExecInsert(ModifyTableState *mtstate,
 	{
 		int			leaf_part_index;
 		TupleConversionMap *map;
+		ResultRelInfo *rootResultRelInfo;
+
+		/*
+		 * If the original operation is UPDATE, the root partition rel needs to
+		 * be fetched from mtstate->rootResultRelInfo.
+		 */
+		rootResultRelInfo = (mtstate->rootResultRelInfo ?
+							 mtstate->rootResultRelInfo : resultRelInfo);
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
@@ -290,7 +301,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
-		leaf_part_index = ExecFindPartition(resultRelInfo,
+		leaf_part_index = ExecFindPartition(rootResultRelInfo,
 										 mtstate->mt_partition_dispatch_info,
 											slot,
 											estate);
@@ -451,7 +462,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 		/* Check the constraints of the tuple */
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -641,6 +652,8 @@ ExecDelete(ItemPointer tupleid,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool   *concurrently_deleted,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -649,6 +662,9 @@ ExecDelete(ItemPointer tupleid,
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
 
+	if (concurrently_deleted)
+		*concurrently_deleted = false;
+
 	/*
 	 * get information on the (current) result relation
 	 */
@@ -792,6 +808,8 @@ ldelete:;
 					}
 				}
 				/* tuple already deleted; nothing to do */
+				if (concurrently_deleted)
+					*concurrently_deleted = true;
 				return NULL;
 
 			default:
@@ -815,8 +833,8 @@ ldelete:;
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -894,7 +912,8 @@ ldelete:;
  * ----------------------------------------------------------------
  */
 static TupleTableSlot *
-ExecUpdate(ItemPointer tupleid,
+ExecUpdate(ModifyTableState *mtstate,
+		   ItemPointer tupleid,
 		   HeapTuple oldtuple,
 		   TupleTableSlot *slot,
 		   TupleTableSlot *planSlot,
@@ -908,6 +927,8 @@ ExecUpdate(ItemPointer tupleid,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	bool		partition_check_passed = true;
+	bool		has_br_trigger;
 
 	/*
 	 * abort the operation if not running transactions
@@ -928,16 +949,56 @@ ExecUpdate(ItemPointer tupleid,
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
 
 	/* BEFORE ROW UPDATE Triggers */
-	if (resultRelInfo->ri_TrigDesc &&
-		resultRelInfo->ri_TrigDesc->trig_update_before_row)
+	has_br_trigger = (resultRelInfo->ri_TrigDesc &&
+					  resultRelInfo->ri_TrigDesc->trig_update_before_row);
+
+	if (has_br_trigger)
 	{
-		slot = ExecBRUpdateTriggers(estate, epqstate, resultRelInfo,
-									tupleid, oldtuple, slot);
+		TupleTableSlot *trig_slot;
 
-		if (slot == NULL)		/* "do nothing" */
+		trig_slot = ExecBRUpdateTriggers(estate, epqstate, resultRelInfo,
+										 tupleid, oldtuple, slot);
+
+		if (trig_slot == NULL)		/* "do nothing" */
 			return NULL;
 
+		if (resultRelInfo->ri_PartitionCheck)
+		{
+			bool		partition_check_passed_with_trig_tuple;
+
+			partition_check_passed =
+				(resultRelInfo->ri_PartitionCheck &&
+				 ExecPartitionCheck(resultRelInfo, slot, estate));
+
+			partition_check_passed_with_trig_tuple =
+				(resultRelInfo->ri_PartitionCheck &&
+				 ExecPartitionCheck(resultRelInfo, trig_slot, estate));
+
+			if (partition_check_passed)
+			{
+				/*
+				 * If it's the trigger that is causing partition constraint
+				 * violation, abort. We don't want a trigger to cause tuple
+				 * routing.
+				 */
+				if (!partition_check_passed_with_trig_tuple)
+					ExecPartitionCheckEmitError(resultRelInfo,
+												trig_slot, estate);
+			}
+			else
+			{
+				/*
+				 * Partition constraint failed with original NEW tuple. But the
+				 * trigger might even have modified the tuple such that it fits
+				 * back into the partition. So partition constraint check
+				 * should be based on *final* NEW tuple.
+				 */
+				partition_check_passed = partition_check_passed_with_trig_tuple;
+			}
+		}
+
 		/* trigger might have changed tuple */
+		slot = trig_slot;
 		tuple = ExecMaterializeSlot(slot);
 	}
 
@@ -1004,12 +1065,60 @@ lreplace:;
 								 resultRelInfo, slot, estate);
 
 		/*
+		 * If the partition check fails, try to move the row into the right
+		 * partition. When there is a BR trigger, the tuple has already gone
+		 * through EPQ and has been locked, so it won't change again; in that
+		 * case, avoid repeating the partition check that was already done
+		 * above.
+		 */
+		if (!has_br_trigger)
+		{
+			partition_check_passed =
+				(!resultRelInfo->ri_PartitionCheck ||
+				ExecPartitionCheck(resultRelInfo, slot, estate));
+		}
+
+		if (!partition_check_passed)
+		{
+			bool	concurrently_deleted;
+
+			/*
+			 * When an UPDATE is run on a leaf partition directly, we do not
+			 * have partition tuple routing set up. In that case, fail with a
+			 * partition constraint violation error.
+			 */
+			if (mtstate->mt_partition_dispatch_info == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for DELETE. We want to return rows
+			 * from INSERT.
+			 */
+			ExecDelete(tupleid, oldtuple, planSlot, epqstate, estate,
+					   &concurrently_deleted, false, false);
+
+			/*
+			 * The row was already deleted by a concurrent DELETE. So we don't
+			 * have anything to update.
+			 */
+			if (concurrently_deleted)
+				return NULL;
+
+			return ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+		}
+
+		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We have already checked partition constraints above, so skip them
+		 * below.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1329,7 +1438,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 	 */
 
 	/* Execute UPDATE with projection */
-	*returning = ExecUpdate(&tuple.t_self, NULL,
+	*returning = ExecUpdate(mtstate, &tuple.t_self, NULL,
 							mtstate->mt_conflproj, planSlot,
 							&mtstate->mt_epqstate, mtstate->ps.state,
 							canSetTag);
@@ -1411,6 +1520,35 @@ fireASTriggers(ModifyTableState *node)
 	}
 }
 
+/*
+ * Check whether partition key is modified for any of the relations.
+ */
+static bool
+IsPartitionKeyUpdate(EState *estate, ResultRelInfo *result_rels, int num_rels)
+{
+	int		i;
+
+	/*
+	 * Each of the result relations has its set of updated columns stored
+	 * according to its own column ordering. So, for each relation, pull the
+	 * attnos of the Vars in its partition quals, and check whether any of
+	 * the updated column attributes are present among them.
+	 */
+	for (i = 0; i < num_rels; i++)
+	{
+		ResultRelInfo *resultRelInfo = &result_rels[i];
+		Relation		rel = resultRelInfo->ri_RelationDesc;
+		Bitmapset	  *expr_attrs = NULL;
+
+		pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs);
+
+		/* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */
+		if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate)))
+			return true;
+	}
+
+	return false;
+}
 
 /* ----------------------------------------------------------------
  *	   ExecModifyTable
@@ -1619,12 +1757,13 @@ ExecModifyTable(ModifyTableState *node)
 								  estate, node->canSetTag);
 				break;
 			case CMD_UPDATE:
-				slot = ExecUpdate(tupleid, oldtuple, slot, planSlot,
+				slot = ExecUpdate(node, tupleid, oldtuple, slot, planSlot,
 								&node->mt_epqstate, estate, node->canSetTag);
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(tupleid, oldtuple, planSlot,
-								&node->mt_epqstate, estate, node->canSetTag);
+								&node->mt_epqstate, estate,
+								NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1780,9 +1919,14 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/*
+	 * Build state for tuple routing if it's an INSERT or if it's an UPDATE of
+	 * partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT ||
+		 (operation == CMD_UPDATE &&
+		  IsPartitionKeyUpdate(estate, mtstate->resultRelInfo, nplans))))
 	{
 		PartitionDispatch *partition_dispatch_info;
 		ResultRelInfo *partitions;
@@ -1842,7 +1986,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		List	   *wcoList;
 
-		Assert(operation == CMD_INSERT);
+		Assert(operation == CMD_INSERT || operation == CMD_UPDATE);
 		resultRelInfo = mtstate->mt_partitions;
 		wcoList = linitial(node->withCheckOptionLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 8cc5f3a..9dd67c9 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,7 +187,10 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
@@ -216,6 +219,9 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate);
 
 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
 extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 9366f04..170c448 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,25 +198,175 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c int
 ) partition by range (a, b);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE c > 120 WITH CHECK OPTION;
 create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
 create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
+create table part_c_1_100 partition of part_b_10_b_20 for values from (1) to (100);
+create table part_c_100_200 partition of part_b_10_b_20 for values from (100) to (200);
 insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
+insert into part_a_10_a_20 values ('a', 10, 200);
+insert into part_c_1_100 values ('b', 12, 96);
+insert into part_c_1_100 values ('b', 13, 97);
+insert into part_c_100_200 values ('b', 15, 105);
+insert into part_c_100_200 values ('b', 17, 105);
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96;
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (b, 12, 116).
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 ;
+select * from part_c_1_100 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select * from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 12 | 116
+ b | 13 | 117
+ b | 15 | 125
+ b | 17 | 125
+(4 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
 ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+DETAIL:  Failing row contains (b, 7, 117).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+select tableoid::regclass partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_1_a_10  | a |  1 |    
+ part_a_1_a_10  | a |  4 | 200
+ part_b_1_b_10  | b |  7 | 117
+ part_b_1_b_10  | b |  9 | 125
+ part_c_100_200 | b | 11 | 125
+ part_c_100_200 | b | 12 | 116
+(6 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120).
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+select tableoid::regclass partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_1_a_10  | a |  1 |    
+ part_b_1_b_10  | b |  7 | 117
+ part_b_1_b_10  | b |  9 | 125
+ part_c_100_200 | b | 11 | 125
+ part_c_100_200 | b | 12 | 116
+ part_c_100_200 | b | 15 | 199
+(6 rows)
+
 -- cleanup
+drop view upview;
 drop table range_parted;
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+-- Setup
+--------
+create table list_parted (a int, b int, c int) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+create table sub_part1(b int, c int, a int);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int, a int);
+alter table sub_parted attach partition sub_part2 for values in (2);
+create table list_part1(a int, b int, c int);
+alter table list_parted attach partition list_part1 for values in (2,3);
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+ERROR:  new row for relation "sub_parted" violates partition constraint
+DETAIL:  Failing row contains (2, 10, 2).
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Triggers should not be allowed to initiate the update row movement
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- THis is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+select tableoid::regclass , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+-- This should fail because trigger on sub_part1 would change column 'b' which
+-- would violate "b in (1)" constraint.
+update list_parted set c = 70 where b  = 1 ;
+ERROR:  new row for relation "sub_part1" violates partition constraint
+DETAIL:  Failing row contains (2, 70, 1).
+drop trigger parted_mod_b ON sub_part1 ;
+-- Now that the trigger is dropped, the same update should succeed
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop function func_parted_mod_b ( ) ;
+drop table list_parted;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 6637119..2f9bad0 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,23 +107,121 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c int
 ) partition by range (a, b);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE c > 120 WITH CHECK OPTION;
+
 create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
 create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
+create table part_c_1_100 partition of part_b_10_b_20 for values from (1) to (100);
+create table part_c_100_200 partition of part_b_10_b_20 for values from (100) to (200);
 insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
-
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
+insert into part_a_10_a_20 values ('a', 10, 200);
+insert into part_c_1_100 values ('b', 12, 96);
+insert into part_c_1_100 values ('b', 13, 97);
+insert into part_c_100_200 values ('b', 15, 105);
+insert into part_c_100_200 values ('b', 17, 105);
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96;
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 ;
+select * from part_c_1_100 order by 1, 2, 3;
+select * from part_c_100_200 order by 1, 2, 3;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+select tableoid::regclass partname, * from range_parted order by 1, 2, 3, 4;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+
+select tableoid::regclass partname, * from range_parted order by 1, 2, 3, 4;
 
 -- cleanup
+drop view upview;
 drop table range_parted;
+
+
+
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+
+-- Setup
+--------
+create table list_parted (a int, b int, c int) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+
+create table sub_part1(b int, c int, a int);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int, a int);
+alter table sub_parted attach partition sub_part2 for values in (2);
+
+create table list_part1(a int, b int, c int);
+alter table list_parted attach partition list_part1 for values in (2,3);
+
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+
+
+-----------
+-- Triggers should not be allowed to initiate the update row movement
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing the partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+select tableoid::regclass , * from list_parted order by 1, 2, 3, 4;
+
+-- This should fail because trigger on sub_part1 would change column 'b' which
+-- would violate "b in (1)" constraint.
+update list_parted set c = 70 where b  = 1 ;
+drop trigger parted_mod_b ON sub_part1 ;
+-- Now that the trigger is dropped, the same update should succeed
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass , * from list_parted order by 1, 2, 3, 4;
+
+drop function func_parted_mod_b ( ) ;
+drop table list_parted;
+
+
#92Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#90)
Re: UPDATE of partition key

On Thu, Jun 8, 2017 at 10:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jun 8, 2017 at 7:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jun 8, 2017 at 1:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jun 7, 2017 at 5:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

As far as I understand, it is to ensure that for deleted rows, nothing
more needs to be done. For example, see the below check in
ExecUpdate/ExecDelete.
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
..
}
..

Also a similar check in ExecLockRows. Now for deleted rows, if the
t_ctid wouldn't point to itself, then in the mentioned functions, we
were not in a position to conclude that the row is deleted.

Right, so we would have to find all such checks and change them to use
some other method to conclude that the row is deleted. What method
would we use?

I think before doing above check we can simply check if ctid.ip_blkid
contains InvalidBlockNumber, then return an error.

Hmm, OK. That case never happens today?

As per my understanding that case doesn't exist. I will verify again
once the patch is available. I can take a crack at it if Amit
Khandekar is busy with something else or is not comfortable in this
area.
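[Editor's illustration] The idea quoted above — storing InvalidBlockNumber in the moved row's ctid so that chain-following code can distinguish "moved to another partition" from an ordinary delete — can be sketched with a toy model. This is Python purely for illustration, not the C code in ExecUpdate/ExecDelete; the struct, field names, and outcomes are simplified stand-ins:

```python
# Toy model of the proposed check: a row deleted by partition-key row
# movement gets InvalidBlockNumber in its ctid, so sessions that follow
# update chains can error out instead of silently skipping the row.

INVALID_BLOCK_NUMBER = 0xFFFFFFFF   # stand-in for InvalidBlockNumber


class ItemPointer:
    """Minimal stand-in for ItemPointerData (block number, line offset)."""

    def __init__(self, ip_blkid, ip_posid):
        self.ip_blkid = ip_blkid
        self.ip_posid = ip_posid

    def __eq__(self, other):
        return (self.ip_blkid, self.ip_posid) == \
               (other.ip_blkid, other.ip_posid)


def classify_outcome(tupleid, new_ctid):
    """Mimic the decision made after a concurrent-update conflict."""
    if new_ctid.ip_blkid == INVALID_BLOCK_NUMBER:
        # Proposed new case: the row was moved to another partition, so
        # the concurrent session would raise an error for this row.
        return "moved"
    if tupleid == new_ctid:
        return "deleted"        # t_ctid points to itself: plain delete
    return "updated"            # follow the chain to the new version


assert classify_outcome(ItemPointer(5, 3), ItemPointer(5, 3)) == "deleted"
assert classify_outcome(ItemPointer(5, 3), ItemPointer(6, 1)) == "updated"
assert classify_outcome(ItemPointer(5, 3),
                        ItemPointer(INVALID_BLOCK_NUMBER, 1)) == "moved"
```

The point of the sketch is only the three-way branch: today's code has just the first two outcomes, and the proposal adds the "moved" case ahead of the self-pointing-ctid check.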

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#93Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Kapila (#92)
Re: UPDATE of partition key

On 9 June 2017 at 19:10, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jun 8, 2017 at 10:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jun 8, 2017 at 7:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jun 8, 2017 at 1:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jun 7, 2017 at 5:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

As far as I understand, it is to ensure that for deleted rows, nothing
more needs to be done. For example, see the below check in
ExecUpdate/ExecDelete.
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
..
}
..

Also a similar check in ExecLockRows. Now for deleted rows, if the
t_ctid wouldn't point to itself, then in the mentioned functions, we
were not in a position to conclude that the row is deleted.

Right, so we would have to find all such checks and change them to use
some other method to conclude that the row is deleted. What method
would we use?

I think before doing above check we can simply check if ctid.ip_blkid
contains InvalidBlockNumber, then return an error.

Hmm, OK. That case never happens today?

As per my understanding that case doesn't exist. I will verify again
once the patch is available. I can take a crack at it if Amit
Khandekar is busy with something else or is not comfortable in this
area.

Amit, I was going to have a look at this, once I finish with the other
part. I was busy on getting that done first. But your comments/help
are always welcome.


--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


#94Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Khandekar (#93)
Re: UPDATE of partition key

On Fri, Jun 9, 2017 at 7:48 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 9 June 2017 at 19:10, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jun 8, 2017 at 10:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jun 8, 2017 at 7:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think before doing above check we can simply check if ctid.ip_blkid
contains InvalidBlockNumber, then return an error.

Hmm, OK. That case never happens today?

As per my understanding that case doesn't exist. I will verify again
once the patch is available. I can take a crack at it if Amit
Khandekar is busy with something else or is not comfortable in this
area.

Amit, I was going to have a look at this, once I finish with the other
part.

Sure, will wait for your patch to be available. I can help by
reviewing the same.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#95Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#91)
Re: UPDATE of partition key

While rebasing my patch for the recent commit below, I realized that a
similar issue exists for the update-tuple-routing patch as well:

commit 78a030a441966d91bc7e932ef84da39c3ea7d970
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon Jun 12 23:29:44 2017 -0400

Fix confusion about number of subplans in partitioned INSERT setup.

The above issue was about incorrectly using 'i' to index
mtstate->mt_plans[i] while handling WITH CHECK OPTIONS in
ExecInitModifyTable(), where 'i' actually iterates over the
mtstate->mt_num_partitions partitions. For INSERT, there is only a
single plan element in the mtstate->mt_plans[] array.

Similarly, for update-tuple routing, we cannot use
mtstate->mt_plans[i], because 'i' refers to a position in
mtstate->mt_partitions[], and mtstate->mt_plans is not in the order of
mtstate->mt_partitions; in fact, mt_plans contains only the plans for
the partitions that survive pruning, so it can well be a small subset
of the total partitions.

I am working on an updated patch to fix the above.


#96Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#95)
1 attachment(s)
Re: UPDATE of partition key

On 13 June 2017 at 15:40, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

While rebasing my patch for the recent commit below, I realized that a
similar issue exists for the update-tuple-routing patch as well:

commit 78a030a441966d91bc7e932ef84da39c3ea7d970
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon Jun 12 23:29:44 2017 -0400

Fix confusion about number of subplans in partitioned INSERT setup.

The above issue was about incorrectly using 'i' to index
mtstate->mt_plans[i] while handling WITH CHECK OPTIONS in
ExecInitModifyTable(), where 'i' actually iterates over the
mtstate->mt_num_partitions partitions. For INSERT, there is only a
single plan element in the mtstate->mt_plans[] array.

Similarly, for update-tuple routing, we cannot use
mtstate->mt_plans[i], because 'i' refers to a position in
mtstate->mt_partitions[], and mtstate->mt_plans is not in the order of
mtstate->mt_partitions; in fact, mt_plans contains only the plans for
the partitions that survive pruning, so it can well be a small subset
of the total partitions.

I am working on an updated patch to fix the above.

Attached patch v10 fixes the above. The existing code builds WCO
constraints for each leaf partition; with the patch, that code now
applies to row-movement updates as well, so the assertions in the code
are updated to allow that. Secondly, the mapping for each of the leaf
partitions was constructed using the root partition's attributes; in
the patch, mtstate->resultRelInfo[0] (i.e. the first resultRelInfo) is
used as the reference instead. So effectively, map_partition_varattnos()
now represents not just a parent-to-partition mapping, but a mapping
between any two partitions/partitioned tables. It is done this way so
that the WCO-building code can be shared between inserts and updates.
For example, for inserts, the first (and only) WCO belongs to
node->nominalRelation, so nominalRelation is used for
map_partition_varattnos(), whereas for updates, the first WCO belongs
to the first resultRelInfo, which is not the same as nominalRelation.
So in the patch, in both cases, we use the first resultRelInfo and the
WCO of the first resultRelInfo for map_partition_varattnos().

The same is done for the RETURNING expressions.
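[Editor's illustration] The kind of mapping map_partition_varattnos() performs between two relations in the partition tree rests on a name-based attribute-number map (convert_tuples_by_name_map in the C code). Here is a toy Python model of that map; the column sets and orderings are invented, and this is not the patch's actual code:

```python
# Toy model of the name-based attno map: for each column of the
# destination relation, find the 1-based attno of the same-named column
# in the source relation. Physical column order can differ between a
# parent and a partition, e.g. after ATTACH PARTITION of a pre-existing
# table.

def attno_map(to_desc, from_desc):
    """Return m where m[i] is from_desc's 1-based attno of to_desc's
    (i+1)-th attribute."""
    from_attnos = {name: i + 1 for i, name in enumerate(from_desc)}
    return [from_attnos[name] for name in to_desc]


root = ["a", "b", "c"]        # root partitioned table
leaf = ["b", "c", "a"]        # leaf with a different physical order

m = attno_map(leaf, root)
assert m == [2, 3, 1]

# Applying the map converts a root-ordered tuple to the leaf's order.
row = (1, 5, 50)              # (a, b, c) in root order
converted = tuple(row[attno - 1] for attno in m)
assert converted == (5, 50, 1)  # (b, c, a) in leaf order
```

Because the map is keyed purely by column name, nothing restricts it to a parent/child pair — any two relations in the tree with the same column names can be mapped, which is what the generalized map_partition_varattnos() exploits.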

---------

Another change in the patch: for ExecInitQual() on WCO quals,
mtstate->ps is used as the parent, rather than the first plan. For
updates, the first plan does not belong to the parent partition. In
fact, I think in all cases we should use mtstate->ps as the parent;
mtstate->mt_plans[0] does not look like it should be considered the
parent of these expressions. It may not matter which parent we link
these quals to, because there is no ReScan for ExecModifyTable().

Note that for RETURNING projection expressions, we do use mtstate->ps.

--------

There is another issue I discovered. Row movement works fine if the
destination leaf partition has a different attribute ordering than the
root: the existing insert-tuple-routing mapping handles that. But if
the source partition has a different ordering w.r.t. the root, there
is a problem: there is no mapping in the opposite direction, i.e. from
the leaf to the root. We require that mapping because the source leaf
partition's tuple needs to be converted to the root partition's tuple
descriptor, since ExecFindPartition() starts from the root.

To fix this, I have introduced another mapping array,
mtstate->mt_resultrel_maps[], which corresponds to
mtstate->resultRelInfo[]. We don't require a per-leaf-partition
mapping, because the update result relations are a pruned subset of
the total leaf partitions.

So in ExecInsert, before calling ExecFindPartition(), we convert the
leaf partition tuple to the root using this reverse mapping. Since we
need to convert the tuple here, and again after ExecFindPartition()
for the found leaf partition, I have replaced the common code with a
new function, ConvertPartitionTupleSlot().
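[Editor's illustration] The conversion path described here — leaf to root via the reverse map, then root to the destination leaf via the existing routing map — can be modeled with a short sketch. This is a toy Python illustration, not the patch's C code, and the column orderings below are invented:

```python
# Toy model of the row-movement conversion path: a tuple in the source
# leaf's column order is converted to the root's order (the reverse map,
# mt_resultrel_maps in the patch), routed, and then converted to the
# destination leaf's order (the existing insert-tuple-routing map).

def name_map(to_desc, from_desc):
    """Positions in from_desc of each of to_desc's columns, by name."""
    pos = {name: i for i, name in enumerate(from_desc)}
    return [pos[name] for name in to_desc]


def convert(row, m):
    """Reorder a tuple according to a name_map() result."""
    return tuple(row[i] for i in m)


root      = ["a", "b", "c"]
src_leaf  = ["b", "c", "a"]   # e.g. sub_part1 from the regression test
dest_leaf = ["c", "a", "b"]   # hypothetical destination ordering

row_in_src = (1, 60, 1)       # (b, c, a) in the source leaf's order

# leaf -> root (reverse map), then root -> destination leaf
as_root = convert(row_in_src, name_map(root, src_leaf))
assert as_root == (1, 1, 60)  # (a, b, c)
as_dest = convert(as_root, name_map(dest_leaf, root))
assert as_dest == (60, 1, 1)  # (c, a, b)
```

The first conversion is the one the patch adds: without it, ExecFindPartition() would interpret the source leaf's physically-ordered tuple using the root's descriptor and route it to the wrong partition.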

-------

A new flag, is_partitionkey_update, is used in ExecInitModifyTable()
and re-used in subsequent sections, rather than calling the
IsPartitionKeyUpdate() function again.

-------

Some more test scenarios are added to cover the above changes,
basically partitions that have different tuple descriptors than their
parents.

Attachments:

update-partition-key_v10.patchapplication/octet-stream; name=update-partition-key_v10.patchDownload
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index ec015e9..9a46c1e 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -2992,6 +2992,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition where this row satisfies its partition constraint.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3284,9 +3289,20 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</> causes a row to move from one partition to
+       another, there is a chance that another concurrent <command>UPDATE</> or
+       <command>DELETE</> misses this row. Suppose, during the row movement,
+       the row is still visible for the concurrent session, and it is about to
+       do an <command>UPDATE</> or <command>DELETE</> operation on the same
+       row. This DML operation can silently miss this row if the row now gets
+       deleted from the partition by the first session as part of its
+       <command>UPDATE</> row movement. In such case, the concurrent
+       <command>UPDATE</>/<command>DELETE</>, being unaware of the row
+       movement, interprets that the row has just been deleted so there is
+       nothing to be done for this row. Whereas, in the usual case where the
+       table is not partitioned, or where there is no row movement, the second
+       session would have identified the newly updated row and carried
+       <command>UPDATE</>/<command>DELETE</> on this new row version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 8a1619f..28cfc1a 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</> and
+   <command>INSERT</> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</> or <command>DELETE</> on the same row may miss
+   this row. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index 950245d..72300a0 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -160,6 +160,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it will be performed as a
+    <command>DELETE</command> from the original partition followed by
+    <command>INSERT</command> into the new partition. In this case, all
+    row-level <literal>BEFORE</> <command>UPDATE</command> triggers and all
+    row-level <literal>BEFORE</> <command>DELETE</command> triggers are fired
+    on the original partition. Then all row-level <literal>BEFORE</>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</>
+    triggers are concerned, <literal>AFTER</> <command>DELETE</command> and
+    <literal>AFTER</> <command>INSERT</command> triggers are applied; but
+    <literal>AFTER</> <command>UPDATE</command> triggers are not applied
+    because the <command>UPDATE</command> has been converted to a
+    <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index a7c9b9a..cacf8fb 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -921,7 +921,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
 
 /*
  * map_partition_varattnos - maps varattno of any Vars in expr from the
- * parent attno to partition attno.
+ * attno's of 'from_rel' partition to the attno's of 'to_rel' partition.
+ * The rels can be both leaf partition or a partitioned table.
  *
  * We must allow for cases where physical attnos of a partition can be
  * different from the parent's.
@@ -931,8 +932,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
  * are working on Lists, so it's less messy to do the casts internally.
  */
 List *
-map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent)
+map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel)
 {
 	AttrNumber *part_attnos;
 	bool		found_whole_row;
@@ -940,13 +941,13 @@ map_partition_varattnos(List *expr, int target_varno,
 	if (expr == NIL)
 		return NIL;
 
-	part_attnos = convert_tuples_by_name_map(RelationGetDescr(partrel),
-											 RelationGetDescr(parent),
+	part_attnos = convert_tuples_by_name_map(RelationGetDescr(to_rel),
+											 RelationGetDescr(from_rel),
 								 gettext_noop("could not convert row type"));
 	expr = (List *) map_variable_attnos((Node *) expr,
-										target_varno, 0,
+										fromrel_varno, 0,
 										part_attnos,
-										RelationGetDescr(parent)->natts,
+										RelationGetDescr(from_rel)->natts,
 										&found_whole_row);
 	/* There can never be a whole-row reference here */
 	if (found_whole_row)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index ae79a2f..d9818b7 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2658,7 +2658,7 @@ CopyFrom(CopyState cstate)
 
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr || check_partition_constr)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 7f460bd..b29b12f 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -103,8 +103,6 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
 									 int maxfieldlen);
 static void EvalPlanQualStart(EPQState *epqstate, EState *parentestate,
 				  Plan *planTree);
-static void ExecPartitionCheck(ResultRelInfo *resultRelInfo,
-				   TupleTableSlot *slot, EState *estate);
 
 /*
  * Note that GetUpdatedColumns() also exists in commands/trigger.c.  There does
@@ -1823,15 +1821,10 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
 /*
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  */
-static void
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
-	Relation	rel = resultRelInfo->ri_RelationDesc;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	Bitmapset  *modifiedCols;
-	Bitmapset  *insertedCols;
-	Bitmapset  *updatedCols;
 	ExprContext *econtext;
 
 	/*
@@ -1859,51 +1852,65 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
 	 */
-	if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+	return ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext);
+}
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
-		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
+
+	/* See the comments in ExecConstraints. */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
 								 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+		if (map != NULL)
+		{
+			tuple = do_convert_tuple(tuple, map);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-		  errmsg("new row for relation \"%s\" violates partition constraint",
-				 RelationGetRelationName(orig_rel)),
-			val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+	  errmsg("new row for relation \"%s\" violates partition constraint",
+			 RelationGetRelationName(orig_rel)),
+		val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if requested,
+ * checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1911,7 +1918,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2024,8 +2032,9 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 }
 
 
@@ -3312,8 +3321,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple it if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 	result = get_partition_for_tuple(pd, slot, estate,
 									 &failed_at, &failed_slot);
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index c6a66b6..7e82482 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -389,7 +389,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -448,7 +448,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index ff5ad98..b0c13eb 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -45,6 +45,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -53,6 +54,8 @@
 #include "utils/rel.h"
 #include "utils/tqual.h"
 
+#define GetUpdatedColumns(relinfo, estate) \
+	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)
 
 static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 ResultRelInfo *resultRelInfo,
@@ -239,6 +242,34 @@ ExecCheckTIDVisible(EState *estate,
 	ReleaseBuffer(buffer);
 }
 
+/*
+ * ConvertPartitionTupleSlot -- convenience function for converting tuple and
+ * storing it into a dedicated partition tuple slot. Passes the partition
+ * tuple slot back into output param p_slot. If no mapping present, keeps
+ * p_slot unchanged.
+ *
+ * Returns the converted tuple.
+ */
+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate, TupleConversionMap *map,
+						  HeapTuple tuple, TupleTableSlot **p_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor, as per converted tuple.
+	 */
+	*p_slot = mtstate->mt_partition_tuple_slot;
+	Assert(*p_slot != NULL);
+	ExecSetSlotDescriptor(*p_slot, map->outdesc);
+	ExecStoreTuple(tuple, *p_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
 /* ----------------------------------------------------------------
  *		ExecInsert
  *
@@ -280,7 +311,38 @@ ExecInsert(ModifyTableState *mtstate,
 	if (mtstate->mt_partition_dispatch_info)
 	{
 		int			leaf_part_index;
-		TupleConversionMap *map;
+		ResultRelInfo *rootResultRelInfo;
+
+		/*
+		 * If the original operation is UPDATE, the root partition rel needs to
+		 * be fetched from mtstate->rootResultRelInfo.
+		 */
+		rootResultRelInfo = (mtstate->rootResultRelInfo ?
+							 mtstate->rootResultRelInfo : resultRelInfo);
+
+		/*
+		 * If the resultRelInfo is not the root partition (which happens for
+		 * UPDATE), we should convert the tuple into root partition's tuple
+		 * descriptor, since ExecFindPartition() starts the search from root.
+		 * The tuple conversion map list is in the order of
+		 * mstate->resultRelInfo[], so to retrieve the one for this resultRel,
+		 * we need to know the position of the resultRel in
+		 * mtstate->resultRelInfo[]. Note: We assume that if the resultRelInfo
+		 * does not belong to subplans, then it already matches the root tuple
+		 * descriptor; although there is no such known scenario where this
+		 * could happen.
+		 */
+		if (rootResultRelInfo != resultRelInfo &&
+			mtstate->mt_resultrel_maps != NULL &&
+			resultRelInfo >= mtstate->resultRelInfo &&
+			resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans-1)
+		{
+			int		map_index = resultRelInfo - mtstate->resultRelInfo;
+
+			tuple = ConvertPartitionTupleSlot(mtstate,
+									  mtstate->mt_resultrel_maps[map_index],
+									  tuple, &slot);
+		}
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
@@ -290,7 +352,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
-		leaf_part_index = ExecFindPartition(resultRelInfo,
+		leaf_part_index = ExecFindPartition(rootResultRelInfo,
 										 mtstate->mt_partition_dispatch_info,
 											slot,
 											estate);
@@ -317,23 +379,9 @@ ExecInsert(ModifyTableState *mtstate,
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = mtstate->mt_partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = mtstate->mt_partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(mtstate,
+						mtstate->mt_partition_tupconv_maps[leaf_part_index],
+					    tuple, &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -451,7 +499,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 		/* Check the constraints of the tuple */
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -641,6 +689,8 @@ ExecDelete(ItemPointer tupleid,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool   *concurrently_deleted,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -649,6 +699,9 @@ ExecDelete(ItemPointer tupleid,
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
 
+	if (concurrently_deleted)
+		*concurrently_deleted = false;
+
 	/*
 	 * get information on the (current) result relation
 	 */
@@ -792,6 +845,8 @@ ldelete:;
 					}
 				}
 				/* tuple already deleted; nothing to do */
+				if (concurrently_deleted)
+					*concurrently_deleted = true;
 				return NULL;
 
 			default:
@@ -815,8 +870,8 @@ ldelete:;
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -894,7 +949,8 @@ ldelete:;
  * ----------------------------------------------------------------
  */
 static TupleTableSlot *
-ExecUpdate(ItemPointer tupleid,
+ExecUpdate(ModifyTableState *mtstate,
+		   ItemPointer tupleid,
 		   HeapTuple oldtuple,
 		   TupleTableSlot *slot,
 		   TupleTableSlot *planSlot,
@@ -908,6 +964,8 @@ ExecUpdate(ItemPointer tupleid,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	bool		partition_check_passed = true;
+	bool		has_br_trigger;
 
 	/*
 	 * abort the operation if not running transactions
@@ -928,16 +986,56 @@ ExecUpdate(ItemPointer tupleid,
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
 
 	/* BEFORE ROW UPDATE Triggers */
-	if (resultRelInfo->ri_TrigDesc &&
-		resultRelInfo->ri_TrigDesc->trig_update_before_row)
+	has_br_trigger = (resultRelInfo->ri_TrigDesc &&
+					  resultRelInfo->ri_TrigDesc->trig_update_before_row);
+
+	if (has_br_trigger)
 	{
-		slot = ExecBRUpdateTriggers(estate, epqstate, resultRelInfo,
-									tupleid, oldtuple, slot);
+		TupleTableSlot *trig_slot;
 
-		if (slot == NULL)		/* "do nothing" */
+		trig_slot = ExecBRUpdateTriggers(estate, epqstate, resultRelInfo,
+										 tupleid, oldtuple, slot);
+
+		if (trig_slot == NULL)		/* "do nothing" */
 			return NULL;
 
+		if (resultRelInfo->ri_PartitionCheck)
+		{
+			bool		partition_check_passed_with_trig_tuple;
+
+			partition_check_passed =
+				ExecPartitionCheck(resultRelInfo, slot, estate);
+
+			partition_check_passed_with_trig_tuple =
+				ExecPartitionCheck(resultRelInfo, trig_slot, estate);
+
+			if (partition_check_passed)
+			{
+				/*
+				 * If it's the trigger that is causing partition constraint
+				 * violation, abort. We don't want a trigger to cause tuple
+				 * routing.
+				 */
+				if (!partition_check_passed_with_trig_tuple)
+					ExecPartitionCheckEmitError(resultRelInfo,
+												trig_slot, estate);
+			}
+			else
+			{
+				/*
+				 * The partition constraint failed with the original NEW
+				 * tuple. But the trigger might have modified the tuple such
+				 * that it fits back into the partition, so the partition
+				 * constraint check should be based on the *final* NEW tuple.
+				 */
+				partition_check_passed = partition_check_passed_with_trig_tuple;
+			}
+		}
+
 		/* trigger might have changed tuple */
+		slot = trig_slot;
 		tuple = ExecMaterializeSlot(slot);
 	}
 
@@ -1004,12 +1102,60 @@ lreplace:;
 								 resultRelInfo, slot, estate);
 
 		/*
+		 * If the partition check fails, try to move the row into the right
+		 * partition. With a BR trigger, the tuple has already gone through
+		 * EPQ and has been locked, so it won't change again; in that case,
+		 * skip the extra partition check here, since we already did it above
+		 * in the presence of BR triggers.
+		 */
+		if (!has_br_trigger)
+		{
+			partition_check_passed =
+				(!resultRelInfo->ri_PartitionCheck ||
+				ExecPartitionCheck(resultRelInfo, slot, estate));
+		}
+
+		if (!partition_check_passed)
+		{
+			bool	concurrently_deleted;
+
+			/*
+			 * When an UPDATE is run directly on a leaf partition, partition
+			 * tuple routing is not set up. In that case, fail with a
+			 * partition constraint violation error.
+			 */
+			if (mtstate->mt_partition_dispatch_info == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for the DELETE; the rows should be
+			 * returned by the subsequent INSERT instead.
+			 */
+			ExecDelete(tupleid, oldtuple, planSlot, epqstate, estate,
+					   &concurrently_deleted, false, false);
+
+			/*
+			 * The row was already deleted by a concurrent DELETE. So we don't
+			 * have anything to update.
+			 */
+			if (concurrently_deleted)
+				return NULL;
+
+			return ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+		}
+
+		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We have already checked partition constraints above, so skip them
+		 * below.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1329,7 +1475,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 	 */
 
 	/* Execute UPDATE with projection */
-	*returning = ExecUpdate(&tuple.t_self, NULL,
+	*returning = ExecUpdate(mtstate, &tuple.t_self, NULL,
 							mtstate->mt_conflproj, planSlot,
 							&mtstate->mt_epqstate, mtstate->ps.state,
 							canSetTag);
@@ -1411,6 +1557,35 @@ fireASTriggers(ModifyTableState *node)
 	}
 }
 
+/*
+ * Check whether partition key is modified for any of the relations.
+ */
+static bool
+IsPartitionKeyUpdate(EState *estate, ResultRelInfo *result_rels, int num_rels)
+{
+	int		i;
+
+	/*
+	 * Each result relation stores its set of updated columns according to
+	 * its own column ordering. So, for each relation, pull the attnos
+	 * referenced by its partition quals and check whether any of the updated
+	 * column attributes are among them.
+	 */
+	for (i = 0; i < num_rels; i++)
+	{
+		ResultRelInfo *resultRelInfo = &result_rels[i];
+		Relation		rel = resultRelInfo->ri_RelationDesc;
+		Bitmapset	  *expr_attrs = NULL;
+
+		pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs);
+
+		/* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */
+		if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate)))
+			return true;
+	}
+
+	return false;
+}
 
 /* ----------------------------------------------------------------
  *	   ExecModifyTable
@@ -1619,12 +1794,13 @@ ExecModifyTable(ModifyTableState *node)
 								  estate, node->canSetTag);
 				break;
 			case CMD_UPDATE:
-				slot = ExecUpdate(tupleid, oldtuple, slot, planSlot,
+				slot = ExecUpdate(node, tupleid, oldtuple, slot, planSlot,
 								&node->mt_epqstate, estate, node->canSetTag);
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(tupleid, oldtuple, planSlot,
-								&node->mt_epqstate, estate, node->canSetTag);
+								&node->mt_epqstate, estate,
+								NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1664,11 +1840,14 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 {
 	ModifyTableState *mtstate;
 	CmdType		operation = node->operation;
+	bool		is_partitionkey_update = false;
 	int			nplans = list_length(node->plans);
 	ResultRelInfo *saved_resultRelInfo;
 	ResultRelInfo *resultRelInfo;
 	TupleDesc	tupDesc;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
@@ -1780,9 +1959,18 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/* Remember whether it is going to be an update of partition key. */
+	is_partitionkey_update =
+		(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		 operation == CMD_UPDATE &&
+		 IsPartitionKeyUpdate(estate, mtstate->resultRelInfo, nplans));
+
+	/*
+	 * Build state for tuple routing if it's an INSERT or if it's an UPDATE of
+	 * partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || is_partitionkey_update))
 	{
 		PartitionDispatch *partition_dispatch_info;
 		ResultRelInfo *partitions;
@@ -1803,6 +1991,44 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		mtstate->mt_num_partitions = num_partitions;
 		mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
 		mtstate->mt_partition_tuple_slot = partition_tuple_slot;
+
+		/*
+		 * These are needed as reference objects for mapping partition attnos
+		 * in expressions such as WCO and RETURNING lists.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
+	}
+
+	/*
+	 * Construct a mapping from each resultRelInfo's attnos to the root
+	 * relation's attnos. This is required during UPDATE row movement, when
+	 * the tuple descriptor of a source partition does not match the root
+	 * partition's descriptor. In such a case we must convert tuples to the
+	 * root partition's tuple descriptor, because the search for the
+	 * destination partition starts from the root.
+	 */
+	if (is_partitionkey_update)
+	{
+		TupleConversionMap **tup_conv_maps;
+		TupleDesc		outdesc;
+
+		Assert(mtstate->mt_num_partitions > 0);
+
+		mtstate->mt_resultrel_maps = (TupleConversionMap **)
+			palloc0(sizeof(TupleConversionMap *) * nplans);
+
+		/* Get tuple descriptor of the root partition. */
+		outdesc = RelationGetDescr(mtstate->mt_partition_dispatch_info[0]->reldesc);
+
+		resultRelInfo = mtstate->resultRelInfo;
+		tup_conv_maps = mtstate->mt_resultrel_maps;
+		for (i = 0; i < nplans; i++)
+		{
+			TupleDesc indesc = RelationGetDescr(resultRelInfo[i].ri_RelationDesc);
+			tup_conv_maps[i] = convert_tuples_by_name(indesc, outdesc,
+								 gettext_noop("could not convert row type"));
+		}
 	}
 
 	/*
@@ -1835,48 +2061,49 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE
+	 * row movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *firstWco;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition. Whereas for UPDATE, there are as many WCOs as there are
+		 * plans. So in either case, use the WCO expression of the first
+		 * resultRelInfo as a reference to calculate attno's for the WCO
+		 * expression of each of the partitions. We make a copy of the WCO qual
+		 * for each partition. Note that, if there are SubPlans in there, they
+		 * all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
+		Assert(is_partitionkey_update ||
+			   (operation == CMD_INSERT &&
 			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
+			   mtstate->mt_nplans == 1));
+
 		resultRelInfo = mtstate->mt_partitions;
+		firstWco = linitial(node->withCheckOptionLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
 			Relation	partrel = resultRelInfo->ri_RelationDesc;
-			List	   *mapped_wcoList;
+			List	   *mappedWco;
 			List	   *wcoExprs = NIL;
 			ListCell   *ll;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel);
-			foreach(ll, mapped_wcoList)
+			mappedWco = map_partition_varattnos(firstWco, firstVarno,
+												partrel, firstResultRel);
+			foreach(ll, mappedWco)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
 
-			resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
+			resultRelInfo->ri_WithCheckOptions = mappedWco;
 			resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
 			resultRelInfo++;
 		}
@@ -1889,7 +2116,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -1926,20 +2153,23 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
 		resultRelInfo = mtstate->mt_partitions;
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
 			Relation	partrel = resultRelInfo->ri_RelationDesc;
 			List	   *rlist;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel);
+			/*
+			 * Use the returning expression of the first resultRelInfo as a
+			 * reference to calculate attno's for the returning expression of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList, firstVarno,
+											partrel, firstResultRel);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 									 resultRelInfo->ri_RelationDesc->rd_att);
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 0a1e468..91db4df 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -79,8 +79,8 @@ extern void check_new_partition_bound(char *relname, Relation parent,
 extern Oid	get_partition_parent(Oid relid);
 extern List *get_qual_from_partbound(Relation rel, Relation parent,
 						PartitionBoundSpec *spec);
-extern List *map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent);
+extern List *map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
 
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 8cc5f3a..7fe471f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,7 +187,10 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
@@ -216,6 +219,8 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot, EState *estate);
 
 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
 extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index d33392f..3c96bf0 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -943,8 +943,12 @@ typedef struct ModifyTableState
 	int			mt_num_partitions;		/* Number of members in the following
 										 * arrays */
 	ResultRelInfo *mt_partitions;		/* Per partition result relation */
-	TupleConversionMap **mt_partition_tupconv_maps;
+
 	/* Per partition tuple conversion map */
+	TupleConversionMap **mt_partition_tupconv_maps;
+	/* Per resultRelInfo conversion map to convert tuples to root partition */
+	TupleConversionMap **mt_resultrel_maps;
+
 	TupleTableSlot *mt_partition_tuple_slot;
 } ModifyTableState;
 
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 9366f04..4073f6f 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,25 +198,187 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c int
 ) partition by range (a, b);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
 create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
 create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
+create table part_c_1_100 (b int, c int, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+create table part_c_100_200 (c int, a text, b int);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
 insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
+insert into part_a_10_a_20 values ('a', 10, 200);
+insert into part_c_1_100 (a, b, c) values ('b', 12, 96);
+insert into part_c_1_100 (a, b, c) values ('b', 13, 97);
+insert into part_c_100_200 (a, b, c) values ('b', 15, 105);
+insert into part_c_100_200 (a, b, c) values ('b', 17, 105);
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96;
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (12, 116, b).
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+select a, b, c from part_c_1_100 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select a, b, c from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 12 | 116
+ b | 13 | 117
+ b | 15 | 125
+ b | 17 | 125
+(4 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
 ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+DETAIL:  Failing row contains (b, 7, 117).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+select tableoid::regclass partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_1_a_10  | a |  1 |    
+ part_a_1_a_10  | a |  4 | 200
+ part_b_1_b_10  | b |  7 | 117
+ part_b_1_b_10  | b |  9 | 125
+ part_c_100_200 | b | 11 | 125
+ part_c_100_200 | b | 12 | 116
+(6 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (120, b, 15).
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+select tableoid::regclass partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_1_a_10  | a |  1 |    
+ part_b_1_b_10  | b |  7 | 117
+ part_b_1_b_10  | b |  9 | 125
+ part_c_100_200 | b | 11 | 125
+ part_c_100_200 | b | 12 | 116
+ part_c_100_200 | b | 15 | 199
+(6 rows)
+
 -- cleanup
-drop table range_parted;
+drop view upview;
+drop table mintab, range_parted;
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+-- Setup
+--------
+create table list_parted (a int, b int, c int) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+create table sub_part1(b int, c int, a int);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int, a int);
+alter table sub_parted attach partition sub_part2 for values in (2);
+create table list_part1(a int, b int, c int);
+alter table list_parted attach partition list_part1 for values in (2,3);
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+ERROR:  new row for relation "sub_parted" violates partition constraint
+DETAIL:  Failing row contains (2, 2, 10).
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Triggers should not be allowed to initiate the update row movement
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing the partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+select tableoid::regclass , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+-- This should fail because trigger on sub_part1 would change column 'b' which
+-- would violate "b in (1)" constraint.
+update list_parted set c = 70 where b  = 1 ;
+ERROR:  new row for relation "sub_part1" violates partition constraint
+DETAIL:  Failing row contains (2, 70, 1).
+drop trigger parted_mod_b ON sub_part1 ;
+-- Now that the trigger is dropped, the same update should succeed
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop function func_parted_mod_b ( ) ;
+drop table list_parted;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 6637119..34da9c8 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,23 +107,123 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c int
 ) partition by range (a, b);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+
 create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
 create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
+create table part_c_1_100 (b int, c int, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+create table part_c_100_200 (c int, a text, b int);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
 insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
-
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
+insert into part_a_10_a_20 values ('a', 10, 200);
+insert into part_c_1_100 (a, b, c) values ('b', 12, 96);
+insert into part_c_1_100 (a, b, c) values ('b', 13, 97);
+insert into part_c_100_200 (a, b, c) values ('b', 15, 105);
+insert into part_c_100_200 (a, b, c) values ('b', 17, 105);
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96;
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+select a, b, c from part_c_1_100 order by 1, 2, 3;
+select a, b, c from part_c_100_200 order by 1, 2, 3;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+select tableoid::regclass partname, * from range_parted order by 1, 2, 3, 4;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+
+select tableoid::regclass partname, * from range_parted order by 1, 2, 3, 4;
 
 -- cleanup
-drop table range_parted;
+drop view upview;
+drop table mintab, range_parted;
+
+
+
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+
+-- Setup
+--------
+create table list_parted (a int, b int, c int) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+
+create table sub_part1(b int, c int, a int);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int, a int);
+alter table sub_parted attach partition sub_part2 for values in (2);
+
+create table list_part1(a int, b int, c int);
+alter table list_parted attach partition list_part1 for values in (2,3);
+
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+
+-- UPDATE that does not modify the partition key of the partitions chosen for update.
+select tableoid::regclass, * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass, * from list_parted where a = 2 order by 1;
+
+
+-----------
+-- Triggers should not be allowed to initiate the update row movement
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing the partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+select tableoid::regclass, * from list_parted order by 1, 2, 3, 4;
+
+-- This should fail because trigger on sub_part1 would change column 'b' which
+-- would violate "b in (1)" constraint.
+update list_parted set c = 70 where b = 1;
+drop trigger parted_mod_b on sub_part1;
+-- Now that the trigger is dropped, the same update should succeed
+update list_parted set c = 70 where b = 1;
+select tableoid::regclass, * from list_parted order by 1, 2, 3, 4;
+
+drop function func_parted_mod_b();
+drop table list_parted;
#97Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#96)
1 attachment(s)
Re: UPDATE of partition key

When I tested a partition-key update on a partitioned table having no
child partitions, it crashed. This is because there is an
Assert(mtstate->mt_num_partitions > 0) for creating the
partition-to-root map, which fails if there are no partitions under
the partitioned table. We should instead skip creating this map if
the partitioned table on which the UPDATE is run has no partitions
under it. So the attached patch has this new change to fix it (with an
appropriate additional test case added):

--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -2006,15 +2006,14 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * descriptor of a source partition does not match the root partition
 	 * descriptor. In such case we need to convert tuples to the root partition
 	 * tuple descriptor, because the search for destination partition starts
-	 * from the root.
+	 * from the root. Skip this setup if it's not a partition key update or if
+	 * there are no partitions below this partitioned table.
 	 */
-	if (is_partitionkey_update)
+	if (is_partitionkey_update && mtstate->mt_num_partitions > 0)
 	{
 		TupleConversionMap **tup_conv_maps;
 		TupleDesc		outdesc;
 
-		Assert(mtstate->mt_num_partitions > 0);
-
 		mtstate->mt_resultrel_maps =
 			(TupleConversionMap **)
 			palloc0(sizeof(TupleConversionMap *) * nplans);

On 15 June 2017 at 23:06, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 13 June 2017 at 15:40, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

While rebasing my patch for the below recent commit, I realized that a
similar issue exists for the update-tuple-routing patch as well:

commit 78a030a441966d91bc7e932ef84da39c3ea7d970
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon Jun 12 23:29:44 2017 -0400

Fix confusion about number of subplans in partitioned INSERT setup.

The above issue was about incorrectly using 'i' in
mtstate->mt_plans[i] while handling WITH CHECK OPTIONS in
ExecInitModifyTable(), where 'i' was actually meant to refer to
positions in mtstate->mt_partitions[]. Actually, for INSERT there is
only a single plan element in the mtstate->mt_plans[] array.

Similarly, for update-tuple routing, we cannot use
mtstate->mt_plans[i], because 'i' refers to a position in
mtstate->mt_partitions[], whereas mtstate->mt_plans is not at all in
the order of mtstate->mt_partitions; in fact, mt_plans has only the
plans that are to be scanned, i.e. those for the partitions that
survive pruning, so it can well be a small subset of the total
partitions.
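The index mismatch described here can be illustrated with a small Python sketch (all names below are invented stand-ins for the C arrays, not PostgreSQL code): a subplan array built only for surviving partitions cannot safely be indexed by a position taken from the full partition array.

```python
# Toy model of the two index spaces. After pruning, subplans exist only for
# the surviving partitions, so "partition index" and "subplan index" differ.
all_partitions = ["p0", "p1", "p2", "p3", "p4"]   # analogue of mt_partitions[]
surviving = ["p1", "p3"]                          # partitions left after pruning
subplans = [f"plan({p})" for p in surviving]      # analogue of mt_plans[]

# Correct lookup: translate a partition into its subplan position, if any.
subplan_index = {p: i for i, p in enumerate(surviving)}

def plan_for_partition(part):
    """Return the subplan for a partition, or None if it was pruned away."""
    i = subplan_index.get(part)
    return subplans[i] if i is not None else None

assert plan_for_partition("p3") == "plan(p3)"
# Indexing subplans[] directly with the partition's own position (3) would
# read out of range, since only 2 subplans exist:
assert all_partitions.index("p3") >= len(subplans)
```

The same confusion in reverse (indexing the full partition array with a subplan position) silently fetches the wrong partition, which is the bug class the quoted commit fixed for INSERT.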

I am working on an updated patch to fix the above.

Attached patch v10 fixes the above. The existing code that builds WCO
constraints for each leaf partition is, with the patch, applicable to
row-movement updates as well, so the assertions in that code are
updated to allow this. Secondly, the mapping for each of the leaf
partitions was constructed using the root partition attributes. Now,
in the patch, mtstate->resultRelInfo[0] (i.e. the first resultRelInfo)
is used as the reference. So effectively, map_partition_varattnos()
now represents not just a parent-to-partition mapping, but a mapping
between any two partitions/partitioned tables. It's done this way so
that we can have common WCO-building code for inserts as well as
updates. For example, for inserts the first (and only) WCO belongs to
node->nominalRelation, so nominalRelation is used for
map_partition_varattnos(), whereas for updates the first WCO belongs
to the first resultRelInfo, which is not the same as nominalRelation.
So in the patch, in both cases we use the first resultRelInfo and the
WCO of the first resultRelInfo for map_partition_varattnos().

A similar thing is done for the RETURNING expressions.
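As a rough illustration of what a generalized attno mapping between any two relations computes, here is a Python sketch (an invented helper, not the actual C implementation, which works via convert_tuples_by_name_map() and map_variable_attnos()): it matches columns by name and reports, for each attribute number in one relation, the corresponding attribute number in the other.

```python
def attno_map(from_cols, to_cols):
    """For each 1-based attno in from_cols, give the matching attno in to_cols.

    Mimics a name-based attribute-number conversion map between two
    relations whose physical column orderings differ.
    """
    pos = {name: i + 1 for i, name in enumerate(to_cols)}
    return [pos[name] for name in from_cols]

root = ["a", "b", "c"]   # e.g. list_parted(a, b, c)
leaf = ["b", "c", "a"]   # e.g. sub_part1(b, c, a): same columns, new ordering

# Var "a" is attno 1 in the root but attno 3 in the leaf, and so on.
assert attno_map(root, leaf) == [3, 1, 2]
# The mapping works in either direction, not just parent-to-partition.
assert attno_map(leaf, root) == [2, 3, 1]
```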

---------

Another change in the patch: for ExecInitQual() on WCO quals,
mtstate->ps is used as the parent, rather than the first plan. For
updates, the first plan does not belong to the parent partition. In
fact, I think in all cases we should use mtstate->ps as the parent;
mtstate->mt_plans[0] does not look like it should be considered the
parent of these expressions. Maybe it does not matter which parent we
link these quals to, because there is no ReScan for ExecModifyTable().

Note that for RETURNING projection expressions, we do use mtstate->ps.

--------

There is another issue I discovered. The row movement works fine if
the destination leaf partition has a different attribute ordering than
the root: the existing insert-tuple-routing mapping handles that. But
if the source partition has a different ordering w.r.t. the root,
there is a problem: there is no mapping in the opposite direction,
i.e. from the leaf to the root. And we require that, because the tuple
of the source leaf partition needs to be converted to the root
partition's tuple descriptor, since ExecFindPartition() starts with
the root.

To fix this, I have introduced another mapping array,
mtstate->mt_resultrel_maps[]. This corresponds to
mtstate->resultRelInfo[]. We don't require a per-leaf-partition
mapping, because the update result relations are a pruned subset of
the total leaf partitions.

So in ExecInsert, before calling ExecFindPartition(), we need to
convert the leaf partition tuple to the root descriptor using this
reverse mapping. Since we need to convert the tuple here, and again
after ExecFindPartition() for the found leaf partition, I have
replaced the common code with a new function,
ConvertPartitionTupleSlot().
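The leaf-to-root-to-leaf round trip can be modeled with a toy Python sketch (invented names; the real code works on tuple descriptors and TupleConversionMaps): the updated tuple is first reordered into the root's column layout so the partition search can start at the root, then reordered again into the chosen destination partition's layout.

```python
def convert(tup, from_cols, to_cols):
    """Reorder a tuple from one column layout to another by column name."""
    src = dict(zip(from_cols, tup))
    return tuple(src[c] for c in to_cols)

root_cols = ["a", "b", "c"]
src_leaf_cols = ["b", "c", "a"]   # source partition with a different layout
dst_leaf_cols = ["c", "a", "b"]   # destination partition, yet another layout

updated = (2, 10, 1)              # (b, c, a) in the source leaf's layout

# Reverse map (leaf -> root): needed before the partition search starts.
as_root = convert(updated, src_leaf_cols, root_cols)
assert as_root == (1, 2, 10)      # (a, b, c)

# ... partition search on the root-layout tuple picks the destination ...

# Forward map (root -> destination leaf): the existing routing conversion.
as_dst = convert(as_root, root_cols, dst_leaf_cols)
assert as_dst == (10, 1, 2)       # (c, a, b)
```

Factoring both conversions through one helper mirrors why the patch wraps the shared logic in a single function.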

-------

Used a new flag is_partitionkey_update in ExecInitModifyTable(), which
can be re-used in subsequent sections rather than calling the
IsPartitionKeyUpdate() function again.

-------

Some more test scenarios are added to cover the above changes:
basically, partitions that have different tuple descriptors than their
parents.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

update-partition-key_v11.patchapplication/octet-stream; name=update-partition-key_v11.patchDownload
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index b05a9c2..5a436a1 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -2993,6 +2993,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition where this row satisfies its partition constraint.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3285,9 +3290,20 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</> causes a row to move from one partition to
+       another, there is a chance that another concurrent <command>UPDATE</> or
+       <command>DELETE</> misses this row. Suppose, during the row movement,
+       the row is still visible for the concurrent session, and it is about to
+       do an <command>UPDATE</> or <command>DELETE</> operation on the same
+       row. This DML operation can silently miss this row if the row now gets
+       deleted from the partition by the first session as part of its
+       <command>UPDATE</> row movement. In such case, the concurrent
+       <command>UPDATE</>/<command>DELETE</>, being unaware of the row
+       movement, interprets that the row has just been deleted so there is
+       nothing to be done for this row. Whereas, in the usual case where the
+       table is not partitioned, or where there is no row movement, the second
+       session would have identified the newly updated row and carried
+       <command>UPDATE</>/<command>DELETE</> on this new row version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 8a1619f..28cfc1a 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</> and
+   <command>INSERT</> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</> or <command>DELETE</> on the same row may miss
+   this row. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index 950245d..72300a0 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -160,6 +160,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it will be performed as a
+    <command>DELETE</command> from the original partition followed by
+    <command>INSERT</command> into the new partition. In this case, all
+    row-level <literal>BEFORE</> <command>UPDATE</command> triggers and all
+    row-level <literal>BEFORE</> <command>DELETE</command> triggers are fired
+    on the original partition. Then all row-level <literal>BEFORE</>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</>
+    triggers are concerned, <literal>AFTER</> <command>DELETE</command> and
+    <literal>AFTER</> <command>INSERT</command> triggers are applied; but
+    <literal>AFTER</> <command>UPDATE</command> triggers are not applied
+    because the <command>UPDATE</command> has been converted to a
+    <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index a7c9b9a..cacf8fb 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -921,7 +921,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
 
 /*
  * map_partition_varattnos - maps varattno of any Vars in expr from the
- * parent attno to partition attno.
+ * attno's of 'from_rel' partition to the attno's of 'to_rel' partition.
+ * The rels can be both leaf partition or a partitioned table.
  *
  * We must allow for cases where physical attnos of a partition can be
  * different from the parent's.
@@ -931,8 +932,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
  * are working on Lists, so it's less messy to do the casts internally.
  */
 List *
-map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent)
+map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel)
 {
 	AttrNumber *part_attnos;
 	bool		found_whole_row;
@@ -940,13 +941,13 @@ map_partition_varattnos(List *expr, int target_varno,
 	if (expr == NIL)
 		return NIL;
 
-	part_attnos = convert_tuples_by_name_map(RelationGetDescr(partrel),
-											 RelationGetDescr(parent),
+	part_attnos = convert_tuples_by_name_map(RelationGetDescr(to_rel),
+											 RelationGetDescr(from_rel),
 								 gettext_noop("could not convert row type"));
 	expr = (List *) map_variable_attnos((Node *) expr,
-										target_varno, 0,
+										fromrel_varno, 0,
 										part_attnos,
-										RelationGetDescr(parent)->natts,
+										RelationGetDescr(from_rel)->natts,
 										&found_whole_row);
 	/* There can never be a whole-row reference here */
 	if (found_whole_row)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index ae79a2f..d9818b7 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2658,7 +2658,7 @@ CopyFrom(CopyState cstate)
 
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr || check_partition_constr)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 7f460bd..b29b12f 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -103,8 +103,6 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
 									 int maxfieldlen);
 static void EvalPlanQualStart(EPQState *epqstate, EState *parentestate,
 				  Plan *planTree);
-static void ExecPartitionCheck(ResultRelInfo *resultRelInfo,
-				   TupleTableSlot *slot, EState *estate);
 
 /*
  * Note that GetUpdatedColumns() also exists in commands/trigger.c.  There does
@@ -1823,15 +1821,10 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
 /*
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  */
-static void
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
-	Relation	rel = resultRelInfo->ri_RelationDesc;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	Bitmapset  *modifiedCols;
-	Bitmapset  *insertedCols;
-	Bitmapset  *updatedCols;
 	ExprContext *econtext;
 
 	/*
@@ -1859,51 +1852,65 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
 	 */
-	if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+	return ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext);
+}
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
-		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
+
+	/* See the comments in ExecConstraints. */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
 								 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+		if (map != NULL)
+		{
+			tuple = do_convert_tuple(tuple, map);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-		  errmsg("new row for relation \"%s\" violates partition constraint",
-				 RelationGetRelationName(orig_rel)),
-			val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+	  errmsg("new row for relation \"%s\" violates partition constraint",
+			 RelationGetRelationName(orig_rel)),
+		val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if requested,
+ * checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1911,7 +1918,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2024,8 +2032,9 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 }
 
 
@@ -3312,8 +3321,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple it if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 	result = get_partition_for_tuple(pd, slot, estate,
 									 &failed_at, &failed_slot);
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index c6a66b6..7e82482 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -389,7 +389,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -448,7 +448,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index ff5ad98..a658ee7 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -45,6 +45,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -53,6 +54,8 @@
 #include "utils/rel.h"
 #include "utils/tqual.h"
 
+#define GetUpdatedColumns(relinfo, estate) \
+	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)
 
 static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 ResultRelInfo *resultRelInfo,
@@ -239,6 +242,34 @@ ExecCheckTIDVisible(EState *estate,
 	ReleaseBuffer(buffer);
 }
 
+/*
+ * ConvertPartitionTupleSlot -- convenience function for converting tuple and
+ * storing it into a dedicated partition tuple slot. Passes the partition
+ * tuple slot back into output param p_slot. If no mapping present, keeps
+ * p_slot unchanged.
+ *
+ * Returns the converted tuple.
+ */
+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate, TupleConversionMap *map,
+						  HeapTuple tuple, TupleTableSlot **p_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor, as per converted tuple.
+	 */
+	*p_slot = mtstate->mt_partition_tuple_slot;
+	Assert(*p_slot != NULL);
+	ExecSetSlotDescriptor(*p_slot, map->outdesc);
+	ExecStoreTuple(tuple, *p_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
 /* ----------------------------------------------------------------
  *		ExecInsert
  *
@@ -280,7 +311,38 @@ ExecInsert(ModifyTableState *mtstate,
 	if (mtstate->mt_partition_dispatch_info)
 	{
 		int			leaf_part_index;
-		TupleConversionMap *map;
+		ResultRelInfo *rootResultRelInfo;
+
+		/*
+		 * If the original operation is UPDATE, the root partition rel needs to
+		 * be fetched from mtstate->rootResultRelInfo.
+		 */
+		rootResultRelInfo = (mtstate->rootResultRelInfo ?
+							 mtstate->rootResultRelInfo : resultRelInfo);
+
+		/*
+		 * If the resultRelInfo is not the root partition (which happens for
+		 * UPDATE), we should convert the tuple into root partition's tuple
+		 * descriptor, since ExecFindPartition() starts the search from root.
+		 * The tuple conversion map list is in the order of
+		 * mtstate->resultRelInfo[], so to retrieve the one for this resultRel,
+		 * we need to know the position of the resultRel in
+		 * mtstate->resultRelInfo[]. Note: We assume that if the resultRelInfo
+		 * does not belong to subplans, then it already matches the root tuple
+		 * descriptor; although there is no such known scenario where this
+		 * could happen.
+		 */
+		if (rootResultRelInfo != resultRelInfo &&
+			mtstate->mt_resultrel_maps != NULL &&
+			resultRelInfo >= mtstate->resultRelInfo &&
+			resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans-1)
+		{
+			int		map_index = resultRelInfo - mtstate->resultRelInfo;
+
+			tuple = ConvertPartitionTupleSlot(mtstate,
+									  mtstate->mt_resultrel_maps[map_index],
+									  tuple, &slot);
+		}
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
@@ -290,7 +352,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
-		leaf_part_index = ExecFindPartition(resultRelInfo,
+		leaf_part_index = ExecFindPartition(rootResultRelInfo,
 										 mtstate->mt_partition_dispatch_info,
 											slot,
 											estate);
@@ -317,23 +379,9 @@ ExecInsert(ModifyTableState *mtstate,
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = mtstate->mt_partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = mtstate->mt_partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(mtstate,
+						mtstate->mt_partition_tupconv_maps[leaf_part_index],
+					    tuple, &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -451,7 +499,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 		/* Check the constraints of the tuple */
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -641,6 +689,8 @@ ExecDelete(ItemPointer tupleid,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool   *concurrently_deleted,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -649,6 +699,9 @@ ExecDelete(ItemPointer tupleid,
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
 
+	if (concurrently_deleted)
+		*concurrently_deleted = false;
+
 	/*
 	 * get information on the (current) result relation
 	 */
@@ -792,6 +845,8 @@ ldelete:;
 					}
 				}
 				/* tuple already deleted; nothing to do */
+				if (concurrently_deleted)
+					*concurrently_deleted = true;
 				return NULL;
 
 			default:
@@ -815,8 +870,8 @@ ldelete:;
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -894,7 +949,8 @@ ldelete:;
  * ----------------------------------------------------------------
  */
 static TupleTableSlot *
-ExecUpdate(ItemPointer tupleid,
+ExecUpdate(ModifyTableState *mtstate,
+		   ItemPointer tupleid,
 		   HeapTuple oldtuple,
 		   TupleTableSlot *slot,
 		   TupleTableSlot *planSlot,
@@ -908,6 +964,8 @@ ExecUpdate(ItemPointer tupleid,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	bool		partition_check_passed = true;
+	bool		has_br_trigger;
 
 	/*
 	 * abort the operation if not running transactions
@@ -928,16 +986,56 @@ ExecUpdate(ItemPointer tupleid,
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
 
 	/* BEFORE ROW UPDATE Triggers */
-	if (resultRelInfo->ri_TrigDesc &&
-		resultRelInfo->ri_TrigDesc->trig_update_before_row)
+	has_br_trigger = (resultRelInfo->ri_TrigDesc &&
+					  resultRelInfo->ri_TrigDesc->trig_update_before_row);
+
+	if (has_br_trigger)
 	{
-		slot = ExecBRUpdateTriggers(estate, epqstate, resultRelInfo,
-									tupleid, oldtuple, slot);
+		TupleTableSlot *trig_slot;
 
-		if (slot == NULL)		/* "do nothing" */
+		trig_slot = ExecBRUpdateTriggers(estate, epqstate, resultRelInfo,
+										 tupleid, oldtuple, slot);
+
+		if (trig_slot == NULL)		/* "do nothing" */
 			return NULL;
 
+		if (resultRelInfo->ri_PartitionCheck)
+		{
+			bool		partition_check_passed_with_trig_tuple;
+
+			partition_check_passed =
+				(resultRelInfo->ri_PartitionCheck &&
+				 ExecPartitionCheck(resultRelInfo, slot, estate));
+
+			partition_check_passed_with_trig_tuple =
+				(resultRelInfo->ri_PartitionCheck &&
+				 ExecPartitionCheck(resultRelInfo, trig_slot, estate));
+
+			if (partition_check_passed)
+			{
+				/*
+				 * If it's the trigger that is causing partition constraint
+				 * violation, abort. We don't want a trigger to cause tuple
+				 * routing.
+				 */
+				if (!partition_check_passed_with_trig_tuple)
+					ExecPartitionCheckEmitError(resultRelInfo,
+												trig_slot, estate);
+			}
+			else
+			{
+				/*
+				 * Partition constraint failed with original NEW tuple. But the
+				 * trigger might even have modifed the tuple such that it fits
+			 * trigger might even have modified the tuple such that it fits
+				 * should be based on *final* NEW tuple.
+				 */
+				partition_check_passed = partition_check_passed_with_trig_tuple;
+			}
+		}
+
 		/* trigger might have changed tuple */
+		slot = trig_slot;
 		tuple = ExecMaterializeSlot(slot);
 	}
 
@@ -1004,12 +1102,60 @@ lreplace:;
 								 resultRelInfo, slot, estate);
 
 		/*
+		 * If a partition check fails, try to move the row into the right
+		 * partition. With a BR trigger, the tuple has already gone through EPQ
+		 * and has been locked; so it won't change again. So, avoid an extra
+		 * partition check if we already did it above in the presence of BR
+		 * triggers.
+		 */
+		if (!has_br_trigger)
+		{
+			partition_check_passed =
+				(!resultRelInfo->ri_PartitionCheck ||
+				ExecPartitionCheck(resultRelInfo, slot, estate));
+		}
+
+		if (!partition_check_passed)
+		{
+			bool	concurrently_deleted;
+
+			/*
+			 * When an UPDATE is run with a leaf partition, we would not have
+			 * partition tuple routing setup. In that case, fail with partition
+			 * constraint violation error.
+			 */
+			if (mtstate->mt_partition_dispatch_info == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for DELETE. We want to return rows
+			 * from INSERT.
+			 */
+			ExecDelete(tupleid, oldtuple, planSlot, epqstate, estate,
+					   &concurrently_deleted, false, false);
+
+			/*
+			 * The row was already deleted by a concurrent DELETE. So we don't
+			 * have anything to update.
+			 */
+			if (concurrently_deleted)
+				return NULL;
+
+			return ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+		}
+
+		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We have already checked partition constraints above, so skip them
+		 * below.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1329,7 +1475,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 	 */
 
 	/* Execute UPDATE with projection */
-	*returning = ExecUpdate(&tuple.t_self, NULL,
+	*returning = ExecUpdate(mtstate, &tuple.t_self, NULL,
 							mtstate->mt_conflproj, planSlot,
 							&mtstate->mt_epqstate, mtstate->ps.state,
 							canSetTag);
@@ -1411,6 +1557,35 @@ fireASTriggers(ModifyTableState *node)
 	}
 }
 
+/*
+ * Check whether partition key is modified for any of the relations.
+ */
+static bool
+IsPartitionKeyUpdate(EState *estate, ResultRelInfo *result_rels, int num_rels)
+{
+	int		i;
+
+	/*
+	 * Each of the result relations has the updated columns set stored
+	 * according to its own column ordering. So we need to pull the attno of
+	 * the partition quals of each of the relations, and check if the updated
+	 * column attributes are present in the vars in the partition quals.
+	 */
+	for (i = 0; i < num_rels; i++)
+	{
+		ResultRelInfo *resultRelInfo = &result_rels[i];
+		Relation		rel = resultRelInfo->ri_RelationDesc;
+		Bitmapset	  *expr_attrs = NULL;
+
+		pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs);
+
+		/* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */
+		if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate)))
+			return true;
+	}
+
+	return false;
+}
 
 /* ----------------------------------------------------------------
  *	   ExecModifyTable
@@ -1619,12 +1794,13 @@ ExecModifyTable(ModifyTableState *node)
 								  estate, node->canSetTag);
 				break;
 			case CMD_UPDATE:
-				slot = ExecUpdate(tupleid, oldtuple, slot, planSlot,
+				slot = ExecUpdate(node, tupleid, oldtuple, slot, planSlot,
 								&node->mt_epqstate, estate, node->canSetTag);
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(tupleid, oldtuple, planSlot,
-								&node->mt_epqstate, estate, node->canSetTag);
+								&node->mt_epqstate, estate,
+								NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1664,11 +1840,14 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 {
 	ModifyTableState *mtstate;
 	CmdType		operation = node->operation;
+	bool		is_partitionkey_update = false;
 	int			nplans = list_length(node->plans);
 	ResultRelInfo *saved_resultRelInfo;
 	ResultRelInfo *resultRelInfo;
 	TupleDesc	tupDesc;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
@@ -1780,9 +1959,18 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/* Remember whether it is going to be an update of partition key. */
+	is_partitionkey_update =
+		(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		 operation == CMD_UPDATE &&
+		 IsPartitionKeyUpdate(estate, mtstate->resultRelInfo, nplans));
+
+	/*
+	 * Build state for tuple routing if it's an INSERT or if it's an UPDATE of
+	 * partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || is_partitionkey_update))
 	{
 		PartitionDispatch *partition_dispatch_info;
 		ResultRelInfo *partitions;
@@ -1803,6 +1991,43 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		mtstate->mt_num_partitions = num_partitions;
 		mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
 		mtstate->mt_partition_tuple_slot = partition_tuple_slot;
+
+		/*
+		 * The following are needed as reference objects for mapping
+		 * partition attnos in expressions such as WCO and RETURNING.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
+	}
+
+	/*
+	 * Construct a mapping from each resultRelInfo's attnos to the root
+	 * relation's attnos. This is required when, during update row movement,
+	 * the tuple descriptor of a source partition does not match the root
+	 * partition's descriptor. In such a case we need to convert tuples to
+	 * the root partition's tuple descriptor, because the search for the
+	 * destination partition starts from the root. Skip this setup if it's not
+	 * a partition key update or if there are no partitions below this table.
+	 */
+	if (is_partitionkey_update && mtstate->mt_num_partitions > 0)
+	{
+		TupleConversionMap **tup_conv_maps;
+		TupleDesc		outdesc;
+
+		mtstate->mt_resultrel_maps = (TupleConversionMap **)
+			palloc0(sizeof(TupleConversionMap *) * nplans);
+
+		/* Get tuple descriptor of the root partition. */
+		outdesc = RelationGetDescr(mtstate->mt_partition_dispatch_info[0]->reldesc);
+
+		resultRelInfo = mtstate->resultRelInfo;
+		tup_conv_maps = mtstate->mt_resultrel_maps;
+		for (i = 0; i < nplans; i++)
+		{
+			TupleDesc indesc = RelationGetDescr(resultRelInfo[i].ri_RelationDesc);
+			tup_conv_maps[i] = convert_tuples_by_name(indesc, outdesc,
+								 gettext_noop("could not convert row type"));
+		}
 	}
 
 	/*
@@ -1835,48 +2060,49 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE
+	 * row movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *firstWco;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition. Whereas for UPDATE, there are as many WCOs as there are
+		 * plans. So in either case, use the WCO expression of the first
+		 * resultRelInfo as a reference to calculate attno's for the WCO
+		 * expression of each of the partitions. We make a copy of the WCO qual
+		 * for each partition. Note that, if there are SubPlans in there, they
+		 * all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
+		Assert(is_partitionkey_update ||
+			   (operation == CMD_INSERT &&
 			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
+			   mtstate->mt_nplans == 1));
+
 		resultRelInfo = mtstate->mt_partitions;
+		firstWco = linitial(node->withCheckOptionLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
 			Relation	partrel = resultRelInfo->ri_RelationDesc;
-			List	   *mapped_wcoList;
+			List	   *mappedWco;
 			List	   *wcoExprs = NIL;
 			ListCell   *ll;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel);
-			foreach(ll, mapped_wcoList)
+			mappedWco = map_partition_varattnos(firstWco, firstVarno,
+												partrel, firstResultRel);
+			foreach(ll, mappedWco)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
 
-			resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
+			resultRelInfo->ri_WithCheckOptions = mappedWco;
 			resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
 			resultRelInfo++;
 		}
@@ -1889,7 +2115,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -1926,20 +2152,23 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
 		resultRelInfo = mtstate->mt_partitions;
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
 			Relation	partrel = resultRelInfo->ri_RelationDesc;
 			List	   *rlist;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel);
+			/*
+			 * Use the returning expression of the first resultRelInfo as a
+			 * reference to calculate attno's for the returning expression of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList, firstVarno,
+											partrel, firstResultRel);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 									 resultRelInfo->ri_RelationDesc->rd_att);
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 0a1e468..91db4df 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -79,8 +79,8 @@ extern void check_new_partition_bound(char *relname, Relation parent,
 extern Oid	get_partition_parent(Oid relid);
 extern List *get_qual_from_partbound(Relation rel, Relation parent,
 						PartitionBoundSpec *spec);
-extern List *map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent);
+extern List *map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
 
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 8cc5f3a..7fe471f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,7 +187,10 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
@@ -216,6 +219,8 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot, EState *estate);
 
 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
 extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 9c08528..0789587 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -943,8 +943,12 @@ typedef struct ModifyTableState
 	int			mt_num_partitions;		/* Number of members in the following
 										 * arrays */
 	ResultRelInfo *mt_partitions;		/* Per partition result relation */
-	TupleConversionMap **mt_partition_tupconv_maps;
+
 	/* Per partition tuple conversion map */
+	TupleConversionMap **mt_partition_tupconv_maps;
+	/* Per resultRelInfo conversion map to convert tuples to root partition */
+	TupleConversionMap **mt_resultrel_maps;
+
 	TupleTableSlot *mt_partition_tuple_slot;
 } ModifyTableState;
 
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 9366f04..f3c03a7 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,25 +198,189 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c int
 ) partition by range (a, b);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
 create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
 create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+create table part_c_1_100 (b int, c int, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+create table part_c_100_200 (c int, a text, b int);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
 insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
+insert into part_a_10_a_20 values ('a', 10, 200);
+insert into part_c_1_100 (a, b, c) values ('b', 12, 96);
+insert into part_c_1_100 (a, b, c) values ('b', 13, 97);
+insert into part_c_100_200 (a, b, c) values ('b', 15, 105);
+insert into part_c_100_200 (a, b, c) values ('b', 17, 105);
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96;
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (12, 116, b).
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+select a, b, c from part_c_1_100 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select a, b, c from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 12 | 116
+ b | 13 | 117
+ b | 15 | 125
+ b | 17 | 125
+(4 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
 ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+DETAIL:  Failing row contains (b, 7, 117).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+select tableoid::regclass partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_1_a_10  | a |  1 |    
+ part_a_1_a_10  | a |  4 | 200
+ part_b_1_b_10  | b |  7 | 117
+ part_b_1_b_10  | b |  9 | 125
+ part_c_100_200 | b | 11 | 125
+ part_c_100_200 | b | 12 | 116
+(6 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (120, b, 15).
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+select tableoid::regclass partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_1_a_10  | a |  1 |    
+ part_b_1_b_10  | b |  7 | 117
+ part_b_1_b_10  | b |  9 | 125
+ part_c_100_200 | b | 11 | 125
+ part_c_100_200 | b | 12 | 116
+ part_c_100_200 | b | 15 | 199
+(6 rows)
+
 -- cleanup
-drop table range_parted;
+drop view upview;
+drop table mintab, range_parted;
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+-- Setup
+--------
+create table list_parted (a int, b int, c int) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+create table sub_part1(b int, c int, a int);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int, a int);
+alter table sub_parted attach partition sub_part2 for values in (2);
+create table list_part1(a int, b int, c int);
+alter table list_parted attach partition list_part1 for values in (2,3);
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+-- Test partition constraint violation when an intermediate ancestor is used
+-- and the constraint is inherited from the root of the partition tree.
+update sub_parted set a = 2 where c = 10;
+ERROR:  new row for relation "sub_parted" violates partition constraint
+DETAIL:  Failing row contains (2, 2, 10).
+-- UPDATE which does not modify the partition key of the partitions chosen for update.
+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Triggers should not be allowed to initiate the update row movement
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing the partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+select tableoid::regclass , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+-- This should fail because the trigger on sub_part1 would change column 'b',
+-- which would violate the "b in (1)" constraint.
+update list_parted set c = 70 where b  = 1 ;
+ERROR:  new row for relation "sub_part1" violates partition constraint
+DETAIL:  Failing row contains (2, 70, 1).
+drop trigger parted_mod_b ON sub_part1 ;
+-- Now that the trigger is dropped, the same update should succeed
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop function func_parted_mod_b ( ) ;
+drop table list_parted;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 6637119..0113c7d 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,23 +107,128 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c int
 ) partition by range (a, b);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+
 create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
 create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
+
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
+create table part_c_1_100 (b int, c int, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+create table part_c_100_200 (c int, a text, b int);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+
+insert into part_a_1_a_10 values ('a', 1);
+insert into part_a_10_a_20 values ('a', 10, 200);
+insert into part_c_1_100 (a, b, c) values ('b', 12, 96);
+insert into part_c_1_100 (a, b, c) values ('b', 13, 97);
+insert into part_c_100_200 (a, b, c) values ('b', 15, 105);
+insert into part_c_100_200 (a, b, c) values ('b', 17, 105);
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96;
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+select a, b, c from part_c_1_100 order by 1, 2, 3;
+select a, b, c from part_c_100_200 order by 1, 2, 3;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+select tableoid::regclass partname, * from range_parted order by 1, 2, 3, 4;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+
+select tableoid::regclass partname, * from range_parted order by 1, 2, 3, 4;
 
 -- cleanup
-drop table range_parted;
+drop view upview;
+drop table mintab, range_parted;
+
+
+
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+
+-- Setup
+--------
+create table list_parted (a int, b int, c int) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+
+create table sub_part1(b int, c int, a int);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int, a int);
+alter table sub_parted attach partition sub_part2 for values in (2);
+
+create table list_part1(a int, b int, c int);
+alter table list_parted attach partition list_part1 for values in (2,3);
+
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+
+-- Test partition constraint violation when an intermediate ancestor is used
+-- and the constraint is inherited from the root of the partition tree.
+update sub_parted set a = 2 where c = 10;
+
+-- UPDATE which does not modify the partition key of the partitions chosen for update.
+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+
+
+-----------
+-- Triggers should not be allowed to initiate the update row movement
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing the partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+select tableoid::regclass , * from list_parted order by 1, 2, 3, 4;
+
+-- This should fail because the trigger on sub_part1 would change column 'b',
+-- which would violate the "b in (1)" constraint.
+update list_parted set c = 70 where b  = 1 ;
+drop trigger parted_mod_b ON sub_part1 ;
+-- Now that the trigger is dropped, the same update should succeed
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass , * from list_parted order by 1, 2, 3, 4;
+
+drop function func_parted_mod_b ( ) ;
+drop table list_parted;
#98Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Amit Khandekar (#96)
Re: UPDATE of partition key

On Fri, Jun 16, 2017 at 5:36 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

There is another issue I discovered. The row movement works fine if
the destination leaf partition has a different attribute ordering than
the root: the existing insert-tuple-routing mapping handles that. But
if the source partition has a different ordering w.r.t. the root, there
is a problem: there is no mapping in the opposite direction, i.e. from
the leaf to the root. And we require that, because the tuple of the
source leaf partition needs to be converted to the root partition's
tuple descriptor, since ExecFindPartition() starts from the root.

To fix this, I have introduced another mapping array,
mtstate->mt_resultrel_maps[]. This corresponds to
mtstate->resultRelInfo[]. We don't require a per-leaf-partition
mapping, because the update result relations are a pruned subset of the
total leaf partitions.

Hi Amit & Amit,

Just a thought: If I understand correctly this new array of tuple
conversion maps is the same as mtstate->mt_transition_tupconv_maps in
my patch transition-tuples-from-child-tables-v11.patch (hopefully soon
to be committed to close a PG10 open item). In my patch I bounce
transition tuples from child relations up to the named relation's
triggers, and in this patch you bounce child tuples up to the named
relation for rerouting, so the conversion requirement is the same.
Perhaps we could consider refactoring to build a common struct member
on demand for the row movement patch at some point in the future if it
makes the code cleaner.

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#99Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#96)
Re: UPDATE of partition key

On Thu, Jun 15, 2017 at 1:36 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Attached patch v10 fixes the above. In the existing code, where it
builds WCO constraints for each leaf partition; with the patch, that
code now is applicable to row-movement-updates as well.

I guess I don't see why it should work like this. In the INSERT case,
we must build withCheckOption objects for each partition because those
partitions don't appear in the plan otherwise -- but in the UPDATE
case, they're already there, so why do we need to build anything at
all? Similarly for RETURNING projections. How are the things we need
for those cases not already getting built, associated with the
relevant resultRelInfos? Maybe there's a concern if some children got
pruned - they could turn out later to be the children into which
tuples need to be routed. But the patch makes no distinction between
possibly-pruned children and any others.

There is another issue I discovered. The row movement works fine if
the destination leaf partition has a different attribute ordering than
the root: the existing insert-tuple-routing mapping handles that. But
if the source partition has a different ordering w.r.t. the root, there
is a problem: there is no mapping in the opposite direction, i.e. from
the leaf to the root. And we require that, because the tuple of the
source leaf partition needs to be converted to the root partition's
tuple descriptor, since ExecFindPartition() starts from the root.

Seems reasonable, but...

To fix this, I have introduced another mapping array,
mtstate->mt_resultrel_maps[]. This corresponds to
mtstate->resultRelInfo[]. We don't require a per-leaf-partition
mapping, because the update result relations are a pruned subset of the
total leaf partitions.

... I don't understand how you can *not* need a per-leaf-partition
mapping. I mean, maybe you only need the mapping for the *unpruned*
leaf partitions but you certainly need a separate mapping for each one
of those.

It's possible to imagine driving the tuple routing off of just the
partition key attributes, extracted from wherever they are inside the
tuple at the current level, rather than converting to the root's tuple
format. However, that's not totally straightforward because there
could be multiple levels of partitioning throughout the tree and
different attributes might be needed at different levels. Moreover,
in most cases, the mappings are going to end up being no-ops because
the column order will be the same, so it's probably not worth
complicating the code to try to avoid a double conversion that usually
won't happen.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#100Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#99)
Re: UPDATE of partition key

On 20 June 2017 at 03:42, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

Just a thought: If I understand correctly this new array of tuple
conversion maps is the same as mtstate->mt_transition_tupconv_maps in
my patch transition-tuples-from-child-tables-v11.patch (hopefully soon
to be committed to close a PG10 open item). In my patch I bounce
transition tuples from child relations up to the named relation's
triggers, and in this patch you bounce child tuples up to the named
relation for rerouting, so the conversion requirement is the same.
Perhaps we could consider refactoring to build a common struct member
on demand for the row movement patch at some point in the future if it
makes the code cleaner.

I agree; thanks for bringing this to my attention. The conversion maps
in my patch and yours do sound like they are exactly the same. And even
in the case where both update-row-movement and transition tables are
playing together, the same map should serve the purpose of both. I will keep a
watch on your patch, and check how I can adjust my patch so that I
don't have to refactor the mapping.

One difference I see is : in your patch, in ExecModifyTable() we jump
the current map position for each successive subplan, whereas in my
patch, in ExecInsert() we deduce the position of the right map to be
fetched using the position of the current resultRelInfo in the
mtstate->resultRelInfo[] array. I think your way is more consistent
with the existing code.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

#101Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#99)
Re: UPDATE of partition key

On 20 June 2017 at 03:46, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jun 15, 2017 at 1:36 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Attached patch v10 fixes the above. In the existing code, where it
builds WCO constraints for each leaf partition; with the patch, that
code now is applicable to row-movement-updates as well.

I guess I don't see why it should work like this. In the INSERT case,
we must build withCheckOption objects for each partition because those
partitions don't appear in the plan otherwise -- but in the UPDATE
case, they're already there, so why do we need to build anything at
all? Similarly for RETURNING projections. How are the things we need
for those cases not already getting built, associated with the
relevant resultRelInfos? Maybe there's a concern if some children got
pruned - they could turn out later to be the children into which
tuples need to be routed. But the patch makes no distinction
between possibly-pruned children and any others.

Yes, only a subset of the partitions appear in the UPDATE subplans. I
think typically for updates, a very small subset of the total leaf
partitions will be there in the plans, others would get pruned. IMHO,
it would not be worth having an optimization where it opens only those
leaf partitions which are not already there in the subplans. Without
the optimization, we are able to re-use the INSERT infrastructure
without additional changes.

There is another issue I discovered. The row-movement works fine if
the destination leaf partition has different attribute ordering than
the root : the existing insert-tuple-routing mapping handles that. But
if the source partition has different ordering w.r.t. the root, it has
a problem : there is no mapping in the opposite direction, i.e. from
the leaf to root. And we require that because the tuple of source leaf
partition needs to be converted to root partition tuple descriptor,
since ExecFindPartition() starts with root.

Seems reasonable, but...

To fix this, I have introduced another mapping array
mtstate->mt_resultrel_maps[]. This corresponds to the
mtstate->resultRelInfo[]. We don't require per-leaf-partition mapping,
because the update result relations are pruned subset of the total
leaf partitions.

... I don't understand how you can *not* need a per-leaf-partition
mapping. I mean, maybe you only need the mapping for the *unpruned*
leaf partitions

Yes, we need the mapping only for the unpruned leaf partitions, and
those partitions are available in the per-subplan resultRelInfo's.

but you certainly need a separate mapping for each one of those.

You mean *each* of the leaf partitions ? I didn't get why we would
need it for each one. The tuple targeted for update belongs to one of
the per-subplan resultRelInfos. And this tuple is to be routed to another
leaf partition. So the reverse mapping is for conversion from the
source resultRelInfo to the root partition. I am unable to figure out
a scenario where we would require this reverse mapping for partitions
on which UPDATE is *not* going to be executed.

It's possible to imagine driving the tuple routing off of just the
partition key attributes, extracted from wherever they are inside the
tuple at the current level, rather than converting to the root's tuple
format. However, that's not totally straightforward because there
could be multiple levels of partitioning throughout the tree and
different attributes might be needed at different levels.

Yes, the conversion anyway occurs at each of these levels even for
insert, specifically because there can be different partition
attributes each time. For update, it's only one additional conversion.
But yes, this new mapping would be required for this one single
conversion.

Moreover,
in most cases, the mappings are going to end up being no-ops because
the column order will be the same, so it's probably not worth
complicating the code to try to avoid a double conversion that usually
won't happen.

I agree.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

#102Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#101)
Re: UPDATE of partition key

On Tue, Jun 20, 2017 at 2:54 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

I guess I don't see why it should work like this. In the INSERT case,
we must build withCheckOption objects for each partition because those
partitions don't appear in the plan otherwise -- but in the UPDATE
case, they're already there, so why do we need to build anything at
all? Similarly for RETURNING projections. How are the things we need
for those cases not already getting built, associated with the
relevant resultRelInfos? Maybe there's a concern if some children got
pruned - they could turn out later to be the children into which
tuples need to be routed. But the patch makes no distinction
between possibly-pruned children and any others.

Yes, only a subset of the partitions appear in the UPDATE subplans. I
think typically for updates, a very small subset of the total leaf
partitions will be there in the plans, others would get pruned. IMHO,
it would not be worth having an optimization where it opens only those
leaf partitions which are not already there in the subplans. Without
the optimization, we are able to re-use the INSERT infrastructure
without additional changes.

Well, that is possible, but certainly not guaranteed. I mean,
somebody could do a whole-table UPDATE, or an UPDATE that hits a
smattering of rows in every partition; e.g. the table is partitioned
on order number, and you do UPDATE lineitem SET product_code = 'K372B'
WHERE product_code = 'K372'.

Leaving that aside, the point here is that you're rebuilding
withCheckOptions and returningLists that have already been built in
the planner. That's bad for two reasons. First, it's inefficient,
especially if there are many partitions. Second, it will amount to a
functional bug if you get a different answer than the planner did.
Note this comment in the existing code:

/*
* Build WITH CHECK OPTION constraints for each leaf partition rel. Note
* that we didn't build the withCheckOptionList for each partition within
* the planner, but simple translation of the varattnos for each partition
* will suffice. This only occurs for the INSERT case; UPDATE/DELETE
* cases are handled above.
*/

The comment "UPDATE/DELETE cases are handled above" is referring to
the code that initializes the WCOs generated by the planner. You've
modified the comment in your patch, but not the associated code: your
updated comment says that only "DELETEs and local UPDATES are handled
above", but in reality, *all* updates are still handled above. And
then they are handled again here. Similarly for returning lists.
It's certainly not OK for the comment to be inaccurate, but I think
it's also bad to redo the work which the planner has already done,
even if it makes the patch smaller.

Also, I feel like it's probably not correct to use the first result
relation as the nominal relation for building WCOs and returning lists
anyway. I mean, if the first result relation has a different column
order than the parent relation, isn't this just broken? If it works
for some reason, the comments don't explain what that reason is.

... I don't understand how you can *not* need a per-leaf-partition
mapping. I mean, maybe you only need the mapping for the *unpruned*
leaf partitions

Yes, we need the mapping only for the unpruned leaf partitions, and
those partitions are available in the per-subplan resultRelInfo's.

OK.

but you certainly need a separate mapping for each one of those.

You mean *each* of the leaf partitions ? I didn't get why we would
need it for each one. The tuple targeted for update belongs to one of
the per-subplan resultRelInfos. And this tuple is to be routed to another
leaf partition. So the reverse mapping is for conversion from the
source resultRelInfo to the root partition. I am unable to figure out
a scenario where we would require this reverse mapping for partitions
on which UPDATE is *not* going to be executed.

I agree - the reverse mapping is only needed for the partitions in
which UPDATE will be executed.

Some other things:

+             * The row was already deleted by a concurrent DELETE. So we don't
+             * have anything to update.

I find this explanation, and the surrounding comments, inadequate. It
doesn't really explain why we're doing this. I think it should say
something like this: For a normal UPDATE, the case where the tuple has
been the subject of a concurrent UPDATE or DELETE would be handled by
the EvalPlanQual machinery, but for an UPDATE that we've translated
into a DELETE from this partition and an INSERT into some other
partition, that's not available, because CTID chains can't span
relation boundaries. We mimic the semantics to a limited extent by
skipping the INSERT if the DELETE fails to find a tuple. This ensures
that two concurrent attempts to UPDATE the same tuple at the same time
can't turn one tuple into two, and that an UPDATE of a just-deleted
tuple can't resurrect it.

+            bool        partition_check_passed_with_trig_tuple;
+
+            partition_check_passed =
+                (resultRelInfo->ri_PartitionCheck &&
+                 ExecPartitionCheck(resultRelInfo, slot, estate));
+
+            partition_check_passed_with_trig_tuple =
+                (resultRelInfo->ri_PartitionCheck &&
+                 ExecPartitionCheck(resultRelInfo, trig_slot, estate));
+            if (partition_check_passed)
+            {
+                /*
+                 * If it's the trigger that is causing partition constraint
+                 * violation, abort. We don't want a trigger to cause tuple
+                 * routing.
+                 */
+                if (!partition_check_passed_with_trig_tuple)
+                    ExecPartitionCheckEmitError(resultRelInfo,
+                                                trig_slot, estate);
+            }
+            else
+            {
+                /*
+                 * Partition constraint failed with original NEW tuple. But the
+                 * trigger might even have modified the tuple such that it fits
+                 * back into the partition. So partition constraint check
+                 * should be based on *final* NEW tuple.
+                 */
+                partition_check_passed =
+                    partition_check_passed_with_trig_tuple;
+            }

Maybe I inadvertently gave the contrary impression in some prior
review, but this logic doesn't seem right to me. I don't think
there's any problem with a BR UPDATE trigger causing tuple routing.
What I want to avoid is repeatedly rerouting the same tuple, but I
don't think that could happen even without this guard. We've now fixed
insert tuple routing so that a BR INSERT trigger can't cause the
partition constraint to be violated (cf. commit
15ce775faa428dc91027e4e2d6b7a167a27118b5) and there's no way for
update tuple routing to trigger additional BR UPDATE triggers. So I
don't see the point of checking the constraints twice here. I think
what you want to do is get rid of all the changes here and instead
adjust the logic just before ExecConstraints() to invoke
ExecPartitionCheck() on the post-trigger version of the tuple.

Parenthetically, if we decided to keep this logic as you have it, the
code that sets partition_check_passed and
partition_check_passed_with_trig_tuple doesn't need to check
resultRelInfo->ri_PartitionCheck because the surrounding "if" block
already did.

+    for (i = 0; i < num_rels; i++)
+    {
+        ResultRelInfo *resultRelInfo = &result_rels[i];
+        Relation        rel = resultRelInfo->ri_RelationDesc;
+        Bitmapset     *expr_attrs = NULL;
+
+        pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs);
+
+        /* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */
+        if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate)))
+            return true;
+    }

This seems like an awfully expensive way of performing this test.
Under what circumstances could this be true for some result relations
and false for others; or in other words, why do we have to loop over
all of the result relations? It seems to me that the user has typed
something like:

UPDATE whatever SET thingy = ..., whatsit = ... WHERE whatever = ...
AND thunk = ...

If either thingy or whatsit is a partitioning column, UPDATE tuple
routing might be needed - and it should be able to test that by a
*single* comparison between the set of columns being updated and the
partitioning columns, without needing to repeat it for every partition.
Perhaps that test needs to be done at plan time and saved in the plan,
rather than performed here -- or maybe it's easy enough to do it here.

One problem is that, if BR UPDATE triggers are in fact allowed to
cause tuple routing as I proposed above, the presence of a BR UPDATE
trigger for any partition could necessitate UPDATE tuple routing for
queries that wouldn't otherwise need it. But even if you end up
inserting a test for that case, it can surely be a lot cheaper than
this, since it only involves checking a boolean flag, not a bitmapset.
It could be argued that we ought to prohibit BR UPDATE triggers from
causing tuple routing so that we don't have to do this test at all,
but I'm not sure that's a good trade-off. It seems to necessitate
checking the partition constraint twice per tuple instead of once per
tuple, which seems like a very heavy price.

+#define GetUpdatedColumns(relinfo, estate) \
+    (rt_fetch((relinfo)->ri_RangeTableIndex, \
+              (estate)->es_range_table)->updatedCols)

I think this should be moved to a header file (and maybe turned into a
static inline function) rather than copy-pasting the definition into a
new file.

-            List       *mapped_wcoList;
+            List       *mappedWco;
             List       *wcoExprs = NIL;
             ListCell   *ll;
-            /* varno = node->nominalRelation */
-            mapped_wcoList = map_partition_varattnos(wcoList,
-                                                     node->nominalRelation,
-                                                     partrel, rel);
-            foreach(ll, mapped_wcoList)
+            mappedWco = map_partition_varattnos(firstWco, firstVarno,
+                                                partrel, firstResultRel);
+            foreach(ll, mappedWco)
             {
                 WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
                 ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-                                                   plan);
+                                                   &mtstate->ps);

wcoExprs = lappend(wcoExprs, wcoExpr);
}

-            resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
+            resultRelInfo->ri_WithCheckOptions = mappedWco;

Renaming the variable looks fairly pointless, unless I'm missing something?

Regarding the tests, it seems like you've got a test case where you
update a sub-partition and it fails because the tuple would need to be
moved out of a sub-tree, which is good. But I think it would also be
good to have a case where you update a sub-partition and it succeeds
in moving the tuple within the subtree. I don't see one like that
presently; it seems all the others update the topmost root or the
leaf. I also think it would be a good idea to make sub_parted's
column order different from both list_parted and its own children, and
maybe use a diversity of data types (e.g. int4, int8, text instead of
making everything int).

+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass , * from list_parted where a = 2 order by 1;

The extra space before the comma looks strange.

Also, please make a habit of checking patches for whitespace errors
using git diff --check.

[rhaas pgsql]$ git diff --check
src/backend/executor/nodeModifyTable.c:384: indent with spaces.
+                        tuple, &slot);
src/backend/executor/nodeModifyTable.c:1966: space before tab in indent.
+                IsPartitionKeyUpdate(estate, mtstate->resultRelInfo, nplans));

You will notice these kinds of things if you read the diff you are
submitting before you press send, because git highlights them in
bright red. That's a good practice for many other reasons, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#103Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Robert Haas (#102)
Re: UPDATE of partition key

On 2017/06/21 3:53, Robert Haas wrote:

On Tue, Jun 20, 2017 at 2:54 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

I guess I don't see why it should work like this. In the INSERT case,
we must build withCheckOption objects for each partition because those
partitions don't appear in the plan otherwise -- but in the UPDATE
case, they're already there, so why do we need to build anything at
all? Similarly for RETURNING projections. How are the things we need
for those cases not already getting built, associated with the
relevant resultRelInfos? Maybe there's a concern if some children got
pruned - they could turn out later to be the children into which
tuples need to be routed. But the patch makes no distinction
between possibly-pruned children and any others.

Yes, only a subset of the partitions appear in the UPDATE subplans. I
think typically for updates, a very small subset of the total leaf
partitions will be there in the plans, others would get pruned. IMHO,
it would not be worth having an optimization where it opens only those
leaf partitions which are not already there in the subplans. Without
the optimization, we are able to re-use the INSERT infrastructure
without additional changes.

Well, that is possible, but certainly not guaranteed. I mean,
somebody could do a whole-table UPDATE, or an UPDATE that hits a
smattering of rows in every partition; e.g. the table is partitioned
on order number, and you do UPDATE lineitem SET product_code = 'K372B'
WHERE product_code = 'K372'.

Leaving that aside, the point here is that you're rebuilding
withCheckOptions and returningLists that have already been built in
the planner. That's bad for two reasons. First, it's inefficient,
especially if there are many partitions. Second, it will amount to a
functional bug if you get a different answer than the planner did.
Note this comment in the existing code:

/*
* Build WITH CHECK OPTION constraints for each leaf partition rel. Note
* that we didn't build the withCheckOptionList for each partition within
* the planner, but simple translation of the varattnos for each partition
* will suffice. This only occurs for the INSERT case; UPDATE/DELETE
* cases are handled above.
*/

The comment "UPDATE/DELETE cases are handled above" is referring to
the code that initializes the WCOs generated by the planner. You've
modified the comment in your patch, but not the associated code: your
updated comment says that only "DELETEs and local UPDATES are handled
above", but in reality, *all* updates are still handled above. And
then they are handled again here. Similarly for returning lists.
It's certainly not OK for the comment to be inaccurate, but I think
it's also bad to redo the work which the planner has already done,
even if it makes the patch smaller.

I guess this has to do with the UPDATE turning into DELETE+INSERT. So, it
seems like WCOs are being initialized for the leaf partitions
(ResultRelInfos in the mt_partitions array) that are in turn
initialized for the aforementioned INSERT. That's why the term "...local
UPDATEs" in the new comment text.

If that's true, I wonder if it makes sense to apply what would be
WCO_RLS_UPDATE_CHECK to a leaf partition that the tuple will be moved into
by calling ExecInsert()?

Also, I feel like it's probably not correct to use the first result
relation as the nominal relation for building WCOs and returning lists
anyway. I mean, if the first result relation has a different column
order than the parent relation, isn't this just broken? If it works
for some reason, the comments don't explain what that reason is.

Yep, it's more appropriate to use
ModifyTableState->rootResultRelationInfo->ri_RelationDesc somehow. That
is, if the answer to the question I raised above is positive.

Thanks,
Amit

#104Robert Haas
robertmhaas@gmail.com
In reply to: Amit Langote (#103)
Re: UPDATE of partition key

On Wed, Jun 21, 2017 at 5:28 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

The comment "UPDATE/DELETE cases are handled above" is referring to

the code that initializes the WCOs generated by the planner. You've
modified the comment in your patch, but not the associated code: your
updated comment says that only "DELETEs and local UPDATES are handled
above", but in reality, *all* updates are still handled above. And
then they are handled again here. Similarly for returning lists.
It's certainly not OK for the comment to be inaccurate, but I think
it's also bad to redo the work which the planner has already done,
even if it makes the patch smaller.

I guess this has to do with the UPDATE turning into DELETE+INSERT. So, it
seems like WCOs are being initialized for the leaf partitions
(ResultRelInfos in the mt_partitions array) that are in turn
initialized for the aforementioned INSERT. That's why the term "...local
UPDATEs" in the new comment text.

If that's true, I wonder if it makes sense to apply what would be
WCO_RLS_UPDATE_CHECK to a leaf partition that the tuple will be moved into
by calling ExecInsert()?

I think we probably should apply the insert policy, just as we're
executing the insert trigger.

Also, I feel like it's probably not correct to use the first result
relation as the nominal relation for building WCOs and returning lists
anyway. I mean, if the first result relation has a different column
order than the parent relation, isn't this just broken? If it works
for some reason, the comments don't explain what that reason is.

Yep, it's more appropriate to use
ModifyTableState->rootResultRelationInfo->ri_RelationDesc somehow. That
is, if the answer to the question I raised above is positive.

The questions appear to me to be independent.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#105Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#102)
Re: UPDATE of partition key

On 21 June 2017 at 00:23, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jun 20, 2017 at 2:54 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

I guess I don't see why it should work like this. In the INSERT case,
we must build withCheckOption objects for each partition because those
partitions don't appear in the plan otherwise -- but in the UPDATE
case, they're already there, so why do we need to build anything at
all? Similarly for RETURNING projections. How are the things we need
for those cases not already getting built, associated with the
relevant resultRelInfos? Maybe there's a concern if some children got
pruned - they could turn out later to be the children into which
tuples need to be routed. But the patch makes no distinction
between possibly-pruned children and any others.

Yes, only a subset of the partitions appear in the UPDATE subplans. I
think typically for updates, a very small subset of the total leaf
partitions will be there in the plans, others would get pruned. IMHO,
it would not be worth having an optimization where it opens only those
leaf partitions which are not already there in the subplans. Without
the optimization, we are able to re-use the INSERT infrastructure
without additional changes.

Well, that is possible, but certainly not guaranteed. I mean,
somebody could do a whole-table UPDATE, or an UPDATE that hits a
smattering of rows in every partition;

I am not saying that it's guaranteed to be a small subset. I am saying
that it would typically be a small subset for the
update-of-partition-key case. It seems unlikely that a user would
cause row movement across multiple partitions at the same time.
Generally it would be an administrative task where some or all of the
rows of a partition need their partition key updated in a way that
causes them to change their partition, and so there would probably be
a WHERE clause that narrows the update down to that particular
partition, because without the WHERE clause the update is slower
anyway and it's redundant to scan all the other partitions.

But, point taken, that there can always be certain cases involving
multiple table partition-key updates.

e.g. the table is partitioned on order number, and you do UPDATE
lineitem SET product_code = 'K372B' WHERE product_code = 'K372'.

This query does not update order number, so here there is no
partition-key-update. Are you thinking that the patch is generating
the per-leaf-partition WCO expressions even for an update not involving
a partition key?

Leaving that aside, the point here is that you're rebuilding
withCheckOptions and returningLists that have already been built in
the planner. That's bad for two reasons. First, it's inefficient,
especially if there are many partitions.

Yeah, I agree that this becomes more and more redundant if the update
involves more partitions.

Second, it will amount to a functional bug if you get a
different answer than the planner did.

Actually, the per-leaf WCOs are meant to be executed on the
destination partitions where the tuple is moved, while the WCOs
belonging to the per-subplan resultRelInfo are meant for the
resultRelInfos used for the UPDATE plans. So actually it should not
matter whether they look the same or different, because they are fired at
different objects. Now these objects can happen to be the same
relations though.

But in any case, it's not clear to me how the mapped WCO and the
planner's WCO would yield a different answer if they are both the same
relation. I am possibly missing something. The planner has already
generated the withCheckOptions for each of the resultRelInfo. And then
we are using one of those to re-generate the WCO for a leaf partition
by only adjusting the attnos. If there is already a WCO generated in
the planner for that leaf partition (because that partition was
present in mtstate->resultRelInfo), then the re-built WCO should look
exactly the same as the earlier one, because they are the same
relations, and so the attnos generated in them would be the same since
the Relation TupleDesc is the same.

Note this comment in the existing code:

/*
* Build WITH CHECK OPTION constraints for each leaf partition rel. Note
* that we didn't build the withCheckOptionList for each partition within
* the planner, but simple translation of the varattnos for each partition
* will suffice. This only occurs for the INSERT case; UPDATE/DELETE
* cases are handled above.
*/

The comment "UPDATE/DELETE cases are handled above" is referring to
the code that initializes the WCOs generated by the planner. You've
modified the comment in your patch, but not the associated code: your
updated comment says that only "DELETEs and local UPDATES are handled
above", but in reality, *all* updates are still handled above. And

Actually I meant: "the above works only for local updates. For
row-movement updates, we need per-leaf-partition WCOs, because when
the row is inserted into the target partition, that partition may not
be included in the above planner resultRelInfo, so we need WCOs for
all partitions". I think a comment to that effect should be sufficient
if I add it in the code?

then they are handled again here.
Similarly for returning lists.
It's certainly not OK for the comment to be inaccurate, but I think
it's also bad to redo the work which the planner has already done,
even if it makes the patch smaller.

Also, I feel like it's probably not correct to use the first result
relation as the nominal relation for building WCOs and returning lists
anyway. I mean, if the first result relation has a different column
order than the parent relation, isn't this just broken? If it works
for some reason, the comments don't explain what that reason is.

Not sure why the parent relation should come into the picture. As long as the
first result relation belongs to one of the partitions in the whole
partition tree, we should be able to use that to build WCOs of any
other partitions, because they have a common set of attributes having
the same name. So we are bound to find each of the attributes of first
resultRelInfo in the other leaf partitions during attno mapping.

Some other things:

+             * The row was already deleted by a concurrent DELETE. So we don't
+             * have anything to update.

I find this explanation, and the surrounding comments, inadequate. It
doesn't really explain why we're doing this. I think it should say
something like this: For a normal UPDATE, the case where the tuple has
been the subject of a concurrent UPDATE or DELETE would be handled by
the EvalPlanQual machinery, but for an UPDATE that we've translated
into a DELETE from this partition and an INSERT into some other
partition, that's not available, because CTID chains can't span
relation boundaries. We mimic the semantics to a limited extent by
skipping the INSERT if the DELETE fails to find a tuple. This ensures
that two concurrent attempts to UPDATE the same tuple at the same time
can't turn one tuple into two, and that an UPDATE of a just-deleted
tuple can't resurrect it.

Thanks, will put that comment in the next patch.

+            bool        partition_check_passed_with_trig_tuple;
+
+            partition_check_passed =
+                (resultRelInfo->ri_PartitionCheck &&
+                 ExecPartitionCheck(resultRelInfo, slot, estate));
+
+            partition_check_passed_with_trig_tuple =
+                (resultRelInfo->ri_PartitionCheck &&
+                 ExecPartitionCheck(resultRelInfo, trig_slot, estate));
+            if (partition_check_passed)
+            {
+                /*
+                 * If it's the trigger that is causing partition constraint
+                 * violation, abort. We don't want a trigger to cause tuple
+                 * routing.
+                 */
+                if (!partition_check_passed_with_trig_tuple)
+                    ExecPartitionCheckEmitError(resultRelInfo,
+                                                trig_slot, estate);
+            }
+            else
+            {
+                /*
+                 * Partition constraint failed with original NEW tuple. But the
+                 * trigger might have modified the tuple such that it fits
+                 * back into the partition. So partition constraint check
+                 * should be based on *final* NEW tuple.
+                 */
+                partition_check_passed =
+                    partition_check_passed_with_trig_tuple;
+            }

Maybe I inadvertently gave the contrary impression in some prior
review, but this logic doesn't seem right to me. I don't think
there's any problem with a BR UPDATE trigger causing tuple routing.
What I want to avoid is repeatedly rerouting the same tuple, but I
don't think that could happen even without this guard. We've now fixed
insert tuple routing so that a BR INSERT trigger can't cause the
partition constraint to be violated (cf. commit
15ce775faa428dc91027e4e2d6b7a167a27118b5) and there's no way for
update tuple routing to trigger additional BR UPDATE triggers. So I
don't see the point of checking the constraints twice here. I think
what you want to do is get rid of all the changes here and instead
adjust the logic just before ExecConstraints() to invoke
ExecPartitionCheck() on the post-trigger version of the tuple.
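
In other words, the proposed control flow (a rough Python sketch; all names here are invented for illustration, not actual executor functions) fires the BR trigger first and then checks the partition constraint exactly once, on the final tuple, routing only if that single check fails:

```python
def exec_update_with_trigger(tup, br_update_trigger, partition_check,
                             update_in_place, route_to_other_partition):
    """Sketch: the partition constraint is checked exactly once, on the
    post-trigger version of the tuple, instead of once for the original
    NEW tuple and again for the trigger-modified one."""
    if br_update_trigger is not None:
        tup = br_update_trigger(tup)       # trigger may modify the tuple
        if tup is None:
            return None                    # trigger suppressed the update
    if partition_check(tup):
        return update_in_place(tup)        # still fits this partition
    return route_to_other_partition(tup)   # DELETE here, INSERT elsewhere
```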

When I came up with this code, the intention was to make sure a BR
UPDATE trigger does not cause tuple routing. But yeah, I can't recall
what made me think the above changes would be needed to prevent a BR
UPDATE trigger from causing tuple routing. With the latest code, it
indeed looks like we can get rid of these changes and still prevent
that.

BTW, that code was not to avoid repeated re-routing.

Above, you seem to say that there's no problem with a BR UPDATE
trigger causing tuple routing. But when none of the partition-key
columns are used in the UPDATE, we don't set up for update tuple
routing at all, so tuple routing will not occur even if a BR UPDATE
trigger modifies the tuple in a way that would require it. This is one
restriction we have to live with, because we decide beforehand whether
to do the tuple-routing setup based on the columns modified in the
UPDATE query.

Parenthetically, if we decided to keep this logic as you have it, the
code that sets partition_check_passed and
partition_check_passed_with_trig_tuple doesn't need to check
resultRelInfo->ri_PartitionCheck because the surrounding "if" block
already did.

Yes.

+    for (i = 0; i < num_rels; i++)
+    {
+        ResultRelInfo *resultRelInfo = &result_rels[i];
+        Relation        rel = resultRelInfo->ri_RelationDesc;
+        Bitmapset     *expr_attrs = NULL;
+
+        pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs);
+
+        /* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */
+        if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate)))
+            return true;
+    }

This seems like an awfully expensive way of performing this test.
Under what circumstances could this be true for some result relations
and false for others;

One resultRelInfo may have no partition-key column used in its quals,
while the next resultRelInfo can have quite different quals that do
refer to a partition key. This is possible if the two of them have
different parents with different partition-key columns.

or in other words, why do we have to loop over all of the result
relations? It seems to me that the user has typed something like:

UPDATE whatever SET thingy = ..., whatsit = ... WHERE whatever = ...
AND thunk = ...

If either thingy or whatsit is a partitioning column, UPDATE tuple
routing might be needed

So, in the above code, bms_overlap() would return true if either
thingy or whatsit is a partitioning column.

- and it should be able to test that by a
*single* comparison between the set of columns being updated and the
partitioning columns, without needing to repeat for every partitions.

If bms_overlap() returns true for the very first resultRelInfo, it
will return immediately. But yes, if no relation uses a partition-key
column, we will have to scan all of these relations. Note, though,
that these are the pruned leaf partitions; they typically will not
include all the leaf partitions.

Perhaps that test needs to be done at plan time and saved in the plan,
rather than performed here -- or maybe it's easy enough to do it here.

Hmm, it looks convenient here because mtstate->resultRelInfo gets set only here.

One problem is that, if BR UPDATE triggers are in fact allowed to
cause tuple routing as I proposed above, the presence of a BR UPDATE
trigger for any partition could necessitate UPDATE tuple routing for
queries that wouldn't otherwise need it.

You mean always set up update tuple routing if there's a BR UPDATE
trigger? Actually I was going for disallowing a BR UPDATE trigger from
initiating tuple routing, as I described above.

But even if you end up
inserting a test for that case, it can surely be a lot cheaper than
this,

I didn't quite get why the bms_overlap() test needs to be
compared with the presence-of-trigger test.

since it only involves checking a boolean flag, not a bitmapset.
It could be argued that we ought to prohibit BR UPDATE triggers from
causing tuple routing so that we don't have to do this test at all,
but I'm not sure that's a good trade-off.
It seems to necessitate checking the partition constraint twice per
tuple instead of once per tuple, which seems like a very heavy price.

I think I didn't quite understand this paragraph as a whole. Can you
state the trade-off here again?

+#define GetUpdatedColumns(relinfo, estate) \
+    (rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)

I think this should be moved to a header file (and maybe turned into a
static inline function) rather than copy-pasting the definition into a
new file.

Will do that.

-            List       *mapped_wcoList;
+            List       *mappedWco;
List       *wcoExprs = NIL;
ListCell   *ll;
-            /* varno = node->nominalRelation */
-            mapped_wcoList = map_partition_varattnos(wcoList,
-                                                     node->nominalRelation,
-                                                     partrel, rel);
-            foreach(ll, mapped_wcoList)
+            mappedWco = map_partition_varattnos(firstWco, firstVarno,
+                                                partrel, firstResultRel);
+            foreach(ll, mappedWco)
{
WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-                                                   plan);
+                                                   &mtstate->ps);

wcoExprs = lappend(wcoExprs, wcoExpr);
}

-            resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
+            resultRelInfo->ri_WithCheckOptions = mappedWco;

Renaming the variable looks fairly pointless, unless I'm missing something?

We are converting from firstWco to mappedWco. So firstWco => mappedWco
looks like a more natural pairing than firstWco => mapped_wcoList.

And I renamed wcoList to firstWco because I wanted to emphasize that
it is the first WCO list out of node->withCheckOptionLists. In the
existing code, which was only for INSERT, withCheckOptionLists was a
single-element list, so the name firstWco didn't seem suitable; but
with multiple elements, naming it firstWco emphasizes that we take the
first one irrespective of whether the operation is an UPDATE or an
INSERT.

Regarding the tests, it seems like you've got a test case where you
update a sub-partition and it fails because the tuple would need to be
moved out of a sub-tree, which is good. But I think it would also be
good to have a case where you update a sub-partition and it succeeds
in moving the tuple within the subtree. I don't see one like that
presently; it seems all the others update the topmost root or the
leaf. I also think it would be a good idea to make sub_parted's
column order different from both list_parted and its own children, and
maybe use a diversity of data types (e.g. int4, int8, text instead of
making everything int).

+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass , * from list_parted where a = 2 order by 1;

The extra space before the comma looks strange.

Will do the above changes, thanks.

Also, please make a habit of checking patches for whitespace errors
using git diff --check.

[rhaas pgsql]$ git diff --check
src/backend/executor/nodeModifyTable.c:384: indent with spaces.
+                        tuple, &slot);
src/backend/executor/nodeModifyTable.c:1966: space before tab in indent.
+                IsPartitionKeyUpdate(estate, mtstate->resultRelInfo, nplans));

You will notice these kinds of things if you read the diff you are
submitting before you press send, because git highlights them in
bright red. That's a good practice for many other reasons, too.

Yeah, I think I missed these because I checked only the incremental
diff against the earlier patch, which is where I must have introduced
them. Your point is well taken that we should make it a habit to check
the complete patch with the --check option, or to apply it ourselves.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#106Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#104)
Re: UPDATE of partition key

On 21 June 2017 at 20:14, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jun 21, 2017 at 5:28 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

The comment "UPDATE/DELETE cases are handled above" is referring to
the code that initializes the WCOs generated by the planner. You've
modified the comment in your patch, but the associated code: your
updated comment says that only "DELETEs and local UPDATES are handled
above", but in reality, *all* updates are still handled above. And
then they are handled again here. Similarly for returning lists.
It's certainly not OK for the comment to be inaccurate, but I think
it's also bad to redo the work which the planner has already done,
even if it makes the patch smaller.

I guess this has to do with the UPDATE turning into DELETE+INSERT. So, it
seems like WCOs are being initialized for the leaf partitions
(ResultRelInfos in the mt_partitions array) that are in turn
initialized for the aforementioned INSERT. That's why the term "...local
UPDATEs" appears in the new comment text.

If that's true, I wonder if it makes sense to apply what would be
WCO_RLS_UPDATE_CHECK to a leaf partition that the tuple will be moved into
by calling ExecInsert()?

I think we probably should apply the insert policy, just as we're
executing the insert trigger.

Yes, during tuple routing the RLS quals should execute according to
whether the operation is still an update or has been converted to an
insert. I think the tests don't quite cover the insert part. Will check.

Also, I feel like it's probably not correct to use the first result
relation as the nominal relation for building WCOs and returning lists
anyway. I mean, if the first result relation has a different column
order than the parent relation, isn't this just broken? If it works
for some reason, the comments don't explain what that reason is.

Yep, it's more appropriate to use
ModifyTableState->rootResultRelationInfo->ri_RelationDesc somehow. That
is, if the answer to the question I raised above is positive.

From what I had checked earlier when coding that part,
rootResultRelInfo is NULL in case of inserts, unless something has
changed in later commits. That's the reason I decided to use the first
resultRelInfo.

Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


#107Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#105)
Re: UPDATE of partition key

On Wed, Jun 21, 2017 at 1:37 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

e.g. the table is partitioned on order number, and you do UPDATE
lineitem SET product_code = 'K372B' WHERE product_code = 'K372'.

This query does not update order number, so here there is no
partition-key-update. Are you thinking that the patch is generating
the per-leaf-partition WCO expressions even for an update not involving
a partition key?

No, it just wasn't a great example. Sorry.

Second, it will amount to a functional bug if you get a
different answer than the planner did.

Actually, the per-leaf WCOs are meant to be executed on the
destination partitions where the tuple is moved, while the WCOs
belonging to the per-subplan resultRelInfo are meant for the
resultRelinfo used for the UPDATE plans. So actually it should not
matter whether they look same or different, because they are fired at
different objects. Now these objects can happen to be the same
relations though.

But in any case, it's not clear to me how the mapped WCO and the
planner's WCO would yield a different answer if they are both the same
relation. I am possibly missing something. The planner has already
generated the withCheckOptions for each of the resultRelInfo. And then
we are using one of those to re-generate the WCO for a leaf partition
by only adjusting the attnos. If there is already a WCO generated in
the planner for that leaf partition (because that partition was
present in mtstate->resultRelInfo), then the re-built WCO should look
exactly the same as the earlier one, because they are the same
relations, and so the attnos generated in them would be same since the
Relation TupleDesc is the same.

If the planner's WCOs and mapped WCOs are always the same, then I
think we should try to avoid generating both. If they can be
different, but that's intentional and correct, then there's no
substantive problem with the patch but the comments need to make it
clear why we are generating both.

Actually I meant, "above works for only local updates. For
row-movement-updates, we need per-leaf partition WCOs, because when
the row is inserted into the target partition, that partition may not
be included in the above planner resultRelInfo, so we need WCOs for
all partitions". I think this said comment should be sufficient if I
add this in the code ?

Let's not get too focused on updating the comment until we are in
agreement about what the code ought to be doing. I'm not clear
whether you accept the point that the patch needs to be changed to
avoid generating the same WCOs and returning lists in both the planner
and the executor.

Also, I feel like it's probably not correct to use the first result
relation as the nominal relation for building WCOs and returning lists
anyway. I mean, if the first result relation has a different column
order than the parent relation, isn't this just broken? If it works
for some reason, the comments don't explain what that reason is.

Not sure why the parent relation should come into the picture. As long as the
first result relation belongs to one of the partitions in the whole
partition tree, we should be able to use it to build the WCOs of any
other partition, because they all have a common set of attributes with
the same names. So we are bound to find each attribute of the first
resultRelInfo in the other leaf partitions during attno mapping.

Well, at least for returning lists, we've got to generate the
returning lists so that they all match the column order of the parent,
not the parent's first child. Otherwise, for example, UPDATE
parent_table ... RETURNING * will not work correctly. The tuples
returned by the returning clause have to have the attribute order of
parent_table, not the attribute order of parent_table's first child.
I'm not sure whether WCOs have the same issue, but it's not clear to
me why they wouldn't: they contain a qual which is an expression tree,
and presumably there are Var nodes in there someplace, and if so, then
they have varattnos that have to be right for the purpose for which
they're going to be used.

+    for (i = 0; i < num_rels; i++)
+    {
+        ResultRelInfo *resultRelInfo = &result_rels[i];
+        Relation        rel = resultRelInfo->ri_RelationDesc;
+        Bitmapset     *expr_attrs = NULL;
+
+        pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs);
+
+        /* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */
+        if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate)))
+            return true;
+    }

This seems like an awfully expensive way of performing this test.
Under what circumstances could this be true for some result relations
and false for others;

One resultRelInfo may have no partition-key column used in its quals,
while the next resultRelInfo can have quite different quals that do
refer to a partition key. This is possible if the two of them have
different parents with different partition-key columns.

Hmm, true. So if we have a table foo that is partitioned by list (a),
and one of its children is a table bar that is partitioned by list
(b), then we need to consider doing tuple-routing if either column a
is modified, or if column b is modified for a partition which is a
descendant of bar. But visiting that only requires looking at the
partitioned table and those children that are also partitioned, not
all of the leaf partitions as the patch does.
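
The cheaper test could look roughly like this (Python used only as pseudocode; the table representation and function name are invented): walk just the partitioned tables in the tree and intersect each one's partition-key columns with the set of updated columns, analogous to bms_overlap().

```python
def update_may_need_tuple_routing(partitioned_tables, updated_cols):
    """Sketch: scan only the partitioned (non-leaf) tables, not every
    pruned leaf partition, and report whether any partition key overlaps
    the columns the UPDATE modifies."""
    for table in partitioned_tables:
        if table["partkey_cols"] & updated_cols:   # cf. bms_overlap()
            return True
    return False
```

For a tree where foo is partitioned by (a) and its child bar by (b), only foo and bar need visiting, however many leaf partitions exist. (This sketch ignores the refinement that an update of b only matters for descendants of bar.)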

- and it should be able to test that by a
*single* comparison between the set of columns being updated and the
partitioning columns, without needing to repeat for every partitions.

If bms_overlap() returns true for the very first resultRelInfo, it
will return immediately. But yes, if no relation uses a partition-key
column, we will have to scan all of these relations. Note, though,
that these are the pruned leaf partitions; they typically will not
include all the leaf partitions.

But they might, and then this will be inefficient. Just because the
patch doesn't waste many cycles in the case where most partitions are
pruned doesn't mean that it's OK for it to waste cycles when few
partitions are pruned.

One problem is that, if BR UPDATE triggers are in fact allowed to
cause tuple routing as I proposed above, the presence of a BR UPDATE
trigger for any partition could necessitate UPDATE tuple routing for
queries that wouldn't otherwise need it.

You mean always set up update tuple routing if there's a BR UPDATE
trigger?

Yes.

Actually I was going for disallowing a BR UPDATE trigger from
initiating tuple routing, as I described above.

I know that! But as I said before, that requires evaluating every
partition key constraint twice per tuple, which seems very expensive.
I'm very doubtful that's a good approach.

But even if you end up
inserting a test for that case, it can surely be a lot cheaper than
this,

I didn't quite get why the bms_overlap() test needs to be
compared with the presence-of-trigger test.

My point was: If you always set up tuple routing when a BR UPDATE
trigger is present, then you don't need to check the partition
constraint twice per tuple.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#108Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#106)
Re: UPDATE of partition key

On Wed, Jun 21, 2017 at 1:38 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Yep, it's more appropriate to use
ModifyTableState->rootResultRelationInfo->ri_RelationDesc somehow. That
is, if the answer to the question I raised above is positive.

From what I had checked earlier when coding that part,
rootResultRelInfo is NULL in case of inserts, unless something has
changed in later commits. That's the reason I decided to use the first
resultRelInfo.

We're just going around in circles here. Saying that you decided to
use the first child's resultRelInfo because you didn't have a
resultRelInfo for the parent is an explanation of why you wrote the
code the way you did, but that doesn't make it correct. I want to
know why you think it's correct.

I think it's probably wrong, because it seems to me that if the INSERT
code needs to use the parent's ResultRelInfo rather than the first
child's ResultRelInfo, the UPDATE code probably needs to do the same.
Commit d3cc37f1d801a6b5cad9bf179274a8d767f1ee50 got rid of
resultRelInfos for non-leaf partitions, and commit
e180c8aa8caf5c55a273d4a8e6092e77ff3cff10 added the resultRelInfo back
for the topmost parent, because otherwise it didn't work correctly.
If every partition in the hierarchy has a different attribute
ordering, then it seems to me that it must surely matter which of
those attribute orderings we pick. It's hard to imagine that we can
pick *either* the parent's attribute ordering *or* that of the first
child and nothing will be different - the attribute numbers inside the
returning lists and WCOs we create have got to get used somehow, so
surely it matters which attribute numbers we use, doesn't it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#109Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#107)
Re: UPDATE of partition key

On 22 June 2017 at 01:41, Robert Haas <robertmhaas@gmail.com> wrote:

Second, it will amount to a functional bug if you get a
different answer than the planner did.

Actually, the per-leaf WCOs are meant to be executed on the
destination partitions where the tuple is moved, while the WCOs
belonging to the per-subplan resultRelInfo are meant for the
resultRelinfo used for the UPDATE plans. So actually it should not
matter whether they look same or different, because they are fired at
different objects. Now these objects can happen to be the same
relations though.

But in any case, it's not clear to me how the mapped WCO and the
planner's WCO would yield a different answer if they are both the same
relation. I am possibly missing something. The planner has already
generated the withCheckOptions for each of the resultRelInfo. And then
we are using one of those to re-generate the WCO for a leaf partition
by only adjusting the attnos. If there is already a WCO generated in
the planner for that leaf partition (because that partition was
present in mtstate->resultRelInfo), then the re-built WCO should look
exactly the same as the earlier one, because they are the same
relations, and so the attnos generated in them would be same since the
Relation TupleDesc is the same.

If the planner's WCOs and mapped WCOs are always the same, then I
think we should try to avoid generating both. If they can be
different, but that's intentional and correct, then there's no
substantive problem with the patch but the comments need to make it
clear why we are generating both.

Actually I meant, "above works for only local updates. For
row-movement-updates, we need per-leaf partition WCOs, because when
the row is inserted into the target partition, that partition may not
be included in the above planner resultRelInfo, so we need WCOs for
all partitions". I think this said comment should be sufficient if I
add this in the code ?

Let's not get too focused on updating the comment until we are in
agreement about what the code ought to be doing. I'm not clear
whether you accept the point that the patch needs to be changed to
avoid generating the same WCOs and returning lists in both the planner
and the executor.

Yes, as an optimization we can re-use the WCOs generated in the
planner, since the ones we would re-generate for the same relations
will look exactly the same. The planner's WCOs are generated in
inheritance_planner(), when adjust_appendrel_attrs() changes the
attnos used in the query to refer to the child RTEs, which also
adjusts the attnos of the child RTEs' WCOs. So the WCOs of the subplan
resultRelInfos are actually the parent table's WCOs with only the
attnos changed. And in ExecInitModifyTable() we do the same thing for
leaf partitions, although using a different function,
map_variable_attnos().

Also, I feel like it's probably not correct to use the first result
relation as the nominal relation for building WCOs and returning lists
anyway. I mean, if the first result relation has a different column
order than the parent relation, isn't this just broken? If it works
for some reason, the comments don't explain what that reason is.

One thing I didn't mention earlier about the WCOs is that for child
rels, we don't use the WCOs defined on the child rels themselves; we
only inherit the WCO expressions defined for the root rel. That's why
they are all the same expressions, with only the attnos changed to
match each relation's tupledesc. If the WCOs of the subplan
resultRelInfos were each different, it would definitely not be
possible to use the first resultRelInfo to generate the other leaf
partitions' WCOs, because a WCO defined on relation A is independent
of one defined on relation B.

So, since the WCOs of all the relations are actually those of the
parent, we only need to adjust the attnos of any one of these
resultRelInfos.

For example, if the root rel WCO is defined as "col > 5" where col is the
4th column, the expression will look like "var_1.attno_4 > 5". And the
WCO that is generated for a subplan resultRelInfo will look something
like "var_n.attno_2 > 5" if col is the 2nd column in this table.
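
The attno adjustment can be illustrated with a small sketch (Python as pseudocode; map_attnos_by_name is an invented name, the real code uses map_variable_attnos()/map_partition_varattnos()): each attribute number in the qual is translated by looking the column name up in the other relation's column list.

```python
def map_attnos_by_name(qual_attnos, src_cols, dst_cols):
    """Sketch: remap 1-based attribute numbers in a qual from one
    relation's column order to another's, matching columns by name."""
    mapping = {attno: dst_cols.index(name) + 1
               for attno, name in enumerate(src_cols, start=1)}
    return [mapping[attno] for attno in qual_attnos]
```

So with parent columns (a, b, c) and a child declared as (c, a, b), a qual referencing parent attno 3 (column c) is rewritten to child attno 1.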

All of the above logic assumes that we never use the WCO defined on
the child relation. At least that's how it looks from the way we
generate WCOs in ExecInitModifyTable() for INSERTs, as well as from
the code in inheritance_planner() for UPDATEs. In both places, we
never use the WCOs defined on child tables.

So suppose we define the tables and their WCOs like this :

CREATE TABLE range_parted ( a text, b int, c int) partition by range (a, b);

ALTER TABLE range_parted ENABLE ROW LEVEL SECURITY;
GRANT ALL ON range_parted TO PUBLIC ;
create policy seeall ON range_parted as PERMISSIVE for SELECT using ( true);

create table part_b_10_b_20 partition of range_parted for values from
('b', 10) to ('b', 20) partition by range (c);

create table part_c_1_100 (b int, c int, a text);
alter table part_b_10_b_20 attach partition part_c_1_100 for values
from (1) to (100);
create table part_c_100_200 (c int, a text, b int);
alter table part_b_10_b_20 attach partition part_c_100_200 for values
from (100) to (200);

GRANT ALL ON part_c_100_200 TO PUBLIC ;
ALTER TABLE part_c_100_200 ENABLE ROW LEVEL SECURITY;
create policy seeall ON part_c_100_200 as PERMISSIVE for SELECT using ( true);

insert into part_c_1_100 (a, b, c) values ('b', 12, 96);
insert into part_c_1_100 (a, b, c) values ('b', 13, 97);
insert into part_c_100_200 (a, b, c) values ('b', 15, 105);
insert into part_c_100_200 (a, b, c) values ('b', 17, 105);

-- For root table, allow updates only if NEW.c is even number.
create policy pu on range_parted for UPDATE USING (true) WITH CHECK (c % 2 = 0);
-- For this table, allow updates only if NEW.c is divisible by 4.
create policy pu on part_c_100_200 for UPDATE USING (true) WITH CHECK
(c % 4 = 0);

Now, if we update the child table via an UPDATE on the root table, it
will allow setting c to a value that would violate the child table's
WCO if the query had been run on the child table directly:

postgres=# set role user1;
SET
postgres=> select tableoid::regclass, * from range_parted where b = 17;
    tableoid    | a | b  |  c
----------------+---+----+-----
 part_c_100_200 | b | 17 | 105
-- root table does not allow updating c to odd numbers
postgres=> update range_parted set c = 107 where a = 'b' and b = 17 ;
ERROR: new row violates row-level security policy for table "range_parted"
-- child table does not allow updating c to 106 because it is not
-- divisible by 4
postgres=> update part_c_100_200 set c = 106 where a = 'b' and b = 17 ;
ERROR: new row violates row-level security policy for table "part_c_100_200"
-- But we can update it to 106 by running the update on the root table,
-- because there the child table's WCOs are not used.
postgres=> update range_parted set c = 106 where a = 'b' and b = 17 ;
UPDATE 1
postgres=> select tableoid::regclass, * from range_parted where b = 17;
    tableoid    | a | b  |  c
----------------+---+----+-----
 part_c_100_200 | b | 17 | 106

The same applies to INSERTs. I hope this is expected behaviour.
Initially I found this weird, but then saw that it is consistent for
both inserts and updates.

Not sure why the parent relation should come into the picture. As long as the
first result relation belongs to one of the partitions in the whole
partition tree, we should be able to use it to build the WCOs of any
other partition, because they all have a common set of attributes with
the same names. So we are bound to find each attribute of the first
resultRelInfo in the other leaf partitions during attno mapping.

Well, at least for returning lists, we've got to generate the
returning lists so that they all match the column order of the parent,
not the parent's first child.
Otherwise, for example, UPDATE
parent_table ... RETURNING * will not work correctly. The tuples
returned by the returning clause have to have the attribute order of
parent_table, not the attribute order of parent_table's first child.
I'm not sure whether WCOs have the same issue, but it's not clear to
me why they wouldn't: they contain a qual which is an expression tree,
and presumably there are Var nodes in there someplace, and if so, then
they have varattnos that have to be right for the purpose for which
they're going to be used.

So once we fix up the attnos according to the child relation's
tupdesc, the rest of the work of generating the final RETURNING
expressions in the root table's column order is taken care of by the
returning projection, no?

This scenario is included in the update.sql regression test here:
-- ok (row movement, with subset of rows moved into different partition)
update range_parted set b = b - 6 where c > 116 returning a, b + c;

+    for (i = 0; i < num_rels; i++)
+    {
+        ResultRelInfo *resultRelInfo = &result_rels[i];
+        Relation        rel = resultRelInfo->ri_RelationDesc;
+        Bitmapset     *expr_attrs = NULL;
+
+        pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs);
+
+        /* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */
+        if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate)))
+            return true;
+    }

This seems like an awfully expensive way of performing this test.
Under what circumstances could this be true for some result relations
and false for others;

One resultRelInfo may have no partition-key column used in its quals,
while the next resultRelInfo can have quite different quals that do
refer to a partition key. This is possible if the two of them have
different parents with different partition-key columns.

Hmm, true. So if we have a table foo that is partitioned by list (a),
and one of its children is a table bar that is partitioned by list
(b), then we need to consider doing tuple-routing if either column a
is modified, or if column b is modified for a partition which is a
descendant of bar. But visiting that only requires looking at the
partitioned table and those children that are also partitioned, not
all of the leaf partitions as the patch does.
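A minimal sketch of that idea, using hypothetical types rather than the actual patch code (the real implementation would walk PartitionDispatch structures and use bms_overlap on attno bitmapsets): only the partitioned tables in the tree carry partition-key columns, so leaves need never be visited.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical, simplified stand-in for a node of the partition tree:
 * each partitioned table records the attnos of its partition-key columns;
 * leaf partitions have no key of their own and are skipped.
 */
typedef struct PartNode
{
	const int  *key_attnos;		/* partition-key columns, or NULL for a leaf */
	int			nkeys;
	struct PartNode *children;
	int			nchildren;
} PartNode;

static bool
attno_in(const int *set, int n, int attno)
{
	for (int i = 0; i < n; i++)
		if (set[i] == attno)
			return true;
	return false;
}

/*
 * Returns true if any partitioned table in the subtree has a partition-key
 * column among the updated columns; leaf partitions are never examined.
 */
static bool
update_may_require_routing(const PartNode *node,
						   const int *updated, int nupdated)
{
	if (node->key_attnos == NULL)
		return false;			/* leaf partition: nothing to check */

	for (int i = 0; i < node->nkeys; i++)
		if (attno_in(updated, nupdated, node->key_attnos[i]))
			return true;

	for (int i = 0; i < node->nchildren; i++)
		if (update_may_require_routing(&node->children[i],
									   updated, nupdated))
			return true;

	return false;
}
```

With foo partitioned by (a) and its child bar partitioned by (b), an UPDATE of b triggers the check via bar alone; an UPDATE of an unrelated column c does not, no matter how many leaves exist.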

Will give this some thought and get back on this and the remaining points.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#110Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#109)
1 attachment(s)
Re: UPDATE of partition key

On 26 June 2017 at 08:37, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 22 June 2017 at 01:41, Robert Haas <robertmhaas@gmail.com> wrote:

Second, it will amount to a functional bug if you get a
different answer than the planner did.

Actually, the per-leaf WCOs are meant to be executed on the
destination partitions where the tuple is moved, while the WCOs
belonging to the per-subplan resultRelInfos are meant for the
resultRelInfos used in the UPDATE plans. So it should not actually
matter whether they look the same or different, because they are fired
for different objects. Those objects can, however, happen to be the
same relations.

But in any case, it's not clear to me how the mapped WCO and the
planner's WCO could yield different answers if they are both for the
same relation. I am possibly missing something. The planner has
already generated the withCheckOptions for each of the
resultRelInfos. And then we use one of those to re-generate the WCO
for a leaf partition by only adjusting the attnos. If there is already
a WCO generated in the planner for that leaf partition (because that
partition was present in mtstate->resultRelInfo), then the re-built
WCO should look exactly the same as the earlier one, because they are
for the same relation, and so the attnos generated in them would be
the same since the Relation TupleDesc is the same.

If the planner's WCOs and mapped WCOs are always the same, then I
think we should try to avoid generating both. If they can be
different, but that's intentional and correct, then there's no
substantive problem with the patch but the comments need to make it
clear why we are generating both.

Actually I meant: "the above works only for local updates. For
row-movement updates, we need per-leaf-partition WCOs, because when
the row is inserted into the target partition, that partition may not
be included in the above planner resultRelInfos, so we need WCOs for
all partitions". I think the said comment should be sufficient if I
add it in the code?

Let's not get too focused on updating the comment until we are in
agreement about what the code ought to be doing. I'm not clear
whether you accept the point that the patch needs to be changed to
avoid generating the same WCOs and returning lists in both the planner
and the executor.

Yes, we can re-use the WCOs generated in the planner, as an
optimization, since the ones we would re-generate for the same
relations would look exactly the same. The WCOs generated by the
planner (in inheritance_planner) are produced when, in
adjust_appendrel_attrs(), we change the attnos used in the query to
refer to the child RTEs, which also adjusts the attnos of the WCOs of
the child RTEs. So the WCOs of a subplan resultRelInfo are actually
the parent table's WCOs with only the attnos changed. And in
ExecInitModifyTable() we do the same thing for leaf partitions,
although using a different function, map_variable_attnos().
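The attno adjustment both paths perform can be illustrated with a small sketch. This is a hypothetical, flattened stand-in: the real map_variable_attnos() walks a Node expression tree, but the per-Var remapping rule is the same, and the map layout matches what convert_tuples_by_name_map() produces (entry i-1 holds the target attno for source attno i).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Remap 1-based attribute numbers through a conversion map.
 * attno_map[i] gives the to-rel attno for from-rel attno i+1, which is
 * the layout convert_tuples_by_name_map() uses. Attnos outside the map's
 * range (e.g. system attributes) are left untouched in this sketch.
 */
static void
remap_attnos(int *var_attnos, int nvars, const int *attno_map, int map_len)
{
	for (int i = 0; i < nvars; i++)
	{
		int			old = var_attnos[i];

		if (old >= 1 && old <= map_len)
			var_attnos[i] = attno_map[old - 1];
	}
}
```

If a child stores the parent's columns in a different physical order, the same WCO expression comes out with child-appropriate attnos regardless of whether the planner or the executor applies the map, which is why the two results coincide for the same relation.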

In attached patch v12, during UPDATE tuple routing setup, for each
leaf partition, we now check if it is present already in one of the
UPDATE per-subplan resultrels. If present, we re-use them rather than
creating a new one and opening the table again.

So mtstate->mt_partitions is now an array of ResultRelInfo
pointers. Each pointer points either to an UPDATE per-subplan result
rel or to a newly allocated ResultRelInfo.

For each of the leaf partitions, we have to search through the
per-subplan resultRelInfo oids to check whether there is a match. To
do this, I have created a temporary hash table that stores the oids
and the corresponding ResultRelInfo pointers of the
mtstate->resultRelInfo array, so that each leaf partition's oid can be
looked up in it.
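The lookup scheme can be sketched with a toy open-addressing table; this is only an illustration of the dedup idea, not the patch's code, which uses the backend's dynahash (hash_create with HASH_ELEM | HASH_BLOBS, as in the ResultRelOidsEntry hunk below).

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Oid;

/*
 * Minimal open-addressing table mapping Oid -> subplan index; a toy
 * stand-in for the dynahash table keyed by result-rel oid. Oid 0
 * (InvalidOid) marks an empty slot.
 */
#define TABSIZE 64				/* power of two, larger than num_update_rri */

typedef struct
{
	Oid			oid;
	int			index;
} OidSlot;

static void
oid_table_init(OidSlot *tab)
{
	for (int i = 0; i < TABSIZE; i++)
		tab[i].oid = 0;
}

static void
oid_table_put(OidSlot *tab, Oid oid, int index)
{
	int			i = oid % TABSIZE;

	while (tab[i].oid != 0)
		i = (i + 1) % TABSIZE;
	tab[i].oid = oid;
	tab[i].index = index;
}

/*
 * Returns the per-subplan index whose result rel can be re-used for this
 * leaf partition, or -1 if a fresh ResultRelInfo must be created.
 */
static int
oid_table_get(const OidSlot *tab, Oid oid)
{
	int			i = oid % TABSIZE;

	while (tab[i].oid != 0)
	{
		if (tab[i].oid == oid)
			return tab[i].index;
		i = (i + 1) % TABSIZE;
	}
	return -1;
}
```

Populating the table once from the per-subplan result rels and probing it per leaf makes the whole matching pass roughly linear, instead of scanning the subplan array for every leaf partition.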

This patch version has handled only the above discussion point. I will
follow up with the other points separately.

Attachments:

update-partition-key_v12.patch (application/octet-stream)
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index b05a9c2..5a436a1 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -2993,6 +2993,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition where this row satisfies its partition constraint.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3285,9 +3290,20 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</> causes a row to move from one partition to
+       another, there is a chance that another concurrent <command>UPDATE</> or
+       <command>DELETE</> misses this row. Suppose, during the row movement,
+       the row is still visible for the concurrent session, and it is about to
+       do an <command>UPDATE</> or <command>DELETE</> operation on the same
+       row. This DML operation can silently miss this row if the row now gets
+       deleted from the partition by the first session as part of its
+       <command>UPDATE</> row movement. In such case, the concurrent
+       <command>UPDATE</>/<command>DELETE</>, being unaware of the row
+       movement, interprets that the row has just been deleted so there is
+       nothing to be done for this row. Whereas, in the usual case where the
+       table is not partitioned, or where there is no row movement, the second
+       session would have identified the newly updated row and carried
+       <command>UPDATE</>/<command>DELETE</> on this new row version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 8a1619f..28cfc1a 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</> and
+   <command>INSERT</> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</> or <command>DELETE</> on the same row may miss
+   this row. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index 950245d..72300a0 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -160,6 +160,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it will be performed as a
+    <command>DELETE</command> from the original partition followed by
+    <command>INSERT</command> into the new partition. In this case, all
+    row-level <literal>BEFORE</> <command>UPDATE</command> triggers and all
+    row-level <literal>BEFORE</> <command>DELETE</command> triggers are fired
+    on the original partition. Then all row-level <literal>BEFORE</>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</>
+    triggers are concerned, <literal>AFTER</> <command>DELETE</command> and
+    <literal>AFTER</> <command>INSERT</command> triggers are applied; but
+    <literal>AFTER</> <command>UPDATE</command> triggers are not applied
+    because the <command>UPDATE</command> has been converted to a
+    <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index f8c55b1..c9f5dd6 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -921,7 +921,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
 
 /*
  * map_partition_varattnos - maps varattno of any Vars in expr from the
- * parent attno to partition attno.
+ * attno's of 'from_rel' partition to the attno's of 'to_rel' partition.
+ * The rels can be both leaf partition or a partitioned table.
  *
  * We must allow for cases where physical attnos of a partition can be
  * different from the parent's.
@@ -931,8 +932,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
  * are working on Lists, so it's less messy to do the casts internally.
  */
 List *
-map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent)
+map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel)
 {
 	AttrNumber *part_attnos;
 	bool		found_whole_row;
@@ -940,13 +941,13 @@ map_partition_varattnos(List *expr, int target_varno,
 	if (expr == NIL)
 		return NIL;
 
-	part_attnos = convert_tuples_by_name_map(RelationGetDescr(partrel),
-											 RelationGetDescr(parent),
+	part_attnos = convert_tuples_by_name_map(RelationGetDescr(to_rel),
+											 RelationGetDescr(from_rel),
 											 gettext_noop("could not convert row type"));
 	expr = (List *) map_variable_attnos((Node *) expr,
-										target_varno, 0,
+										fromrel_varno, 0,
 										part_attnos,
-										RelationGetDescr(parent)->natts,
+										RelationGetDescr(from_rel)->natts,
 										&found_whole_row);
 	/* There can never be a whole-row reference here */
 	if (found_whole_row)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3c399e2..6187afe 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -168,7 +168,7 @@ typedef struct CopyStateData
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;	/* Number of entries in the above array */
 	int			num_partitions; /* Number of members in the following arrays */
-	ResultRelInfo *partitions;	/* Per partition result relation */
+	ResultRelInfo **partitions;	/* Per partition result relation pointers */
 	TupleConversionMap **partition_tupconv_maps;
 	TupleTableSlot *partition_tuple_slot;
 
@@ -1418,13 +1418,13 @@ BeginCopy(ParseState *pstate,
 		if (is_from && rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 		{
 			PartitionDispatch *partition_dispatch_info;
-			ResultRelInfo *partitions;
+			ResultRelInfo **partitions;
 			TupleConversionMap **partition_tupconv_maps;
 			TupleTableSlot *partition_tuple_slot;
 			int			num_parted,
 						num_partitions;
 
-			ExecSetupPartitionTupleRouting(rel,
+			ExecSetupPartitionTupleRouting(rel, NULL, 0,
 										   &partition_dispatch_info,
 										   &partitions,
 										   &partition_tupconv_maps,
@@ -2578,7 +2578,7 @@ CopyFrom(CopyState cstate)
 			 * to the selected partition.
 			 */
 			saved_resultRelInfo = resultRelInfo;
-			resultRelInfo = cstate->partitions + leaf_part_index;
+			resultRelInfo = cstate->partitions[leaf_part_index];
 
 			/* We do not yet have a way to insert into a foreign partition */
 			if (resultRelInfo->ri_FdwRoutine)
@@ -2658,7 +2658,7 @@ CopyFrom(CopyState cstate)
 
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr || check_partition_constr)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
@@ -2778,7 +2778,7 @@ CopyFrom(CopyState cstate)
 		}
 		for (i = 0; i < cstate->num_partitions; i++)
 		{
-			ResultRelInfo *resultRelInfo = cstate->partitions + i;
+			ResultRelInfo *resultRelInfo = cstate->partitions[i];
 
 			ExecCloseIndices(resultRelInfo);
 			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 7f0d21f..93cc953 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -64,6 +64,18 @@
 #include "utils/snapmgr.h"
 #include "utils/tqual.h"
 
+/*
+ * Entry of a temporary hash table. During UPDATE tuple routing, we want to
+ * know which of the leaf partitions are present in the UPDATE per-subplan
+ * resultRelInfo array (ModifyTableState->resultRelInfo[]). This hash table
+ * is searchable by the oids of the subplan result rels.
+ */
+typedef struct ResultRelOidsEntry
+{
+	Oid			rel_oid;
+	ResultRelInfo *resultRelInfo;
+} ResultRelOidsEntry;
+
 
 /* Hooks for plugins to get control in ExecutorStart/Run/Finish/End */
 ExecutorStart_hook_type ExecutorStart_hook = NULL;
@@ -103,8 +115,6 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
 									 int maxfieldlen);
 static void EvalPlanQualStart(EPQState *epqstate, EState *parentestate,
 				  Plan *planTree);
-static void ExecPartitionCheck(ResultRelInfo *resultRelInfo,
-				   TupleTableSlot *slot, EState *estate);
 
 /*
  * Note that GetUpdatedColumns() also exists in commands/trigger.c.  There does
@@ -1823,15 +1833,10 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
 /*
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  */
-static void
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
-	Relation	rel = resultRelInfo->ri_RelationDesc;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	Bitmapset  *modifiedCols;
-	Bitmapset  *insertedCols;
-	Bitmapset  *updatedCols;
 	ExprContext *econtext;
 
 	/*
@@ -1859,51 +1864,65 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
 	 */
-	if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+	return ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext);
+}
+
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
+	/* See the comments in ExecConstraints. */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
+									 gettext_noop("could not convert row type"));
+		if (map != NULL)
 		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
-										 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+			tuple = do_convert_tuple(tuple, map);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("new row for relation \"%s\" violates partition constraint",
-						RelationGetRelationName(orig_rel)),
-				 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+			 errmsg("new row for relation \"%s\" violates partition constraint",
+					RelationGetRelationName(orig_rel)),
+			 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if requested,
+ * checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1911,7 +1930,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2024,8 +2044,9 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 }
 
 
@@ -3190,10 +3211,14 @@ EvalPlanQualEnd(EPQState *epqstate)
  * ExecSetupPartitionTupleRouting - set up information needed during
  * tuple routing for partitioned tables
  *
+ * 'update_rri' has the UPDATE per-subplan result rels.
+ * 'num_update_rri' : number of UPDATE per-subplan result rels. For INSERT,
+ *      this is 0.
+ *
  * Output arguments:
  * 'pd' receives an array of PartitionDispatch objects with one entry for
  *		every partitioned table in the partition tree
- * 'partitions' receives an array of ResultRelInfo objects with one entry for
+ * 'partitions' receives an array of ResultRelInfo* objects with one entry for
  *		every leaf partition in the partition tree
  * 'tup_conv_maps' receives an array of TupleConversionMap objects with one
  *		entry for every leaf partition (required to convert input tuple based
@@ -3213,8 +3238,10 @@ EvalPlanQualEnd(EPQState *epqstate)
  */
 void
 ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   PartitionDispatch **pd,
-							   ResultRelInfo **partitions,
+							   ResultRelInfo ***partitions,
 							   TupleConversionMap ***tup_conv_maps,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions)
@@ -3223,18 +3250,60 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
+	HTAB	   *result_rel_oids = NULL;
+	HASHCTL		ctl;
+	ResultRelOidsEntry *hash_entry;
+	ResultRelInfo *leaf_part_arr;
 
 	/* Get the tuple-routing information and lock partitions */
 	*pd = RelationGetPartitionDispatchInfo(rel, RowExclusiveLock, num_parted,
 										   &leaf_parts);
 	*num_partitions = list_length(leaf_parts);
-	*partitions = (ResultRelInfo *) palloc(*num_partitions *
-										   sizeof(ResultRelInfo));
+	*partitions = (ResultRelInfo **) palloc(*num_partitions *
+										   sizeof(ResultRelInfo*));
 	*tup_conv_maps = (TupleConversionMap **) palloc0(*num_partitions *
 													 sizeof(TupleConversionMap *));
 
 	/*
+	 * For Updates, if the leaf partition is already present in the per-subplan
+	 * result rels, we re-use that rather than initialize a new result rel. So
+	 * to find whether a given leaf partition already has a resultRel, we build
+	 * the hash table for searching each of the leaf partitions by oid.
+	 */
+	if (num_update_rri != 0)
+	{
+		ResultRelInfo	   *resultRelInfo;
+
+		memset(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(Oid);
+		ctl.entrysize = sizeof(ResultRelOidsEntry);
+		ctl.hcxt = CurrentMemoryContext;
+		result_rel_oids = hash_create("result_rel_oids temporary hash",
+								32, /* start small and extend */
+								&ctl,
+								HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+		resultRelInfo = update_rri;
+		for (i = 0; i < num_update_rri; i++, resultRelInfo++)
+		{
+			Oid reloid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+
+			hash_entry = hash_search(result_rel_oids, &reloid,
+									 HASH_ENTER, NULL);
+			hash_entry->resultRelInfo = resultRelInfo;
+		}
+	}
+	else
+	{
+		/*
+		 * For inserts, we need to create all new result rels, so avoid repeated
+		 * pallocs by allocating memory for all the result rels in bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(*num_partitions *
+												  sizeof(ResultRelInfo));
+	}
+
+	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
 	 * given partition's rowtype.  It is attached to the caller-specified node
 	 * (such as ModifyTableState) and released when the node finishes
@@ -3242,23 +3311,65 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	 */
 	*partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = *partitions;
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (num_update_rri != 0)
+		{
+			/*
+			 * If this leaf partition is already present in the per-subplan
+			 * resultRelInfos, re-use that resultRelInfo along with its
+			 * already-opened relation; otherwise create a new result rel.
+			 */
+			hash_entry = hash_search(result_rel_oids, &leaf_oid,
+									 HASH_FIND, NULL);
+			if (hash_entry != NULL)
+			{
+				leaf_part_rri = hash_entry->resultRelInfo;
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required when converting tuple as per root partition
+				 * tuple descriptor. When generating the update plans, this was
+				 * not set.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = leaf_part_arr + i;
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in *partitions are eventually
-		 * closed by the caller.
+		 * If we didn't open the partition rel, it means we haven't initialized
+		 * the result rel as well.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above including the leaf partitions.
+			 * Note that each of the newly opened relations in *partitions are
+			 * eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri, partrel, 1 /* dummy */, rel, 0);
+		}
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
-		 * Verify result relation is a valid target for the current operation.
+		 * Verify result relation is a valid target for insert operation.
+		 * Even for updates, we are doing this for tuple-routing, so again,
+		 * we need to check the validity for insert operation.
 		 */
 		CheckValidResultRel(partrel, CMD_INSERT);
 
@@ -3269,12 +3380,6 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
 													 gettext_noop("could not convert row type"));
 
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  1,	/* dummy */
-						  rel,
-						  0);
-
 		/*
 		 * Open partition indices (remember we do not support ON CONFLICT in
 		 * case of partitioned tables, so we do not need support information
@@ -3284,9 +3389,12 @@ ExecSetupPartitionTupleRouting(Relation rel,
 			leaf_part_rri->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(leaf_part_rri, false);
 
-		leaf_part_rri++;
+		(*partitions)[i] = leaf_part_rri;
 		i++;
 	}
+
+	if (result_rel_oids != NULL)
+		hash_destroy(result_rel_oids);
 }
 
 /*
@@ -3312,8 +3420,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple it if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 	result = get_partition_for_tuple(pd, slot, estate,
 									 &failed_at, &failed_slot);
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 59f14e9..9eb2b7e 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -402,7 +402,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -461,7 +461,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 5e43a06..45df343 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -45,6 +45,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -53,6 +54,8 @@
 #include "utils/rel.h"
 #include "utils/tqual.h"
 
+#define GetUpdatedColumns(relinfo, estate) \
+	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)
 
 static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 ResultRelInfo *resultRelInfo,
@@ -239,6 +242,34 @@ ExecCheckTIDVisible(EState *estate,
 	ReleaseBuffer(buffer);
 }
 
+/*
+ * ConvertPartitionTupleSlot -- convenience function for converting tuple and
+ * storing it into a dedicated partition tuple slot. Passes the partition
+ * tuple slot back into output param p_slot. If no mapping present, keeps
+ * p_slot unchanged.
+ *
+ * Returns the converted tuple.
+ */
+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate, TupleConversionMap *map,
+						  HeapTuple tuple, TupleTableSlot **p_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor, as per converted tuple.
+	 */
+	*p_slot = mtstate->mt_partition_tuple_slot;
+	Assert(*p_slot != NULL);
+	ExecSetSlotDescriptor(*p_slot, map->outdesc);
+	ExecStoreTuple(tuple, *p_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
 /* ----------------------------------------------------------------
  *		ExecInsert
  *
@@ -280,7 +311,38 @@ ExecInsert(ModifyTableState *mtstate,
 	if (mtstate->mt_partition_dispatch_info)
 	{
 		int			leaf_part_index;
-		TupleConversionMap *map;
+		ResultRelInfo *rootResultRelInfo;
+
+		/*
+		 * If the original operation is UPDATE, the root partition rel needs to
+		 * be fetched from mtstate->rootResultRelInfo.
+		 */
+		rootResultRelInfo = (mtstate->rootResultRelInfo ?
+							 mtstate->rootResultRelInfo : resultRelInfo);
+
+		/*
+		 * If the resultRelInfo is not the root partition (which happens for
+		 * UPDATE), we should convert the tuple into root partition's tuple
+		 * descriptor, since ExecFindPartition() starts the search from root.
+		 * The tuple conversion map list is in the order of
+		 * mstate->resultRelInfo[], so to retrieve the one for this resultRel,
+		 * we need to know the position of the resultRel in
+		 * mtstate->resultRelInfo[]. Note: We assume that if the resultRelInfo
+		 * does not belong to subplans, then it already matches the root tuple
+		 * descriptor; although there is no such known scenario where this
+		 * could happen.
+		 */
+		if (rootResultRelInfo != resultRelInfo &&
+			mtstate->mt_resultrel_maps != NULL &&
+			resultRelInfo >= mtstate->resultRelInfo &&
+			resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans-1)
+		{
+			int		map_index = resultRelInfo - mtstate->resultRelInfo;
+
+			tuple = ConvertPartitionTupleSlot(mtstate,
+									  mtstate->mt_resultrel_maps[map_index],
+									  tuple, &slot);
+		}
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
@@ -290,7 +352,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
-		leaf_part_index = ExecFindPartition(resultRelInfo,
+		leaf_part_index = ExecFindPartition(rootResultRelInfo,
 											mtstate->mt_partition_dispatch_info,
 											slot,
 											estate);
@@ -302,7 +364,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * the selected partition.
 		 */
 		saved_resultRelInfo = resultRelInfo;
-		resultRelInfo = mtstate->mt_partitions + leaf_part_index;
+		resultRelInfo = mtstate->mt_partitions[leaf_part_index];
 
 		/* We do not yet have a way to insert into a foreign partition */
 		if (resultRelInfo->ri_FdwRoutine)
@@ -317,23 +379,9 @@ ExecInsert(ModifyTableState *mtstate,
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = mtstate->mt_partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = mtstate->mt_partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(mtstate,
+						mtstate->mt_partition_tupconv_maps[leaf_part_index],
+						tuple, &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -451,7 +499,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 		/* Check the constraints of the tuple */
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -641,6 +689,8 @@ ExecDelete(ItemPointer tupleid,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool   *concurrently_deleted,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -649,6 +699,9 @@ ExecDelete(ItemPointer tupleid,
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
 
+	if (concurrently_deleted)
+		*concurrently_deleted = false;
+
 	/*
 	 * get information on the (current) result relation
 	 */
@@ -792,6 +845,8 @@ ldelete:;
 					}
 				}
 				/* tuple already deleted; nothing to do */
+				if (concurrently_deleted)
+					*concurrently_deleted = true;
 				return NULL;
 
 			default:
@@ -815,8 +870,8 @@ ldelete:;
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -894,7 +949,8 @@ ldelete:;
  * ----------------------------------------------------------------
  */
 static TupleTableSlot *
-ExecUpdate(ItemPointer tupleid,
+ExecUpdate(ModifyTableState *mtstate,
+		   ItemPointer tupleid,
 		   HeapTuple oldtuple,
 		   TupleTableSlot *slot,
 		   TupleTableSlot *planSlot,
@@ -908,6 +964,8 @@ ExecUpdate(ItemPointer tupleid,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	bool		partition_check_passed = true;
+	bool		has_br_trigger;
 
 	/*
 	 * abort the operation if not running transactions
@@ -928,16 +986,56 @@ ExecUpdate(ItemPointer tupleid,
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
 
 	/* BEFORE ROW UPDATE Triggers */
-	if (resultRelInfo->ri_TrigDesc &&
-		resultRelInfo->ri_TrigDesc->trig_update_before_row)
+	has_br_trigger = (resultRelInfo->ri_TrigDesc &&
+					  resultRelInfo->ri_TrigDesc->trig_update_before_row);
+
+	if (has_br_trigger)
 	{
-		slot = ExecBRUpdateTriggers(estate, epqstate, resultRelInfo,
-									tupleid, oldtuple, slot);
+		TupleTableSlot *trig_slot;
 
-		if (slot == NULL)		/* "do nothing" */
+		trig_slot = ExecBRUpdateTriggers(estate, epqstate, resultRelInfo,
+										 tupleid, oldtuple, slot);
+
+		if (trig_slot == NULL)		/* "do nothing" */
 			return NULL;
 
+		if (resultRelInfo->ri_PartitionCheck)
+		{
+			bool		partition_check_passed_with_trig_tuple;
+
+			partition_check_passed =
+				(resultRelInfo->ri_PartitionCheck &&
+				 ExecPartitionCheck(resultRelInfo, slot, estate));
+
+			partition_check_passed_with_trig_tuple =
+				(resultRelInfo->ri_PartitionCheck &&
+				 ExecPartitionCheck(resultRelInfo, trig_slot, estate));
+
+			if (partition_check_passed)
+			{
+				/*
+				 * If it's the trigger that is causing partition constraint
+				 * violation, abort. We don't want a trigger to cause tuple
+				 * routing.
+				 */
+				if (!partition_check_passed_with_trig_tuple)
+					ExecPartitionCheckEmitError(resultRelInfo,
+												trig_slot, estate);
+			}
+			else
+			{
+				/*
+				 * Partition constraint failed with the original NEW tuple.
+				 * But the trigger might have modified the tuple such that it
+				 * fits back into the partition, so the partition constraint
+				 * check should be based on the *final* NEW tuple.
+				 */
+				partition_check_passed = partition_check_passed_with_trig_tuple;
+			}
+		}
+
 		/* trigger might have changed tuple */
+		slot = trig_slot;
 		tuple = ExecMaterializeSlot(slot);
 	}
 
@@ -1004,12 +1102,60 @@ lreplace:;
 								 resultRelInfo, slot, estate);
 
 		/*
+		 * If a partition check fails, try to move the row into the right
+		 * partition. With a BR trigger, the tuple has already gone through
+		 * EPQ and has been locked, so it won't change again; hence, avoid an
+		 * extra partition check if we already did it above in the presence
+		 * of BR triggers.
+		 */
+		if (!has_br_trigger)
+		{
+			partition_check_passed =
+				(!resultRelInfo->ri_PartitionCheck ||
+				ExecPartitionCheck(resultRelInfo, slot, estate));
+		}
+
+		if (!partition_check_passed)
+		{
+			bool	concurrently_deleted;
+
+			/*
+			 * When an UPDATE is run directly on a leaf partition, we do not
+			 * have partition tuple routing set up. In that case, fail with a
+			 * partition constraint violation error.
+			 */
+			if (mtstate->mt_partition_dispatch_info == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for the DELETE; we want to return
+			 * rows from the INSERT instead.
+			 */
+			ExecDelete(tupleid, oldtuple, planSlot, epqstate, estate,
+					   &concurrently_deleted, false, false);
+
+			/*
+			 * The row was already deleted by a concurrent DELETE. So we don't
+			 * have anything to update.
+			 */
+			if (concurrently_deleted)
+				return NULL;
+
+			return ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+		}
+
+		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We have already checked partition constraints above, so skip them
+		 * below.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1329,7 +1475,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 	 */
 
 	/* Execute UPDATE with projection */
-	*returning = ExecUpdate(&tuple.t_self, NULL,
+	*returning = ExecUpdate(mtstate, &tuple.t_self, NULL,
 							mtstate->mt_conflproj, planSlot,
 							&mtstate->mt_epqstate, mtstate->ps.state,
 							canSetTag);
@@ -1411,6 +1557,35 @@ fireASTriggers(ModifyTableState *node)
 	}
 }
 
+/*
+ * Check whether partition key is modified for any of the relations.
+ */
+static bool
+IsPartitionKeyUpdate(EState *estate, ResultRelInfo *result_rels, int num_rels)
+{
+	int		i;
+
+	/*
+	 * Each of the result relations has its updated-columns set stored
+	 * according to its own column ordering. So we need to pull the attnos
+	 * from the partition quals of each relation, and check whether the
+	 * updated column attributes are present among the Vars in those quals.
+	 */
+	for (i = 0; i < num_rels; i++)
+	{
+		ResultRelInfo *resultRelInfo = &result_rels[i];
+		Relation		rel = resultRelInfo->ri_RelationDesc;
+		Bitmapset	  *expr_attrs = NULL;
+
+		pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs);
+
+		/* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */
+		if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate)))
+			return true;
+	}
+
+	return false;
+}
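As a reading aid, the overlap test that IsPartitionKeyUpdate() performs above boils down to a bitmap-intersection check per result relation. Here is a toy Python sketch (not the patch's C code): sets stand in for Bitmapsets, and the relation layouts below are invented purely for illustration.

```python
def is_partition_key_update(result_rels):
    """Return True if, for any result relation, the updated columns
    overlap the attnos referenced by its partition quals
    (cf. pull_varattnos + bms_overlap in the C code)."""
    for rel in result_rels:
        # Attnos appearing in this relation's partition quals, expressed
        # in the relation's own column ordering.
        expr_attrs = set(rel["partqual_attnos"])
        if expr_attrs & set(rel["updated_attnos"]):
            return True
    return False

# Hypothetical relations: the second one modifies a partition-key column.
rels = [
    {"partqual_attnos": {1}, "updated_attnos": {3}},     # non-key update
    {"partqual_attnos": {2}, "updated_attnos": {2, 3}},  # key update
]
print(is_partition_key_update(rels))  # True
```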
 
 /* ----------------------------------------------------------------
  *	   ExecModifyTable
@@ -1618,12 +1793,13 @@ ExecModifyTable(ModifyTableState *node)
 								  estate, node->canSetTag);
 				break;
 			case CMD_UPDATE:
-				slot = ExecUpdate(tupleid, oldtuple, slot, planSlot,
+				slot = ExecUpdate(node, tupleid, oldtuple, slot, planSlot,
 								  &node->mt_epqstate, estate, node->canSetTag);
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1663,11 +1839,14 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 {
 	ModifyTableState *mtstate;
 	CmdType		operation = node->operation;
+	bool		is_partitionkey_update = false;
 	int			nplans = list_length(node->plans);
 	ResultRelInfo *saved_resultRelInfo;
 	ResultRelInfo *resultRelInfo;
 	TupleDesc	tupDesc;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
@@ -1779,18 +1958,30 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/* Remember whether it is going to be an update of partition key. */
+	is_partitionkey_update =
+				(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+				operation == CMD_UPDATE &&
+				IsPartitionKeyUpdate(estate, mtstate->resultRelInfo, nplans));
+
+	/*
+	 * Build state for tuple routing if it's an INSERT or if it's an UPDATE of
+	 * partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || is_partitionkey_update))
 	{
 		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo *partitions;
+		ResultRelInfo **partitions;
 		TupleConversionMap **partition_tupconv_maps;
 		TupleTableSlot *partition_tuple_slot;
 		int			num_parted,
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(rel,
+									   (operation == CMD_UPDATE ?
+											mtstate->resultRelInfo : NULL),
+									   (operation == CMD_UPDATE ? nplans : 0),
 									   &partition_dispatch_info,
 									   &partitions,
 									   &partition_tupconv_maps,
@@ -1802,6 +1993,43 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		mtstate->mt_num_partitions = num_partitions;
 		mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
 		mtstate->mt_partition_tuple_slot = partition_tuple_slot;
+
+		/*
+		 * These are needed as reference objects for mapping partition
+		 * attnos in expressions such as WCO and RETURNING.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
+	}
+
+	/*
+	 * Construct a mapping from each of the resultRelInfo attnos to the root
+	 * attnos. This is required when, during update row movement, the tuple
+	 * descriptor of a source partition does not match the root partition
+	 * descriptor. In that case we need to convert tuples to the root
+	 * partition's tuple descriptor, because the search for the destination
+	 * partition starts from the root. Skip this setup if it's not a partition
+	 * key update or if there are no partitions below this partitioned table.
+	 */
+	if (is_partitionkey_update && mtstate->mt_num_partitions > 0)
+	{
+		TupleConversionMap **tup_conv_maps;
+		TupleDesc		outdesc;
+
+		mtstate->mt_resultrel_maps = (TupleConversionMap **)
+			palloc0(sizeof(TupleConversionMap *) * nplans);
+
+		/* Get tuple descriptor of the root partition. */
+		outdesc = RelationGetDescr(mtstate->mt_partition_dispatch_info[0]->reldesc);
+
+		resultRelInfo = mtstate->resultRelInfo;
+		tup_conv_maps = mtstate->mt_resultrel_maps;
+		for (i = 0; i < nplans; i++)
+		{
+			TupleDesc indesc = RelationGetDescr(resultRelInfo[i].ri_RelationDesc);
+			tup_conv_maps[i] = convert_tuples_by_name(indesc, outdesc,
+								 gettext_noop("could not convert row type"));
+		}
 	}
 
 	/*
@@ -1834,50 +2062,52 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE
+	 * row movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *firstWco;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition. Whereas for UPDATE, there are as many WCOs as there are
+		 * plans. So in either case, use the WCO expression of the first
+		 * resultRelInfo as a reference to calculate attno's for the WCO
+		 * expression of each of the partitions. We make a copy of the WCO qual
+		 * for each partition. Note that, if there are SubPlans in there, they
+		 * all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
+		Assert(is_partitionkey_update ||
+			   (operation == CMD_INSERT &&
 			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
-		resultRelInfo = mtstate->mt_partitions;
+			   mtstate->mt_nplans == 1));
+
+		firstWco = linitial(node->withCheckOptionLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-			List	   *mapped_wcoList;
+			Relation	partrel;
+			List	   *mappedWco;
 			List	   *wcoExprs = NIL;
 			ListCell   *ll;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel);
-			foreach(ll, mapped_wcoList)
+			resultRelInfo = mtstate->mt_partitions[i];
+
+			partrel = resultRelInfo->ri_RelationDesc;
+			mappedWco = map_partition_varattnos(firstWco, firstVarno,
+												partrel, firstResultRel);
+			foreach(ll, mappedWco)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
 
-			resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
+			resultRelInfo->ri_WithCheckOptions = mappedWco;
 			resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
-			resultRelInfo++;
 		}
 	}
 
@@ -1888,7 +2118,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -1925,20 +2155,25 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		resultRelInfo = mtstate->mt_partitions;
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
+			Relation	partrel;
 			List	   *rlist;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel);
+			resultRelInfo = mtstate->mt_partitions[i];
+			partrel = resultRelInfo->ri_RelationDesc;
+
+			/*
+			 * Use the returning expression of the first resultRelInfo as a
+			 * reference to calculate attno's for the returning expression of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList, firstVarno,
+											partrel, firstResultRel);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
@@ -2181,6 +2416,7 @@ void
 ExecEndModifyTable(ModifyTableState *node)
 {
 	int			i;
+	CmdType		operation = node->operation;
 
 	/*
 	 * Allow any FDWs to shut down
@@ -2213,7 +2449,17 @@ ExecEndModifyTable(ModifyTableState *node)
 	}
 	for (i = 0; i < node->mt_num_partitions; i++)
 	{
-		ResultRelInfo *resultRelInfo = node->mt_partitions + i;
+		ResultRelInfo *resultRelInfo = node->mt_partitions[i];
+
+		/*
+		 * If this result rel is one of the subplan result rels, let
+		 * ExecEndPlan() close it. For INSERTs, this does not apply because
+		 * all leaf partition result rels are newly allocated anyway.
+		 */
+		if (operation == CMD_UPDATE &&
+			resultRelInfo >= node->resultRelInfo &&
+			resultRelInfo < node->resultRelInfo + node->mt_nplans)
+			continue;
 
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
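To summarize the executor changes above: when ExecUpdate() finds that the new tuple violates the partition constraint, it performs ExecDelete() on the source partition and then ExecInsert() with tuple routing restarted from the root. The toy Python model below illustrates only that control flow; it deliberately ignores triggers, EPQ rechecks, and concurrency, and every name in it is invented for this sketch.

```python
# Partitions of a hypothetical table range-partitioned on column "c",
# mirroring part_c_1_100 / part_c_100_200 from the regression tests.
partitions = {
    "part_c_1_100":   {"pred": lambda r: 1 <= r["c"] < 100,   "rows": []},
    "part_c_100_200": {"pred": lambda r: 100 <= r["c"] < 200, "rows": []},
}

def route(row):
    """Find the partition accepting row, searching from the root
    (cf. ExecFindPartition); raise if none fits."""
    for name, p in partitions.items():
        if p["pred"](row):
            return name
    raise ValueError("no partition for row")

def exec_update(part, old, new):
    """Update old -> new inside partition 'part'; on a partition
    constraint violation, emulate DELETE plus a re-routed INSERT."""
    if partitions[part]["pred"](new):
        partitions[part]["rows"].remove(old)
        partitions[part]["rows"].append(new)
        return part
    # Row movement: delete from the source, insert through the root.
    partitions[part]["rows"].remove(old)
    dest = route(new)
    partitions[dest]["rows"].append(new)
    return dest

partitions["part_c_1_100"]["rows"].append({"c": 96})
print(exec_update("part_c_1_100", {"c": 96}, {"c": 116}))  # part_c_100_200
```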
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index f10879a..b1a60c2 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -79,8 +79,8 @@ extern void check_new_partition_bound(char *relname, Relation parent,
 extern Oid	get_partition_parent(Oid relid);
 extern List *get_qual_from_partbound(Relation rel, Relation parent,
 						PartitionBoundSpec *spec);
-extern List *map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent);
+extern List *map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
 
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index e25cfa3..ea4205d 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,7 +187,10 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
@@ -207,8 +210,10 @@ extern void EvalPlanQualSetTuple(EPQState *epqstate, Index rti,
 					 HeapTuple tuple);
 extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
 extern void ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   PartitionDispatch **pd,
-							   ResultRelInfo **partitions,
+							   ResultRelInfo ***partitions,
 							   TupleConversionMap ***tup_conv_maps,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions);
@@ -216,6 +221,8 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot, EState *estate);
 
 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
 extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 54c5cf5..3f3b732 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -959,9 +959,13 @@ typedef struct ModifyTableState
 	int			mt_num_dispatch;	/* Number of entries in the above array */
 	int			mt_num_partitions;	/* Number of members in the following
 									 * arrays */
-	ResultRelInfo *mt_partitions;	/* Per partition result relation */
-	TupleConversionMap **mt_partition_tupconv_maps;
+	ResultRelInfo **mt_partitions;	/* Per partition result relation pointers */
+
 	/* Per partition tuple conversion map */
+	TupleConversionMap **mt_partition_tupconv_maps;
+	/* Per resultRelInfo conversion map to convert tuples to root partition */
+	TupleConversionMap **mt_resultrel_maps;
+
 	TupleTableSlot *mt_partition_tuple_slot;
 } ModifyTableState;
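The mt_resultrel_maps field added above is needed because leaf partitions may order their columns differently from the root (the regression tests create part_c_1_100 as (b, c, a) under a root declared (a, b, c)). Below is a rough Python analogue, under that assumption, of what convert_tuples_by_name() does: match columns by name and permute the tuple accordingly. It is a sketch, not the server's implementation.

```python
def build_conv_map(indesc, outdesc):
    """For each output column, find its position in the input
    descriptor, matching by column name."""
    pos = {name: i for i, name in enumerate(indesc)}
    return [pos[name] for name in outdesc]

def convert_tuple(tup, conv_map):
    """Permute a tuple from the input layout into the output layout."""
    return tuple(tup[i] for i in conv_map)

# Leaf partition declared as (b, c, a); root declared as (a, b, c).
leaf_desc = ("b", "c", "a")
root_desc = ("a", "b", "c")
cmap = build_conv_map(leaf_desc, root_desc)
print(convert_tuple((12, 116, "b"), cmap))  # ('b', 12, 116)
```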
 
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 9366f04..f3c03a7 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,25 +198,189 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c int
 ) partition by range (a, b);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
 create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
 create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+create table part_c_1_100 (b int, c int, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+create table part_c_100_200 (c int, a text, b int);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
 insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
+insert into part_a_10_a_20 values ('a', 10, 200);
+insert into part_c_1_100 (a, b, c) values ('b', 12, 96);
+insert into part_c_1_100 (a, b, c) values ('b', 13, 97);
+insert into part_c_100_200 (a, b, c) values ('b', 15, 105);
+insert into part_c_100_200 (a, b, c) values ('b', 17, 105);
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96;
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (12, 116, b).
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+select a, b, c from part_c_1_100 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select a, b, c from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 12 | 116
+ b | 13 | 117
+ b | 15 | 125
+ b | 17 | 125
+(4 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
 ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+DETAIL:  Failing row contains (b, 7, 117).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+select tableoid::regclass partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_1_a_10  | a |  1 |    
+ part_a_1_a_10  | a |  4 | 200
+ part_b_1_b_10  | b |  7 | 117
+ part_b_1_b_10  | b |  9 | 125
+ part_c_100_200 | b | 11 | 125
+ part_c_100_200 | b | 12 | 116
+(6 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (120, b, 15).
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+select tableoid::regclass partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_1_a_10  | a |  1 |    
+ part_b_1_b_10  | b |  7 | 117
+ part_b_1_b_10  | b |  9 | 125
+ part_c_100_200 | b | 11 | 125
+ part_c_100_200 | b | 12 | 116
+ part_c_100_200 | b | 15 | 199
+(6 rows)
+
 -- cleanup
-drop table range_parted;
+drop view upview;
+drop table mintab, range_parted;
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+-- Setup
+--------
+create table list_parted (a int, b int, c int) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+create table sub_part1(b int, c int, a int);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int, a int);
+alter table sub_parted attach partition sub_part2 for values in (2);
+create table list_part1(a int, b int, c int);
+alter table list_parted attach partition list_part1 for values in (2,3);
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+ERROR:  new row for relation "sub_parted" violates partition constraint
+DETAIL:  Failing row contains (2, 2, 10).
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Triggers should not be allowed to initiate the update row movement
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This changes a partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+select tableoid::regclass , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+-- This should fail because trigger on sub_part1 would change column 'b' which
+-- would violate "b in (1)" constraint.
+update list_parted set c = 70 where b  = 1 ;
+ERROR:  new row for relation "sub_part1" violates partition constraint
+DETAIL:  Failing row contains (2, 70, 1).
+drop trigger parted_mod_b ON sub_part1 ;
+-- Now that the trigger is dropped, the same update should succeed
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop function func_parted_mod_b ( ) ;
+drop table list_parted;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 6637119..0113c7d 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,23 +107,128 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c int
 ) partition by range (a, b);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+
 create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
 create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
+
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
+create table part_c_1_100 (b int, c int, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+create table part_c_100_200 (c int, a text, b int);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+
+insert into part_a_1_a_10 values ('a', 1);
+insert into part_a_10_a_20 values ('a', 10, 200);
+insert into part_c_1_100 (a, b, c) values ('b', 12, 96);
+insert into part_c_1_100 (a, b, c) values ('b', 13, 97);
+insert into part_c_100_200 (a, b, c) values ('b', 15, 105);
+insert into part_c_100_200 (a, b, c) values ('b', 17, 105);
+
+-- fail (row movement happens only within the partition subtree):
+update part_c_1_100 set c = c + 20 where c = 96;
+-- No row found:
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+select a, b, c from part_c_1_100 order by 1, 2, 3;
+select a, b, c from part_c_100_200 order by 1, 2, 3;
+
+-- fail (row movement happens only within the partition subtree):
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+select tableoid::regclass partname, * from range_parted order by 1, 2, 3, 4;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement, check option passes
+update upview set a = 'b', b = 15 where b = 4;
+
+select tableoid::regclass partname, * from range_parted order by 1, 2, 3, 4;
 
 -- cleanup
-drop table range_parted;
+drop view upview;
+drop table mintab, range_parted;
+
+
+
+--------------
+-- UPDATE statements modifying the partition key or non-partition columns,
+-- with partitions that have different column orderings,
+-- and with triggers.
+--------------
+
+-- Setup
+--------
+create table list_parted (a int, b int, c int) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+
+create table sub_part1(b int, c int, a int);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int, a int);
+alter table sub_parted attach partition sub_part2 for values in (2);
+
+create table list_part1(a int, b int, c int);
+alter table list_parted attach partition list_part1 for values in (2,3);
+
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+
+-- Test partition constraint violation when an intermediate ancestor is used
+-- and the constraint is inherited from the upper root.
+update sub_parted set a = 2 where c = 10;
+
+-- UPDATE that does not modify the partition key of the partitions chosen for update.
+select tableoid::regclass, * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass, * from list_parted where a = 2 order by 1;
+
+
+-----------
+-- Triggers should not be allowed to initiate the update row movement
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This changes the partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+select tableoid::regclass, * from list_parted order by 1, 2, 3, 4;
+
+-- This should fail because the trigger on sub_part1 would change column 'b',
+-- which would violate the "b in (1)" constraint.
+update list_parted set c = 70 where b = 1;
+drop trigger parted_mod_b on sub_part1;
+-- Now that the trigger is dropped, the same update should succeed
+update list_parted set c = 70 where b = 1;
+select tableoid::regclass, * from list_parted order by 1, 2, 3, 4;
+
+drop function func_parted_mod_b();
+drop table list_parted;
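As an aside, the subtree restriction exercised by the failing updates above can be pictured with a toy model (plain Python, hypothetical names; nothing here is PostgreSQL code): routing for an UPDATE fired on a child table searches only the partitions under that child, so moving the row to a leaf outside that subtree errors out even though a suitable leaf exists elsewhere in the tree.

```python
# Toy model of subtree-restricted row movement. The partition tree and the
# constraint predicates mirror the list_parted test setup above; all names
# and the (a, b) predicates are simplifications, not actual catalog state.

tree = {
    "list_parted": ["sub_parted", "list_part1"],
    "sub_parted": ["sub_part1", "sub_part2"],
}
accepts = {  # partition constraints, reduced to predicates on (a, b)
    "sub_part1": lambda a, b: a == 1 and b == 1,
    "sub_part2": lambda a, b: a == 1 and b == 2,
    "list_part1": lambda a, b: a in (2, 3),
}

def route(root, a, b):
    """Return a leaf under 'root' that accepts the row, else None."""
    if root in accepts:                   # leaf partition
        return root if accepts[root](a, b) else None
    for child in tree.get(root, []):      # partitioned table: recurse
        leaf = route(child, a, b)
        if leaf is not None:
            return leaf
    return None

print(route("list_parted", 2, 5))  # list_part1: UPDATE via the root succeeds
print(route("sub_parted", 2, 5))   # None: via the child subtree, it errors out
```

The second call models "update sub_parted set a = 2" failing: the row would fit in list_part1, but list_part1 is outside the subtree the UPDATE was fired on.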
#111Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#108)
Re: UPDATE of partition key

On 22 June 2017 at 01:57, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jun 21, 2017 at 1:38 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Yep, it's more appropriate to use
ModifyTableState->rootResultRelationInfo->ri_RelationDesc somehow. That
is, if the answer to the question I raised above is positive.

From what I had checked earlier when coding that part,
rootResultRelInfo is NULL in case of inserts, unless something has
changed in later commits. That's the reason I decided to use the first
resultRelInfo.

We're just going around in circles here. Saying that you decided to
use the first child's resultRelInfo because you didn't have a
resultRelInfo for the parent is an explanation of why you wrote the
code the way you did, but that doesn't make it correct. I want to
know why you think it's correct.

Yeah, that was just an FYI on how I decided to use the first
resultRelInfo; it was not for explaining why using first resultRelInfo
is correct. So upthread, I have tried to explain.

I think it's probably wrong, because it seems to me that if the INSERT
code needs to use the parent's ResultRelInfo rather than the first
child's ResultRelInfo, the UPDATE code probably needs to do the same.
Commit d3cc37f1d801a6b5cad9bf179274a8d767f1ee50 got rid of
resultRelInfos for non-leaf partitions, and commit
e180c8aa8caf5c55a273d4a8e6092e77ff3cff10 added the resultRelInfo back
for the topmost parent, because otherwise it didn't work correctly.

Regarding rootResultRelInfo, it would have been good if
rootResultRelInfo were set for both insert and update, but it isn't set
for inserts.

For inserts :
In ExecInitModifyTable(), ModifyTableState->rootResultRelInfo remains
NULL because ModifyTable->rootResultRelIndex is = -1 :
/* If modifying a partitioned table, initialize the root table info */
if (node->rootResultRelIndex >= 0)
mtstate->rootResultRelInfo = estate->es_root_result_relations +
node->rootResultRelIndex;

ModifyTable->rootResultRelIndex = -1 because it does not get set since
ModifyTable->partitioned_rels is NULL :

/*
* If the main target relation is a partitioned table, the
* following list contains the RT indexes of partitioned child
* relations including the root, which are not included in the
* above list. We also keep RT indexes of the roots
* separately to be identified as such during the executor
* initialization.
*/
if (splan->partitioned_rels != NIL)
{
root->glob->nonleafResultRelations =
list_concat(root->glob->nonleafResultRelations,
list_copy(splan->partitioned_rels));
/* Remember where this root will be in the global list. */
splan->rootResultRelIndex = list_length(root->glob->rootResultRelations);
root->glob->rootResultRelations =
lappend_int(root->glob->rootResultRelations,
linitial_int(splan->partitioned_rels));
}

ModifyTable->partitioned_rels is NULL because inheritance_planner()
does not get called for INSERTs; instead, grouping_planner() gets
called :

subquery_planner()
{
/*
* Do the main planning. If we have an inherited target relation, that
* needs special processing, else go straight to grouping_planner.
*/
if (parse->resultRelation && rt_fetch(parse->resultRelation,
parse->rtable)->inh)
inheritance_planner(root);
else
grouping_planner(root, false, tuple_fraction);

}

Above, inh is false in case of inserts.
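The chain above can be condensed into a toy model (plain Python with made-up names, not the actual planner or executor code) showing why rootResultRelInfo comes out NULL for INSERT but gets set for an inherited UPDATE/DELETE:

```python
# Each step mirrors one stage quoted above: inheritance_planner() populates
# partitioned_rels only for inherited UPDATE/DELETE targets; set_plan_refs()
# assigns rootResultRelIndex only when partitioned_rels is non-empty; and
# ExecInitModifyTable() sets rootResultRelInfo only when that index is >= 0.

def root_result_rel_info(command, inh):
    partitioned_rels = []
    if inh and command in ("UPDATE", "DELETE"):
        partitioned_rels = ["root RT index"]   # set by inheritance_planner()
    root_index = 0 if partitioned_rels else -1
    # ExecInitModifyTable(): only set when rootResultRelIndex >= 0
    return "root ResultRelInfo" if root_index >= 0 else None

print(root_result_rel_info("INSERT", inh=False))  # None
print(root_result_rel_info("UPDATE", inh=True))   # root ResultRelInfo
```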

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#112Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Khandekar (#110)
Re: UPDATE of partition key

Hi Amit,

On 2017/06/28 20:43, Amit Khandekar wrote:

In attached patch v12

The patch no longer applies and fails to compile after the following
commit was made yesterday:

commit 501ed02cf6f4f60c3357775eb07578aebc912d3a
Author: Andrew Gierth <rhodiumtoad@postgresql.org>
Date: Wed Jun 28 18:55:03 2017 +0100

Fix transition tables for partition/inheritance.

Thanks,
Amit


#113Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Langote (#112)
1 attachment(s)
Re: UPDATE of partition key

On 29 June 2017 at 07:42, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

Hi Amit,

On 2017/06/28 20:43, Amit Khandekar wrote:

In attached patch v12

The patch no longer applies and fails to compile after the following
commit was made yesterday:

commit 501ed02cf6f4f60c3357775eb07578aebc912d3a
Author: Andrew Gierth <rhodiumtoad@postgresql.org>
Date: Wed Jun 28 18:55:03 2017 +0100

Fix transition tables for partition/inheritance.

Thanks for informing Amit.

As Thomas mentioned upthread, the above commit already uses a tuple
conversion mapping from leaf partition to root partitioned table
(mt_transition_tupconv_maps), which serves the same purpose as that of
the mapping used in the update-partition-key patch during update tuple
routing (mt_resultrel_maps).

We need to try to merge these two into a general-purpose mapping array
such as mt_leaf_root_maps. I haven't done that in the rebased patch
(attached), so currently it has both these mapping fields.

For transition tables, this map is per-leaf-partition in the case of
inserts, whereas it is per-subplan result rel for updates. For update
tuple routing, the mapping is required to be per-subplan. Now, for
update row movement in the presence of transition tables, we would
require both the per-subplan mapping and the per-leaf-partition
mapping, which can't be done with a single mapping field, unless
we have some way to identify which of the per-leaf-partition mapping
elements belong to per-subplan rels.

So, it's not immediately possible to merge them.
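To picture the mismatch, here is an illustrative sketch (plain Python; the oids and map labels are made up, and a TupleConversionMap is reduced to a string): a merged per-leaf map array would only be usable for updates together with a subplan-to-leaf index translation, which is exactly the piece that is currently missing.

```python
# Transition capture wants one conversion map per leaf partition (INSERT),
# while update row movement wants one map per subplan result rel. A single
# per-leaf array could serve both only if each subplan could find its own
# entry, i.e. with an index translation like subplan_to_leaf below.

leaf_oids = [101, 102, 103, 104]   # every leaf partition, in canonical order
subplan_oids = [102, 104]          # leaves actually targeted by UPDATE subplans

# one conversion map per leaf (leaf rowtype -> root rowtype)
leaf_maps = [f"map_{oid}_to_root" for oid in leaf_oids]

# the missing piece: which per-leaf entry belongs to which subplan
subplan_to_leaf = [leaf_oids.index(oid) for oid in subplan_oids]

for subplan_no, leaf_pos in enumerate(subplan_to_leaf):
    print(f"subplan {subplan_no} uses {leaf_maps[leaf_pos]}")
```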

Attachments:

update-partition-key_v12_rebased.patch (application/octet-stream)
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index b05a9c2..5a436a1 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -2993,6 +2993,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition, one whose partition constraint the new row satisfies.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3285,9 +3290,20 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</> causes a row to move from one partition to
+       another, there is a chance that another concurrent <command>UPDATE</> or
+       <command>DELETE</> misses this row. Suppose that, during the row
+       movement, the row is still visible to the concurrent session, which is
+       about to do an <command>UPDATE</> or <command>DELETE</> operation on the
+       same row. This DML operation can silently miss the row if the first
+       session meanwhile deletes it from the partition as part of its
+       <command>UPDATE</> row movement. In such a case, the concurrent
+       <command>UPDATE</>/<command>DELETE</>, being unaware of the row
+       movement, concludes that the row has just been deleted, so there is
+       nothing to be done for this row. Whereas, in the usual case where the
+       table is not partitioned, or where there is no row movement, the second
+       session would have identified the newly updated row and carried out the
+       <command>UPDATE</>/<command>DELETE</> on this new row version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 8a1619f..28cfc1a 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there is no such partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</> and
+   <command>INSERT</> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</> or <command>DELETE</> on the same row may miss
+   this row. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index 950245d..72300a0 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -160,6 +160,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it will be performed as a
+    <command>DELETE</command> from the original partition followed by
+    <command>INSERT</command> into the new partition. In this case, all
+    row-level <literal>BEFORE</> <command>UPDATE</command> triggers and all
+    row-level <literal>BEFORE</> <command>DELETE</command> triggers are fired
+    on the original partition. Then all row-level <literal>BEFORE</>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</>
+    triggers are concerned, <literal>AFTER</> <command>DELETE</command> and
+    <literal>AFTER</> <command>INSERT</command> triggers are applied; but
+    <literal>AFTER</> <command>UPDATE</command> triggers are not applied
+    because the <command>UPDATE</command> has been converted to a
+    <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index f8c55b1..c9f5dd6 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -921,7 +921,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
 
 /*
  * map_partition_varattnos - maps varattno of any Vars in expr from the
- * parent attno to partition attno.
+ * attnos of the 'from_rel' partition to the attnos of the 'to_rel' partition.
+ * Each rel can be either a leaf partition or a partitioned table.
  *
  * We must allow for cases where physical attnos of a partition can be
  * different from the parent's.
@@ -931,8 +932,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
  * are working on Lists, so it's less messy to do the casts internally.
  */
 List *
-map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent)
+map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel)
 {
 	AttrNumber *part_attnos;
 	bool		found_whole_row;
@@ -940,13 +941,13 @@ map_partition_varattnos(List *expr, int target_varno,
 	if (expr == NIL)
 		return NIL;
 
-	part_attnos = convert_tuples_by_name_map(RelationGetDescr(partrel),
-											 RelationGetDescr(parent),
+	part_attnos = convert_tuples_by_name_map(RelationGetDescr(to_rel),
+											 RelationGetDescr(from_rel),
 											 gettext_noop("could not convert row type"));
 	expr = (List *) map_variable_attnos((Node *) expr,
-										target_varno, 0,
+										fromrel_varno, 0,
 										part_attnos,
-										RelationGetDescr(parent)->natts,
+										RelationGetDescr(from_rel)->natts,
 										&found_whole_row);
 	/* There can never be a whole-row reference here */
 	if (found_whole_row)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index f391828..2706af2 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -168,7 +168,7 @@ typedef struct CopyStateData
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;	/* Number of entries in the above array */
 	int			num_partitions; /* Number of members in the following arrays */
-	ResultRelInfo *partitions;	/* Per partition result relation */
+	ResultRelInfo **partitions;	/* Per partition result relation pointers */
 	TupleConversionMap **partition_tupconv_maps;
 	TupleTableSlot *partition_tuple_slot;
 	TransitionCaptureState *transition_capture;
@@ -1426,13 +1426,13 @@ BeginCopy(ParseState *pstate,
 		if (is_from && rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 		{
 			PartitionDispatch *partition_dispatch_info;
-			ResultRelInfo *partitions;
+			ResultRelInfo **partitions;
 			TupleConversionMap **partition_tupconv_maps;
 			TupleTableSlot *partition_tuple_slot;
 			int			num_parted,
 						num_partitions;
 
-			ExecSetupPartitionTupleRouting(rel,
+			ExecSetupPartitionTupleRouting(rel, NULL, 0,
 										   &partition_dispatch_info,
 										   &partitions,
 										   &partition_tupconv_maps,
@@ -1461,7 +1461,7 @@ BeginCopy(ParseState *pstate,
 				for (i = 0; i < cstate->num_partitions; ++i)
 				{
 					cstate->transition_tupconv_maps[i] =
-						convert_tuples_by_name(RelationGetDescr(cstate->partitions[i].ri_RelationDesc),
+						convert_tuples_by_name(RelationGetDescr(cstate->partitions[i]->ri_RelationDesc),
 											   RelationGetDescr(rel),
 											   gettext_noop("could not convert row type"));
 				}
@@ -2608,7 +2608,7 @@ CopyFrom(CopyState cstate)
 			 * to the selected partition.
 			 */
 			saved_resultRelInfo = resultRelInfo;
-			resultRelInfo = cstate->partitions + leaf_part_index;
+			resultRelInfo = cstate->partitions[leaf_part_index];
 
 			/* We do not yet have a way to insert into a foreign partition */
 			if (resultRelInfo->ri_FdwRoutine)
@@ -2717,7 +2717,7 @@ CopyFrom(CopyState cstate)
 
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr || check_partition_constr)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
@@ -2837,7 +2837,7 @@ CopyFrom(CopyState cstate)
 		}
 		for (i = 0; i < cstate->num_partitions; i++)
 		{
-			ResultRelInfo *resultRelInfo = cstate->partitions + i;
+			ResultRelInfo *resultRelInfo = cstate->partitions[i];
 
 			ExecCloseIndices(resultRelInfo);
 			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 0f08283..e448d18 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -64,6 +64,18 @@
 #include "utils/snapmgr.h"
 #include "utils/tqual.h"
 
+/*
+ * Entry of a temporary hash table. During UPDATE tuple routing, we want to
+ * know which of the leaf partitions are present in the UPDATE per-subplan
+ * resultRelInfo array (ModifyTableState->resultRelInfo[]). This hash table
+ * is searchable by the oids of the subplan result rels.
+ */
+typedef struct ResultRelOidsEntry
+{
+	Oid			rel_oid;
+	ResultRelInfo *resultRelInfo;
+} ResultRelOidsEntry;
+
 
 /* Hooks for plugins to get control in ExecutorStart/Run/Finish/End */
 ExecutorStart_hook_type ExecutorStart_hook = NULL;
@@ -103,8 +115,6 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
 									 int maxfieldlen);
 static void EvalPlanQualStart(EPQState *epqstate, EState *parentestate,
 				  Plan *planTree);
-static void ExecPartitionCheck(ResultRelInfo *resultRelInfo,
-				   TupleTableSlot *slot, EState *estate);
 
 /*
  * Note that GetUpdatedColumns() also exists in commands/trigger.c.  There does
@@ -1823,15 +1833,10 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
 /*
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  */
-static void
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
-	Relation	rel = resultRelInfo->ri_RelationDesc;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	Bitmapset  *modifiedCols;
-	Bitmapset  *insertedCols;
-	Bitmapset  *updatedCols;
 	ExprContext *econtext;
 
 	/*
@@ -1859,51 +1864,65 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
 	 */
-	if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+	return ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext);
+}
+
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
+	/* See the comments in ExecConstraints. */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
+									 gettext_noop("could not convert row type"));
+		if (map != NULL)
 		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
-										 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+			tuple = do_convert_tuple(tuple, map);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("new row for relation \"%s\" violates partition constraint",
-						RelationGetRelationName(orig_rel)),
-				 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+			 errmsg("new row for relation \"%s\" violates partition constraint",
+					RelationGetRelationName(orig_rel)),
+			 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if requested,
+ * checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1911,7 +1930,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2024,8 +2044,9 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 }
 
 
@@ -3190,10 +3211,14 @@ EvalPlanQualEnd(EPQState *epqstate)
  * ExecSetupPartitionTupleRouting - set up information needed during
  * tuple routing for partitioned tables
  *
+ * 'update_rri' has the UPDATE per-subplan result rels.
+ * 'num_update_rri' : number of UPDATE per-subplan result rels. For INSERT,
+ *      this is 0.
+ *
  * Output arguments:
  * 'pd' receives an array of PartitionDispatch objects with one entry for
  *		every partitioned table in the partition tree
- * 'partitions' receives an array of ResultRelInfo objects with one entry for
+ * 'partitions' receives an array of ResultRelInfo* objects with one entry for
  *		every leaf partition in the partition tree
  * 'tup_conv_maps' receives an array of TupleConversionMap objects with one
  *		entry for every leaf partition (required to convert input tuple based
@@ -3213,8 +3238,10 @@ EvalPlanQualEnd(EPQState *epqstate)
  */
 void
 ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   PartitionDispatch **pd,
-							   ResultRelInfo **partitions,
+							   ResultRelInfo ***partitions,
 							   TupleConversionMap ***tup_conv_maps,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions)
@@ -3223,18 +3250,60 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
+	HTAB	   *result_rel_oids = NULL;
+	HASHCTL		ctl;
+	ResultRelOidsEntry *hash_entry;
+	ResultRelInfo *leaf_part_arr;
 
 	/* Get the tuple-routing information and lock partitions */
 	*pd = RelationGetPartitionDispatchInfo(rel, RowExclusiveLock, num_parted,
 										   &leaf_parts);
 	*num_partitions = list_length(leaf_parts);
-	*partitions = (ResultRelInfo *) palloc(*num_partitions *
-										   sizeof(ResultRelInfo));
+	*partitions = (ResultRelInfo **) palloc(*num_partitions *
+										   sizeof(ResultRelInfo*));
 	*tup_conv_maps = (TupleConversionMap **) palloc0(*num_partitions *
 													 sizeof(TupleConversionMap *));
 
 	/*
+	 * For UPDATEs, if a leaf partition is already present in the per-subplan
+	 * result rels, we re-use it rather than initializing a new result rel. So,
+	 * to find whether a given leaf partition already has a result rel, we
+	 * build a hash table for searching the leaf partitions by oid.
+	 */
+	if (num_update_rri != 0)
+	{
+		ResultRelInfo	   *resultRelInfo;
+
+		memset(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(Oid);
+		ctl.entrysize = sizeof(ResultRelOidsEntry);
+		ctl.hcxt = CurrentMemoryContext;
+		result_rel_oids = hash_create("result_rel_oids temporary hash",
+								32, /* start small and extend */
+								&ctl,
+								HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+		resultRelInfo = update_rri;
+		for (i = 0; i < num_update_rri; i++, resultRelInfo++)
+		{
+			Oid reloid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+
+			hash_entry = hash_search(result_rel_oids, &reloid,
+									 HASH_ENTER, NULL);
+			hash_entry->resultRelInfo = resultRelInfo;
+		}
+	}
+	else
+	{
+		/*
+		 * For inserts, we need to create all new result rels, so avoid repeated
+		 * pallocs by allocating memory for all the result rels in bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(*num_partitions *
+												  sizeof(ResultRelInfo));
+	}
+
+	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
 	 * given partition's rowtype.  It is attached to the caller-specified node
 	 * (such as ModifyTableState) and released when the node finishes
@@ -3242,23 +3311,65 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	 */
 	*partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = *partitions;
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (num_update_rri != 0)
+		{
+			/*
+			 * If this leaf partition is already present in the per-subplan
+			 * resultRelInfos, re-use that resultRelInfo along with its
+			 * already-opened relation; otherwise create a new result rel.
+			 */
+			hash_entry = hash_search(result_rel_oids, &leaf_oid,
+									 HASH_FIND, NULL);
+			if (hash_entry != NULL)
+			{
+				leaf_part_rri = hash_entry->resultRelInfo;
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required when converting the tuple as per the root
+				 * partition's tuple descriptor. It was not set when the
+				 * update plans were generated.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = leaf_part_arr + i;
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in *partitions are eventually
-		 * closed by the caller.
+		 * If we didn't open the partition rel, it means we haven't initialized
+		 * the result rel either.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above including the leaf partitions.
+			 * Note that each of the newly opened relations in *partitions is
+			 * eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri, partrel, 1 /* dummy */, rel, 0);
+		}
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
-		 * Verify result relation is a valid target for the current operation.
+		 * Verify the result relation is a valid target for an insert
+		 * operation.  Even for updates, we do this because tuple routing
+		 * effectively inserts the routed tuple into the new partition.
 		 */
 		CheckValidResultRel(partrel, CMD_INSERT);
 
@@ -3269,12 +3380,6 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
 													 gettext_noop("could not convert row type"));
 
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  1,	/* dummy */
-						  rel,
-						  0);
-
 		/*
 		 * Open partition indices (remember we do not support ON CONFLICT in
 		 * case of partitioned tables, so we do not need support information
@@ -3284,9 +3389,12 @@ ExecSetupPartitionTupleRouting(Relation rel,
 			leaf_part_rri->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(leaf_part_rri, false);
 
-		leaf_part_rri++;
+		(*partitions)[i] = leaf_part_rri;
 		i++;
 	}
+
+	if (result_rel_oids != NULL)
+		hash_destroy(result_rel_oids);
 }
 
 /*
@@ -3312,8 +3420,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple it if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 	result = get_partition_for_tuple(pd, slot, estate,
 									 &failed_at, &failed_slot);
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index bc53d07..eca60f2 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -402,7 +402,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -467,7 +467,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 8d17425..51931f4 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -45,6 +45,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -53,6 +54,8 @@
 #include "utils/rel.h"
 #include "utils/tqual.h"
 
+#define GetUpdatedColumns(relinfo, estate) \
+	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)
 
 static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 ResultRelInfo *resultRelInfo,
@@ -239,6 +242,34 @@ ExecCheckTIDVisible(EState *estate,
 	ReleaseBuffer(buffer);
 }
 
+/*
+ * ConvertPartitionTupleSlot -- convenience function for converting a tuple
+ * and storing it in a dedicated partition tuple slot. Passes the partition
+ * tuple slot back via the output param p_slot. If no mapping is present,
+ * leaves p_slot unchanged.
+ *
+ * Returns the converted tuple.
+ */
+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate, TupleConversionMap *map,
+						  HeapTuple tuple, TupleTableSlot **p_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor, as per converted tuple.
+	 */
+	*p_slot = mtstate->mt_partition_tuple_slot;
+	Assert(*p_slot != NULL);
+	ExecSetSlotDescriptor(*p_slot, map->outdesc);
+	ExecStoreTuple(tuple, *p_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
 /* ----------------------------------------------------------------
  *		ExecInsert
  *
@@ -280,7 +311,38 @@ ExecInsert(ModifyTableState *mtstate,
 	if (mtstate->mt_partition_dispatch_info)
 	{
 		int			leaf_part_index;
-		TupleConversionMap *map;
+		ResultRelInfo *rootResultRelInfo;
+
+		/*
+		 * If the original operation is UPDATE, the root partition rel needs to
+		 * be fetched from mtstate->rootResultRelInfo.
+		 */
+		rootResultRelInfo = (mtstate->rootResultRelInfo ?
+							 mtstate->rootResultRelInfo : resultRelInfo);
+
+		/*
+		 * If the resultRelInfo is not the root partition (which happens for
+		 * UPDATE), we should convert the tuple into the root partition's
+		 * tuple descriptor, since ExecFindPartition() starts the search from
+		 * the root.  The tuple conversion map list is in the order of
+		 * mtstate->resultRelInfo[], so to retrieve the one for this
+		 * resultRel, we need to know the position of the resultRel in
+		 * mtstate->resultRelInfo[]. Note: We assume that if the resultRelInfo
+		 * does not belong to the subplans, then it already matches the root
+		 * tuple descriptor; although no scenario is currently known where
+		 * this could happen.
+		 */
+		if (rootResultRelInfo != resultRelInfo &&
+			mtstate->mt_resultrel_maps != NULL &&
+			resultRelInfo >= mtstate->resultRelInfo &&
+			resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans-1)
+		{
+			int		map_index = resultRelInfo - mtstate->resultRelInfo;
+
+			tuple = ConvertPartitionTupleSlot(mtstate,
+									  mtstate->mt_resultrel_maps[map_index],
+									  tuple, &slot);
+		}
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
@@ -290,7 +352,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
-		leaf_part_index = ExecFindPartition(resultRelInfo,
+		leaf_part_index = ExecFindPartition(rootResultRelInfo,
 											mtstate->mt_partition_dispatch_info,
 											slot,
 											estate);
@@ -302,7 +364,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * the selected partition.
 		 */
 		saved_resultRelInfo = resultRelInfo;
-		resultRelInfo = mtstate->mt_partitions + leaf_part_index;
+		resultRelInfo = mtstate->mt_partitions[leaf_part_index];
 
 		/* We do not yet have a way to insert into a foreign partition */
 		if (resultRelInfo->ri_FdwRoutine)
@@ -347,23 +409,9 @@ ExecInsert(ModifyTableState *mtstate,
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = mtstate->mt_partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = mtstate->mt_partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(mtstate,
+						mtstate->mt_partition_tupconv_maps[leaf_part_index],
+						tuple, &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -481,7 +529,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 		/* Check the constraints of the tuple */
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -673,6 +721,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool   *concurrently_deleted,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -681,6 +731,9 @@ ExecDelete(ModifyTableState *mtstate,
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
 
+	if (concurrently_deleted)
+		*concurrently_deleted = false;
+
 	/*
 	 * get information on the (current) result relation
 	 */
@@ -824,6 +877,8 @@ ldelete:;
 					}
 				}
 				/* tuple already deleted; nothing to do */
+				if (concurrently_deleted)
+					*concurrently_deleted = true;
 				return NULL;
 
 			default:
@@ -848,8 +903,8 @@ ldelete:;
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
 						 mtstate->mt_transition_capture);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -942,6 +997,8 @@ ExecUpdate(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	bool		partition_check_passed = true;
+	bool		has_br_trigger;
 
 	/*
 	 * abort the operation if not running transactions
@@ -962,16 +1019,56 @@ ExecUpdate(ModifyTableState *mtstate,
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
 
 	/* BEFORE ROW UPDATE Triggers */
-	if (resultRelInfo->ri_TrigDesc &&
-		resultRelInfo->ri_TrigDesc->trig_update_before_row)
+	has_br_trigger = (resultRelInfo->ri_TrigDesc &&
+					  resultRelInfo->ri_TrigDesc->trig_update_before_row);
+
+	if (has_br_trigger)
 	{
-		slot = ExecBRUpdateTriggers(estate, epqstate, resultRelInfo,
-									tupleid, oldtuple, slot);
+		TupleTableSlot *trig_slot;
 
-		if (slot == NULL)		/* "do nothing" */
+		trig_slot = ExecBRUpdateTriggers(estate, epqstate, resultRelInfo,
+										 tupleid, oldtuple, slot);
+
+		if (trig_slot == NULL)		/* "do nothing" */
 			return NULL;
 
+		if (resultRelInfo->ri_PartitionCheck)
+		{
+			bool		partition_check_passed_with_trig_tuple;
+
+			partition_check_passed =
+				(resultRelInfo->ri_PartitionCheck &&
+				 ExecPartitionCheck(resultRelInfo, slot, estate));
+
+			partition_check_passed_with_trig_tuple =
+				(resultRelInfo->ri_PartitionCheck &&
+				 ExecPartitionCheck(resultRelInfo, trig_slot, estate));
+
+			if (partition_check_passed)
+			{
+				/*
+				 * If it's the trigger that is causing partition constraint
+				 * violation, abort. We don't want a trigger to cause tuple
+				 * routing.
+				 */
+				if (!partition_check_passed_with_trig_tuple)
+					ExecPartitionCheckEmitError(resultRelInfo,
+												trig_slot, estate);
+			}
+			else
+			{
+				/*
+				 * Partition constraint failed with original NEW tuple. But the
+				 * trigger might even have modified the tuple such that it fits
+				 * back into the partition. So partition constraint check
+				 * should be based on *final* NEW tuple.
+				 */
+				partition_check_passed = partition_check_passed_with_trig_tuple;
+			}
+		}
+
 		/* trigger might have changed tuple */
+		slot = trig_slot;
 		tuple = ExecMaterializeSlot(slot);
 	}
 
@@ -1038,12 +1135,60 @@ lreplace:;
 								 resultRelInfo, slot, estate);
 
 		/*
+		 * If the partition check fails, try to move the row into the right
+		 * partition. With a BR trigger, the tuple has already gone through
+		 * EPQ and has been locked, so it won't change again; hence, avoid an
+		 * extra partition check if we already did it above in the presence
+		 * of BR triggers.
+		 */
+		if (!has_br_trigger)
+		{
+			partition_check_passed =
+				(!resultRelInfo->ri_PartitionCheck ||
+				ExecPartitionCheck(resultRelInfo, slot, estate));
+		}
+
+		if (!partition_check_passed)
+		{
+			bool	concurrently_deleted;
+
+			/*
+			 * When an UPDATE is run directly on a leaf partition, we do not
+			 * have partition tuple routing set up. In that case, fail with a
+			 * partition constraint violation error.
+			 */
+			if (mtstate->mt_partition_dispatch_info == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for the DELETE; we want the
+			 * RETURNING rows to come from the INSERT.
+			 */
+			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
+					   &concurrently_deleted, false, false);
+
+			/*
+			 * The row was already deleted by a concurrent DELETE. So we don't
+			 * have anything to update.
+			 */
+			if (concurrently_deleted)
+				return NULL;
+
+			return ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+		}
+
+		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We have already checked partition constraints above, so skip them
+		 * below.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1462,6 +1607,36 @@ fireASTriggers(ModifyTableState *node)
 }
 
 /*
+ * Check whether the partition key is modified in any of the result relations.
+ */
+static bool
+IsPartitionKeyUpdate(EState *estate, ResultRelInfo *result_rels, int num_rels)
+{
+	int		i;
+
+	/*
+	 * Each of the result relations has its updated-columns set stored
+	 * according to its own column ordering. So we need to pull the attnos
+	 * out of each relation's partition quals, and check whether any of the
+	 * updated column attributes appear among those vars.
+	 */
+	for (i = 0; i < num_rels; i++)
+	{
+		ResultRelInfo *resultRelInfo = &result_rels[i];
+		Relation		rel = resultRelInfo->ri_RelationDesc;
+		Bitmapset	  *expr_attrs = NULL;
+
+		pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs);
+
+		/* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */
+		if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate)))
+			return true;
+	}
+
+	return false;
+}
+
+/*
  * Set up the state needed for collecting transition tuples for AFTER
  * triggers.
  */
@@ -1482,23 +1657,22 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	 */
 	if (mtstate->mt_transition_capture != NULL)
 	{
-		ResultRelInfo *resultRelInfos;
+		ResultRelInfo *resultRelInfo;
 		int		numResultRelInfos;
+		bool	tuple_routing = (mtstate->mt_partition_dispatch_info != NULL);
 
 		/* Find the set of partitions so that we can find their TupleDescs. */
-		if (mtstate->mt_partition_dispatch_info != NULL)
+		if (tuple_routing)
 		{
 			/*
 			 * For INSERT via partitioned table, so we need TupleDescs based
 			 * on the partition routing table.
 			 */
-			resultRelInfos = mtstate->mt_partitions;
 			numResultRelInfos = mtstate->mt_num_partitions;
 		}
 		else
 		{
 			/* Otherwise we need the ResultRelInfo for each subplan. */
-			resultRelInfos = mtstate->resultRelInfo;
 			numResultRelInfos = mtstate->mt_nplans;
 		}
 
@@ -1512,8 +1686,15 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
 		for (i = 0; i < numResultRelInfos; ++i)
 		{
+			/*
+			 * As stated above, the mapping source differs depending on
+			 * whether the operation is an INSERT.
+			 */
+			resultRelInfo = (tuple_routing ?
+					mtstate->mt_partitions[i] : &mtstate->resultRelInfo[i]);
+
 			mtstate->mt_transition_tupconv_maps[i] =
-				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
+				convert_tuples_by_name(RelationGetDescr(resultRelInfo->ri_RelationDesc),
 									   RelationGetDescr(targetRelInfo->ri_RelationDesc),
 									   gettext_noop("could not convert row type"));
 		}
@@ -1746,7 +1927,8 @@ ExecModifyTable(ModifyTableState *node)
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1786,11 +1968,14 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 {
 	ModifyTableState *mtstate;
 	CmdType		operation = node->operation;
+	bool		is_partitionkey_update = false;
 	int			nplans = list_length(node->plans);
 	ResultRelInfo *saved_resultRelInfo;
 	ResultRelInfo *resultRelInfo;
 	TupleDesc	tupDesc;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
@@ -1902,18 +2087,30 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/* Remember whether this is going to be an update of the partition key. */
+	is_partitionkey_update =
+				(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+				operation == CMD_UPDATE &&
+				IsPartitionKeyUpdate(estate, mtstate->resultRelInfo, nplans));
+
+	/*
+	 * Build state for tuple routing if it's an INSERT or if it's an UPDATE of
+	 * the partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || is_partitionkey_update))
 	{
 		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo *partitions;
+		ResultRelInfo **partitions;
 		TupleConversionMap **partition_tupconv_maps;
 		TupleTableSlot *partition_tuple_slot;
 		int			num_parted,
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(rel,
+									   (operation == CMD_UPDATE ?
+											mtstate->resultRelInfo : NULL),
+									   (operation == CMD_UPDATE ? nplans : 0),
 									   &partition_dispatch_info,
 									   &partitions,
 									   &partition_tupconv_maps,
@@ -1925,6 +2122,43 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		mtstate->mt_num_partitions = num_partitions;
 		mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
 		mtstate->mt_partition_tuple_slot = partition_tuple_slot;
+
+		/*
+		 * The following are required as reference objects for mapping
+		 * partition attnos in expressions such as WCO and RETURNING.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
+	}
+
+	/*
+	 * Construct mapping from each of the resultRelInfo attnos to the root
+	 * attno. This is required during UPDATE row movement, when the tuple
+	 * descriptor of a source partition does not match the root partition's
+	 * descriptor. In such a case we need to convert tuples to the root
+	 * partition's tuple descriptor, because the search for the destination
+	 * partition starts from the root. Skip this setup if it's not a
+	 * partition key update or there are no partitions below this table.
+	 */
+	if (is_partitionkey_update && mtstate->mt_num_partitions > 0)
+	{
+		TupleConversionMap **tup_conv_maps;
+		TupleDesc		outdesc;
+
+		mtstate->mt_resultrel_maps =
+		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap*) * nplans);
+
+		/* Get tuple descriptor of the root partition. */
+		outdesc = RelationGetDescr(mtstate->mt_partition_dispatch_info[0]->reldesc);
+
+		resultRelInfo = mtstate->resultRelInfo;
+		tup_conv_maps = mtstate->mt_resultrel_maps;
+		for (i = 0; i < nplans; i++)
+		{
+			TupleDesc indesc = RelationGetDescr(resultRelInfo[i].ri_RelationDesc);
+			tup_conv_maps[i] = convert_tuples_by_name(indesc, outdesc,
+								 gettext_noop("could not convert row type"));
+		}
 	}
 
 	/* Build state for collecting transition tuples */
@@ -1960,50 +2194,52 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE
+	 * row movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *firstWco;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition. Whereas for UPDATE, there are as many WCOs as there are
+		 * plans. So in either case, use the WCO expression of the first
+		 * resultRelInfo as a reference to calculate attno's for the WCO
+		 * expression of each of the partitions. We make a copy of the WCO qual
+		 * for each partition. Note that, if there are SubPlans in there, they
+		 * all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
+		Assert(is_partitionkey_update ||
+			   (operation == CMD_INSERT &&
 			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
-		resultRelInfo = mtstate->mt_partitions;
+			   mtstate->mt_nplans == 1));
+
+		firstWco = linitial(node->withCheckOptionLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-			List	   *mapped_wcoList;
+			Relation	partrel;
+			List	   *mappedWco;
 			List	   *wcoExprs = NIL;
 			ListCell   *ll;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel);
-			foreach(ll, mapped_wcoList)
+			resultRelInfo = mtstate->mt_partitions[i];
+
+			partrel = resultRelInfo->ri_RelationDesc;
+			mappedWco = map_partition_varattnos(firstWco, firstVarno,
+												partrel, firstResultRel);
+			foreach(ll, mappedWco)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
 
-			resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
+			resultRelInfo->ri_WithCheckOptions = mappedWco;
 			resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
-			resultRelInfo++;
 		}
 	}
 
@@ -2014,7 +2250,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -2051,20 +2287,25 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		resultRelInfo = mtstate->mt_partitions;
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
+			Relation	partrel;
 			List	   *rlist;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel);
+			resultRelInfo = mtstate->mt_partitions[i];
+			partrel = resultRelInfo->ri_RelationDesc;
+
+			/*
+			 * Use the returning expression of the first resultRelInfo as a
+			 * reference to calculate attnos for the returning expression of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList, firstVarno,
+											partrel, firstResultRel);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
@@ -2307,6 +2548,7 @@ void
 ExecEndModifyTable(ModifyTableState *node)
 {
 	int			i;
+	CmdType		operation = node->operation;
 
 	/* Free transition tables */
 	if (node->mt_transition_capture != NULL)
@@ -2343,7 +2585,17 @@ ExecEndModifyTable(ModifyTableState *node)
 	}
 	for (i = 0; i < node->mt_num_partitions; i++)
 	{
-		ResultRelInfo *resultRelInfo = node->mt_partitions + i;
+		ResultRelInfo *resultRelInfo = node->mt_partitions[i];
+
+		/*
+		 * If this result rel is one of the subplan result rels, let
+		 * ExecEndPlan() close it. For INSERTs, this does not apply because
+		 * all leaf partition result rels are newly allocated anyway.
+		 */
+		if (operation == CMD_UPDATE &&
+			resultRelInfo >= node->resultRelInfo &&
+			resultRelInfo < node->resultRelInfo + node->mt_nplans)
+			continue;
 
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index f10879a..b1a60c2 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -79,8 +79,8 @@ extern void check_new_partition_bound(char *relname, Relation parent,
 extern Oid	get_partition_parent(Oid relid);
 extern List *get_qual_from_partbound(Relation rel, Relation parent,
 						PartitionBoundSpec *spec);
-extern List *map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent);
+extern List *map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
 
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index e25cfa3..ea4205d 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,7 +187,10 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
@@ -207,8 +210,10 @@ extern void EvalPlanQualSetTuple(EPQState *epqstate, Index rti,
 					 HeapTuple tuple);
 extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
 extern void ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   PartitionDispatch **pd,
-							   ResultRelInfo **partitions,
+							   ResultRelInfo ***partitions,
 							   TupleConversionMap ***tup_conv_maps,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions);
@@ -216,6 +221,8 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot, EState *estate);
 
 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
 extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 85fac8a..276b65b 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -959,9 +959,13 @@ typedef struct ModifyTableState
 	int			mt_num_dispatch;	/* Number of entries in the above array */
 	int			mt_num_partitions;	/* Number of members in the following
 									 * arrays */
-	ResultRelInfo *mt_partitions;	/* Per partition result relation */
-	TupleConversionMap **mt_partition_tupconv_maps;
+	ResultRelInfo **mt_partitions;	/* Per partition result relation pointers */
+
 	/* Per partition tuple conversion map */
+	TupleConversionMap **mt_partition_tupconv_maps;
+	/* Per resultRelInfo conversion map to convert tuples to root partition */
+	TupleConversionMap **mt_resultrel_maps;
+
 	TupleTableSlot *mt_partition_tuple_slot;
 	struct TransitionCaptureState *mt_transition_capture;
 									/* controls transition table population */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 9366f04..f3c03a7 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,25 +198,189 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c int
 ) partition by range (a, b);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
 create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
 create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+create table part_c_1_100 (b int, c int, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+create table part_c_100_200 (c int, a text, b int);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
 insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
+insert into part_a_10_a_20 values ('a', 10, 200);
+insert into part_c_1_100 (a, b, c) values ('b', 12, 96);
+insert into part_c_1_100 (a, b, c) values ('b', 13, 97);
+insert into part_c_100_200 (a, b, c) values ('b', 15, 105);
+insert into part_c_100_200 (a, b, c) values ('b', 17, 105);
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96;
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (12, 116, b).
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+select a, b, c from part_c_1_100 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select a, b, c from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 12 | 116
+ b | 13 | 117
+ b | 15 | 125
+ b | 17 | 125
+(4 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
 ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+DETAIL:  Failing row contains (b, 7, 117).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+select tableoid::regclass partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_1_a_10  | a |  1 |    
+ part_a_1_a_10  | a |  4 | 200
+ part_b_1_b_10  | b |  7 | 117
+ part_b_1_b_10  | b |  9 | 125
+ part_c_100_200 | b | 11 | 125
+ part_c_100_200 | b | 12 | 116
+(6 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (120, b, 15).
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+select tableoid::regclass partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_1_a_10  | a |  1 |    
+ part_b_1_b_10  | b |  7 | 117
+ part_b_1_b_10  | b |  9 | 125
+ part_c_100_200 | b | 11 | 125
+ part_c_100_200 | b | 12 | 116
+ part_c_100_200 | b | 15 | 199
+(6 rows)
+
 -- cleanup
-drop table range_parted;
+drop view upview;
+drop table mintab, range_parted;
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+-- Setup
+--------
+create table list_parted (a int, b int, c int) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+create table sub_part1(b int, c int, a int);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int, a int);
+alter table sub_parted attach partition sub_part2 for values in (2);
+create table list_part1(a int, b int, c int);
+alter table list_parted attach partition list_part1 for values in (2,3);
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+ERROR:  new row for relation "sub_parted" violates partition constraint
+DETAIL:  Failing row contains (2, 2, 10).
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Triggers should not be allowed to initiate the update row movement
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- THis is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+select tableoid::regclass , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+-- This should fail because trigger on sub_part1 would change column 'b' which
+-- would violate "b in (1)" constraint.
+update list_parted set c = 70 where b  = 1 ;
+ERROR:  new row for relation "sub_part1" violates partition constraint
+DETAIL:  Failing row contains (2, 70, 1).
+drop trigger parted_mod_b ON sub_part1 ;
+-- Now that the trigger is dropped, the same update should succeed
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop function func_parted_mod_b ( ) ;
+drop table list_parted;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 6637119..0113c7d 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,23 +107,128 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c int
 ) partition by range (a, b);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+
 create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
 create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
+
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
+create table part_c_1_100 (b int, c int, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+create table part_c_100_200 (c int, a text, b int);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+
+insert into part_a_1_a_10 values ('a', 1);
+insert into part_a_10_a_20 values ('a', 10, 200);
+insert into part_c_1_100 (a, b, c) values ('b', 12, 96);
+insert into part_c_1_100 (a, b, c) values ('b', 13, 97);
+insert into part_c_100_200 (a, b, c) values ('b', 15, 105);
+insert into part_c_100_200 (a, b, c) values ('b', 17, 105);
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96;
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+select a, b, c from part_c_1_100 order by 1, 2, 3;
+select a, b, c from part_c_100_200 order by 1, 2, 3;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+select tableoid::regclass partname, * from range_parted order by 1, 2, 3, 4;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+
+select tableoid::regclass partname, * from range_parted order by 1, 2, 3, 4;
 
 -- cleanup
-drop table range_parted;
+drop view upview;
+drop table mintab, range_parted;
+
+
+
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+
+-- Setup
+--------
+create table list_parted (a int, b int, c int) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+
+create table sub_part1(b int, c int, a int);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int, a int);
+alter table sub_parted attach partition sub_part2 for values in (2);
+
+create table list_part1(a int, b int, c int);
+alter table list_parted attach partition list_part1 for values in (2,3);
+
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+
+
+-----------
+-- Triggers should not be allowed to initiate the update row movement
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- THis is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+select tableoid::regclass , * from list_parted order by 1, 2, 3, 4;
+
+-- This should fail because trigger on sub_part1 would change column 'b' which
+-- would violate "b in (1)" constraint.
+update list_parted set c = 70 where b  = 1 ;
+drop trigger parted_mod_b ON sub_part1 ;
+-- Now that the trigger is dropped, the same update should succeed
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass , * from list_parted order by 1, 2, 3, 4;
+
+drop function func_parted_mod_b ( ) ;
+drop table list_parted;
#114Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#107)
Re: UPDATE of partition key

On 22 June 2017 at 01:41, Robert Haas <robertmhaas@gmail.com> wrote:

+    for (i = 0; i < num_rels; i++)
+    {
+        ResultRelInfo *resultRelInfo = &result_rels[i];
+        Relation        rel = resultRelInfo->ri_RelationDesc;
+        Bitmapset     *expr_attrs = NULL;
+
+        pull_varattnos((Node *) rel->rd_partcheck, 1, &expr_attrs);
+
+        /* Both bitmaps are offset by FirstLowInvalidHeapAttributeNumber. */
+        if (bms_overlap(expr_attrs, GetUpdatedColumns(resultRelInfo, estate)))
+            return true;
+    }

This seems like an awfully expensive way of performing this test.
Under what circumstances could this be true for some result relations
and false for others?

One resultRelInfo can have no partition key column used in its quals,
but the next resultRelInfo can have quite different quals, and those
quals can refer to partition key columns. This is possible if the two
of them have different parents that have different partition-key
columns.

Hmm, true. So if we have a table foo that is partitioned by list (a),
and one of its children is a table bar that is partitioned by list
(b), then we need to consider doing tuple routing if either column a
is modified, or if column b is modified for a partition which is a
descendant of bar. But testing that only requires looking at the
partitioned table and those of its children that are also partitioned,
not at all of the leaf partitions as the patch does.

The main concern is that the non-leaf partitions are not open (except
the root), so we would need to open them in order to get the partition
keys of the parents of the update resultrels (or get only the
partition key atts and exprs from pg_partitioned_table).

There can be multiple approaches to finding partition key columns.

Approach 1 : When there are a few update result rels and a large
partition tree, we traverse from each of the result rels up to its
ancestors, opening each ancestor (get_partition_parent()) to get its
partition key columns. For result rels having common parents, we do
this only once.

Approach 2 : If there are only a few partitioned tables and a large
number of update result rels, it would be easier to just open all the
partitioned tables and form the partition key column bitmap out of all
their partition keys. If the bitmap does not overlap the updated
columns, it is not a partition-key update. So for typical
non-partition-key updates, just opening the partitioned tables will
suffice, and that would not affect the performance of normal updates.

But if the bitmap does overlap the updated columns, we can't conclude
that it's a partition-key update; that could be a false positive. We
then need to check further whether the update result rels belong to
ancestors whose partition keys are actually updated.

Approach 3 : In RelationData, in a new bitmap field (rd_partcheckattrs
?), store the partition key attrs that are used in rd_partcheck.
Populate this field during generate_partition_qual().

So to conclude, I think, we can do this :

Scenario 1 :
Only one partitioned table : the root; rest all are leaf partitions.
In this case, it is definitely efficient to just check the root
partition key, which will be sufficient.

Scenario 2 :
There are a few non-leaf partitioned tables (3-4) :
Open those tables and follow the 2nd approach above: if we don't find
any updated partition keys in any of them, well and good. If we do,
fall back to approach 3 : for each of the update resultrels, use the
new rd_partcheckattrs bitmap to know whether it uses any of the
updated columns. This would be faster than pulling up attrs from the
quals the way it was done in the patch.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#115Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#114)
1 attachment(s)
Re: UPDATE of partition key

On Thu, Jun 29, 2017 at 3:52 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

So to conclude, I think, we can do this :

Scenario 1 :
Only one partitioned table : the root; rest all are leaf partitions.
In this case, it is definitely efficient to just check the root
partition key, which will be sufficient.

Scenario 2 :
There are a few non-leaf partitioned tables (3-4) :
Open those tables and follow the 2nd approach above: if we don't find
any updated partition keys in any of them, well and good. If we do,
fall back to approach 3 : for each of the update resultrels, use the
new rd_partcheckattrs bitmap to know whether it uses any of the
updated columns. This would be faster than pulling up attrs from the
quals the way it was done in the patch.

I think we should just have the planner figure out a list of which
columns are partitioning columns, either for the named relation or for
some descendant, and set a flag if that set of columns overlaps the
set of columns updated. At execution time, update tuple routing is
needed if either that flag is set or if some partition included in the
plan has a BR UPDATE trigger. Attached is a draft patch implementing
that approach.

This could be made more accurate. Suppose table foo is partitioned by
a, and some but not all of its partitions are subpartitioned by b. If
it so happens that, in a query which only updates b, constraint
exclusion eliminates all of the partitions that are subpartitioned by
b, it would be unnecessary to enable update tuple routing (unless BR
UPDATE triggers are present), but this patch will not figure that out.
I don't think that optimization is critical for the first version of
this feature; there will be a limited number of users with
asymmetrical subpartitioning setups, and if one of them has an idea
how to improve this without hurting anything else, they are free to
contribute a patch. Other optimizations are possible too, but I don't
really see any of them as critical either.

I don't think the approach of building a hash table to figure out
which result rels have already been created is a good one. That too
feels like something that the planner should be figuring out and the
executor should just be implementing what the planner decided. I
haven't figured out exactly how that should work yet, but it seems
like it ought to be doable.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

decide-whether-we-need-update-tuple-routing.patch (application/octet-stream)
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 7da2058..534ed15 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -2072,6 +2072,77 @@ error_exit:
 }
 
 /*
+ * For each column of rel which is in the partition key or which appears
+ * in an expression which is in the partition key, translate the attribute
+ * number of that column according to the given parent, and add the resulting
+ * column number to the bitmapset, offset as we frequently do by
+ * FirstLowInvalidHeapAttributeNumber.
+ */
+void
+pull_child_partition_columns(Bitmapset **bitmapset,
+							 Relation rel,
+							 Relation parent)
+{
+	PartitionKey key = RelationGetPartitionKey(rel);
+	int16		partnatts = get_partition_natts(key);
+	List	   *partexprs = get_partition_exprs(key);
+	ListCell   *lc;
+	Bitmapset  *child_keycols = NULL;
+	int			i;
+	AttrNumber *map;
+	int			child_keycol = -1;
+
+	/*
+	 * First, compute the complete set of partition columns for this rel. For
+	 * compatibility with the API exposed by pull_varattnos, we offset the
+	 * column numbers by FirstLowInvalidHeapAttributeNumber.
+	 */
+	for (i = 0; i < partnatts; i++)
+	{
+		AttrNumber	partattno = get_partition_col_attnum(key, i);
+
+		if (partattno != 0)
+			child_keycols =
+				bms_add_member(child_keycols,
+							   partattno - FirstLowInvalidHeapAttributeNumber);
+	}
+	foreach(lc, partexprs)
+	{
+		Node	   *expr = (Node *) lfirst(lc);
+
+		pull_varattnos(expr, 1, &child_keycols);
+	}
+
+	/*
+	 * Next, work out how to convert from the attribute numbers for the child
+	 * to the attribute numbers for the parent.
+	 */
+	map =
+		convert_tuples_by_name_map(RelationGetDescr(parent),
+								   RelationGetDescr(rel),
+								   gettext_noop("could not convert row type"));
+
+	/*
+	 * For each child key column we have identified, translate to the
+	 * corresponding parent key column.  Entry 0 in the map array corresponds
+	 * to attribute number 1, which corresponds to a bitmapset entry for 1 -
+	 * FirstLowInvalidHeapAttributeNumber.
+	 */
+	while ((child_keycol = bms_next_member(child_keycols, child_keycol)) >= 0)
+	{
+		int			kc = child_keycol + FirstLowInvalidHeapAttributeNumber;
+
+		Assert(kc > 0 && kc <= RelationGetNumberOfAttributes(rel));
+		*bitmapset =
+			bms_add_member(*bitmapset,
+						   map[kc - 1] - FirstLowInvalidHeapAttributeNumber);
+	}
+
+	/* Release memory. */
+	pfree(map);
+}
+
+/*
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 8d17425..c15253f 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1794,6 +1794,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	bool		update_tuple_routing_needed = node->part_cols_updated;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -1865,6 +1866,15 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(resultRelInfo, mtstate->mt_onconflict != ONCONFLICT_NONE);
 
+		/*
+		 * If this is an UPDATE and a BEFORE UPDATE trigger is present, we may
+		 * need to do update tuple routing.
+		 */
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+			operation == CMD_UPDATE)
+			update_tuple_routing_needed = true;
+
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
 		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
@@ -1902,6 +1912,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
+	/* Decide whether we need to perform update tuple routing. */
+	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		update_tuple_routing_needed = false;
+	if (update_tuple_routing_needed)
+		elog(NOTICE, "update tuple routing is needed");
+
 	/* Build state for INSERT tuple routing */
 	if (operation == CMD_INSERT &&
 		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 67ac814..0c949a4 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -204,6 +204,7 @@ _copyModifyTable(const ModifyTable *from)
 	COPY_SCALAR_FIELD(canSetTag);
 	COPY_SCALAR_FIELD(nominalRelation);
 	COPY_NODE_FIELD(partitioned_rels);
+	COPY_SCALAR_FIELD(part_cols_updated);
 	COPY_NODE_FIELD(resultRelations);
 	COPY_SCALAR_FIELD(resultRelIndex);
 	COPY_SCALAR_FIELD(rootResultRelIndex);
@@ -2256,6 +2257,7 @@ _copyPartitionedChildRelInfo(const PartitionedChildRelInfo *from)
 
 	COPY_SCALAR_FIELD(parent_relid);
 	COPY_NODE_FIELD(child_rels);
+	COPY_BITMAPSET_FIELD(all_part_cols);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 91d64b7..15663d5 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -909,6 +909,7 @@ _equalPartitionedChildRelInfo(const PartitionedChildRelInfo *a, const Partitione
 {
 	COMPARE_SCALAR_FIELD(parent_relid);
 	COMPARE_NODE_FIELD(child_rels);
+	COMPARE_BITMAPSET_FIELD(all_part_cols);
 
 	return true;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 3a23f0b..69b773f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -349,6 +349,7 @@ _outModifyTable(StringInfo str, const ModifyTable *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_INT_FIELD(resultRelIndex);
 	WRITE_INT_FIELD(rootResultRelIndex);
@@ -2067,6 +2068,7 @@ _outModifyTablePath(StringInfo str, const ModifyTablePath *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_NODE_FIELD(subroots);
@@ -2489,6 +2491,7 @@ _outPartitionedChildRelInfo(StringInfo str, const PartitionedChildRelInfo *node)
 
 	WRITE_UINT_FIELD(parent_relid);
 	WRITE_NODE_FIELD(child_rels);
+	WRITE_BITMAPSET_FIELD(all_part_cols);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 2988e8b..0532add 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1547,6 +1547,7 @@ _readModifyTable(void)
 	READ_BOOL_FIELD(canSetTag);
 	READ_UINT_FIELD(nominalRelation);
 	READ_NODE_FIELD(partitioned_rels);
+	READ_BOOL_FIELD(part_cols_updated);
 	READ_NODE_FIELD(resultRelations);
 	READ_INT_FIELD(resultRelIndex);
 	READ_INT_FIELD(rootResultRelIndex);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index f087ddb..064af0f 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1291,7 +1291,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	rte = planner_rt_fetch(rel->relid, root);
 	if (rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		partitioned_rels = get_partitioned_child_rels(root, rel->relid);
+		partitioned_rels = get_partitioned_child_rels(root, rel->relid, NULL);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
 	}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e589d92..7e4f058 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -277,6 +277,7 @@ static ProjectSet *make_project_set(List *tlist, Plan *subplan);
 static ModifyTable *make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
@@ -2357,6 +2358,7 @@ create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 							best_path->canSetTag,
 							best_path->nominalRelation,
 							best_path->partitioned_rels,
+							best_path->part_cols_updated,
 							best_path->resultRelations,
 							subplans,
 							best_path->withCheckOptionLists,
@@ -6398,6 +6400,7 @@ static ModifyTable *
 make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam)
@@ -6424,6 +6427,7 @@ make_modifytable(PlannerInfo *root,
 	node->canSetTag = canSetTag;
 	node->nominalRelation = nominalRelation;
 	node->partitioned_rels = partitioned_rels;
+	node->part_cols_updated = part_cols_updated;
 	node->resultRelations = resultRelations;
 	node->resultRelIndex = -1;	/* will be set correctly in setrefs.c */
 	node->rootResultRelIndex = -1;	/* will be set correctly in setrefs.c */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 2988c11..bd99933 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1042,6 +1042,7 @@ inheritance_planner(PlannerInfo *root)
 	Index		rti;
 	RangeTblEntry *parent_rte;
 	List	   *partitioned_rels = NIL;
+	bool		part_cols_updated = false;
 
 	Assert(parse->commandType != CMD_INSERT);
 
@@ -1356,9 +1357,15 @@ inheritance_planner(PlannerInfo *root)
 
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		partitioned_rels = get_partitioned_child_rels(root, parentRTindex);
+		Bitmapset  *all_part_cols;
+
+		partitioned_rels = get_partitioned_child_rels(root, parentRTindex,
+													  &all_part_cols);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
+
+		if (bms_overlap(all_part_cols, parent_rte->updatedCols))
+			part_cols_updated = true;
 	}
 
 	/* Result path must go into outer query's FINAL upperrel */
@@ -1415,6 +1422,7 @@ inheritance_planner(PlannerInfo *root)
 									 parse->canSetTag,
 									 nominalRelation,
 									 partitioned_rels,
+									 part_cols_updated,
 									 resultRelations,
 									 subpaths,
 									 subroots,
@@ -2032,6 +2040,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 										parse->canSetTag,
 										parse->resultRelation,
 										NIL,
+										false,
 										list_make1_int(parse->resultRelation),
 										list_make1(path),
 										list_make1(root),
@@ -6062,10 +6071,15 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
  *		Returns a list of the RT indexes of the partitioned child relations
  *		with rti as the root parent RT index.
  *
+ * If all_part_cols_p is non-NULL, *all_part_cols_p is set to a bitmapset
+ * of all partitioning columns used by the partitioned table or any
+ * descendent.
+ *
  * Note: Only call this function on RTEs known to be partitioned tables.
  */
 List *
-get_partitioned_child_rels(PlannerInfo *root, Index rti)
+get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   Bitmapset **all_part_cols_p)
 {
 	List	   *result = NIL;
 	ListCell   *l;
@@ -6077,6 +6091,8 @@ get_partitioned_child_rels(PlannerInfo *root, Index rti)
 		if (pc->parent_relid == rti)
 		{
 			result = pc->child_rels;
+			if (all_part_cols_p)
+				*all_part_cols_p = pc->all_part_cols;
 			break;
 		}
 	}
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index cf46b74..b52cf3b 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -33,6 +33,7 @@
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
+#include "catalog/partition.h"
 #include "catalog/pg_inherits_fn.h"
 #include "catalog/pg_type.h"
 #include "miscadmin.h"
@@ -1377,6 +1378,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	bool		need_append;
 	PartitionedChildRelInfo *pcinfo;
 	List	   *partitioned_child_rels = NIL;
+	Bitmapset  *all_part_cols = NULL;
 
 	/* Does RT entry allow inheritance? */
 	if (!rte->inh)
@@ -1535,8 +1537,12 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 			}
 		}
 		else
+		{
 			partitioned_child_rels = lappend_int(partitioned_child_rels,
 												 childRTindex);
+			pull_child_partition_columns(&all_part_cols, newrelation,
+										 oldrelation);
+		}
 
 		/*
 		 * Build a PlanRowMark if parent is marked FOR UPDATE/SHARE.
@@ -1604,6 +1610,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 		Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
 		pcinfo->parent_relid = rti;
 		pcinfo->child_rels = partitioned_child_rels;
+		pcinfo->all_part_cols = all_part_cols;
 		root->pcinfo_list = lappend(root->pcinfo_list, pcinfo);
 	}
 
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index f2d6385..f63edf4 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3161,6 +3161,8 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partitioned_rels' is an integer list of RT indexes of non-leaf tables in
  *		the partition tree, if this is an UPDATE/DELETE to a partitioned table.
  *		Otherwise NIL.
+ * 'part_cols_updated' if any partitioning columns are being updated, either
+ *		from the named relation or a descendent partitione table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
  * 'subpaths' is a list of Path(s) producing source data (one per rel)
  * 'subroots' is a list of PlannerInfo structs (one per rel)
@@ -3174,6 +3176,7 @@ ModifyTablePath *
 create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
@@ -3241,6 +3244,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->canSetTag = canSetTag;
 	pathnode->nominalRelation = nominalRelation;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
+	pathnode->part_cols_updated = part_cols_updated;
 	pathnode->resultRelations = resultRelations;
 	pathnode->subpaths = subpaths;
 	pathnode->subroots = subroots;
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index f10879a..058515e 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -98,4 +98,8 @@ extern int get_partition_for_tuple(PartitionDispatch *pd,
 						EState *estate,
 						PartitionDispatchData **failed_at,
 						TupleTableSlot **failed_slot);
+extern void pull_child_partition_columns(Bitmapset **bitmapset,
+							 Relation rel,
+							 Relation parent);
+
 #endif							/* PARTITION_H */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f1a1b24..cd670b9 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -219,6 +219,7 @@ typedef struct ModifyTable
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	int			resultRelIndex; /* index of first resultRel in plan's list */
 	int			rootResultRelIndex; /* index of the partitioned table root */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 9bae3c6..3013964 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1579,6 +1579,7 @@ typedef struct ModifyTablePath
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	List	   *subpaths;		/* Path(s) producing source data */
 	List	   *subroots;		/* per-target-table PlannerInfos */
@@ -2019,6 +2020,10 @@ typedef struct AppendRelInfo
  * The child_rels list must contain at least one element, because the parent
  * partitioned table is itself counted as a child.
  *
+ * all_part_cols contains all attribute numbers from the parent that are
+ * used as partitioning columns by the parent or some descendant which is
+ * itself partitioned.
+ *
  * These structs are kept in the PlannerInfo node's pcinfo_list.
  */
 typedef struct PartitionedChildRelInfo
@@ -2027,6 +2032,7 @@ typedef struct PartitionedChildRelInfo
 
 	Index		parent_relid;
 	List	   *child_rels;
+	Bitmapset  *all_part_cols;
 } PartitionedChildRelInfo;
 
 /*
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 0c0549d..d35f448 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -235,6 +235,7 @@ extern ModifyTablePath *create_modifytable_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index 2a4cf71..c6c15c5 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -57,6 +57,7 @@ extern Expr *preprocess_phv_expression(PlannerInfo *root, Expr *expr);
 
 extern bool plan_cluster_use_sort(Oid tableOid, Oid indexOid);
 
-extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti);
+extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti,
+										Bitmapset **all_part_cols_p);
 
 #endif							/* PLANNER_H */
#116Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Amit Khandekar (#113)
Re: UPDATE of partition key

On Fri, Jun 30, 2017 at 12:01 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 29 June 2017 at 07:42, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

Hi Amit,

On 2017/06/28 20:43, Amit Khandekar wrote:

In attached patch v12

The patch no longer applies and fails to compile after the following
commit was made yesterday:

commit 501ed02cf6f4f60c3357775eb07578aebc912d3a
Author: Andrew Gierth <rhodiumtoad@postgresql.org>
Date: Wed Jun 28 18:55:03 2017 +0100

Fix transition tables for partition/inheritance.

Thanks for informing Amit.

As Thomas mentioned upthread, the above commit already uses a tuple
conversion mapping from leaf partition to root partitioned table
(mt_transition_tupconv_maps), which serves the same purpose as that of
the mapping used in the update-partition-key patch during update tuple
routing (mt_resultrel_maps).

We need to try to merge these two into a general-purpose mapping array
such as mt_leaf_root_maps. I haven't done that in the rebased patch
(attached), so currently it has both these mapping fields.

For transition tables, this map is per-leaf-partition in case of
inserts, whereas it is per-subplan result rel for updates. For
update-tuple routing, the mapping is required to be per-subplan. Now,
for update-row-movement in presence of transition tables, we would
require both per-subplan mapping as well as per-leaf-partition
mapping, which can't be done if we have a single mapping field, unless
we have some way to identify which of the per-leaf partition mapping
elements belong to per-subplan rels.

So, it's not immediately possible to merge them.

Would it make sense to have a set of functions with names like
GetConvertor{From,To}{Subplan,Leaf}(mtstate, index) which build arrays
m_convertors_{from,to}_by_{subplan,leaf} the first time they need
them?
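[Editor's note: the lazily-built converter arrays suggested here would follow a familiar memoized-accessor shape. A minimal standalone sketch, with all names invented for illustration — the real objects are TupleConversionMaps built by convert_tuples_by_name(), not this toy struct:]

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a TupleConversionMap. */
typedef struct ConvMap
{
    int     leaf_index;
} ConvMap;

typedef struct MtState
{
    int       nleaves;
    ConvMap **maps;     /* NULL until the first map is requested */
} MtState;

/*
 * Hypothetical analogue of GetConvertorFromLeaf(mtstate, index):
 * allocate the array on first use, then build each element only when
 * it is first asked for, caching it for later calls.
 */
ConvMap *
get_convertor_from_leaf(MtState *mt, int index)
{
    if (mt->maps == NULL)
        mt->maps = calloc(mt->nleaves, sizeof(ConvMap *));

    if (mt->maps[index] == NULL)
    {
        /* the expensive map-building call would go here */
        mt->maps[index] = malloc(sizeof(ConvMap));
        mt->maps[index]->leaf_index = index;
    }
    return mt->maps[index];
}
```

This way neither caller pays for maps it never uses, and the per-subplan and per-leaf arrays can coexist without eager duplication.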

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#117Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#115)
Re: UPDATE of partition key

On Fri, Jun 30, 2017 at 4:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I don't think the approach of building a hash table to figure out
which result rels have already been created is a good one. That too
feels like something that the planner should be figuring out and the
executor should just be implementing what the planner decided. I
haven't figured out exactly how that should work yet, but it seems
like it ought to be doable.

I was imagining when I wrote the above that the planner should somehow
compute a list of relations that it has excluded so that the executor
can skip building ResultRelInfos for exactly those relations, but on
further study, that's not particularly easy to achieve and wouldn't
really save anything anyway, because the list of OIDs is coming
straight out of the partition descriptor, so it's pretty much free.
However, I still think it would be a nifty idea if we could avoid
needing the hash table to deduplicate. The reason we need that is, I
think, that expand_inherited_rtentry() is going to expand the
inheritance hierarchy in whatever order the scan(s) of pg_inherits
return the descendant tables, whereas the partition descriptor is
going to put them in a canonical order.

But that seems like it wouldn't be too hard to fix: let's have
expand_inherited_rtentry() expand the partitioned table in the same
order that will be used by ExecSetupPartitionTupleRouting(). That
seems pretty easy to do - just have expand_inherited_rtentry() notice
that it's got a partitioned table and call
RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
produce the list of OIDs. Then - I think -
ExecSetupPartitionTupleRouting() doesn't need the hash table; it can
just scan through the return value of RelationGetPartitionDispatchInfo()
and the list of already-created ResultRelInfo structures in parallel -
the order must be the same, but the latter can be missing some
elements, so it can just create the missing ones.
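[Editor's note: the merge step described above — two lists in the same canonical order, the second possibly missing elements — can be sketched as a single linear pass. A standalone illustration with invented names and flat OID arrays standing in for the partition descriptor and the already-created ResultRelInfos; not the actual executor code:]

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;

/*
 * 'all' holds every leaf partition OID in canonical order; 'existing'
 * holds the already-created subset in the same relative order.  One
 * linear pass writes the OIDs that still need a ResultRelInfo into
 * 'missing' and returns how many there are -- no hash table needed,
 * because both inputs share the same ordering.
 */
size_t
find_missing(const Oid *all, size_t nall,
             const Oid *existing, size_t nexisting,
             Oid *missing)
{
    size_t  j = 0;
    size_t  nmissing = 0;

    for (size_t i = 0; i < nall; i++)
    {
        if (j < nexisting && existing[j] == all[i])
            j++;                            /* already created; skip */
        else
            missing[nmissing++] = all[i];   /* must create this one */
    }
    assert(j == nexisting);                 /* every existing rel matched */
    return nmissing;
}
```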

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#118Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Robert Haas (#117)
Re: UPDATE of partition key

On 2017/07/02 20:10, Robert Haas wrote:

On Fri, Jun 30, 2017 at 4:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I don't think the approach of building a hash table to figure out
which result rels have already been created is a good one. That too
feels like something that the planner should be figuring out and the
executor should just be implementing what the planner decided. I
haven't figured out exactly how that should work yet, but it seems
like it ought to be doable.

I was imagining when I wrote the above that the planner should somehow
compute a list of relations that it has excluded so that the executor
can skip building ResultRelInfos for exactly those relations, but on
further study, that's not particularly easy to achieve and wouldn't
really save anything anyway, because the list of OIDs is coming
straight out of the partition descriptor, so it's pretty much free.
However, I still think it would be a nifty idea if we could avoid
needing the hash table to deduplicate. The reason we need that is, I
think, that expand_inherited_rtentry() is going to expand the
inheritance hierarchy in whatever order the scan(s) of pg_inherits
return the descendant tables, whereas the partition descriptor is
going to put them in a canonical order.

But that seems like it wouldn't be too hard to fix: let's have
expand_inherited_rtentry() expand the partitioned table in the same
order that will be used by ExecSetupPartitionTupleRouting(). That
seems pretty easy to do - just have expand_inherited_rtentry() notice
that it's got a partitioned table and call
RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
produce the list of OIDs. Then - I think -
ExecSetupPartitionTupleRouting() doesn't need the hash table; it can
just scan through the return value of RelationGetPartitionDispatchInfo()
and the list of already-created ResultRelInfo structures in parallel -
the order must be the same, but the latter can be missing some
elements, so it can just create the missing ones.

Interesting idea.

If we are going to do this, I think we may need to modify
RelationGetPartitionDispatchInfo() a bit or invent an alternative that
does not do as much work. Currently, it assumes that it's only ever
called by ExecSetupPartitionTupleRouting() and hence also generates
PartitionDispatchInfo objects for partitioned child tables. We don't need
that if called from within the planner.

Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled
with its usage within the executor, because there is this comment:

/*
* We keep the partitioned ones open until we're done using the
* information being collected here (for example, see
* ExecEndModifyTable).
*/

Thanks,
Amit


#119Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Amit Langote (#118)
Re: UPDATE of partition key

On 2017/07/03 18:54, Amit Langote wrote:

On 2017/07/02 20:10, Robert Haas wrote:

But that seems like it wouldn't be too hard to fix: let's have
expand_inherited_rtentry() expand the partitioned table in the same
order that will be used by ExecSetupPartitionTupleRouting().

That's really what I wanted when updating the patch for tuple-routing to
foreign partitions. (I don't understand the issue discussed here, though.)

That
seems pretty easy to do - just have expand_inherited_rtentry() notice
that it's got a partitioned table and call
RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
produce the list of OIDs.

Seems like a good idea.

Interesting idea.

If we are going to do this, I think we may need to modify
RelationGetPartitionDispatchInfo() a bit or invent an alternative that
does not do as much work. Currently, it assumes that it's only ever
called by ExecSetupPartitionTupleRouting() and hence also generates
PartitionDispatchInfo objects for partitioned child tables. We don't need
that if called from within the planner.

Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled
with its usage within the executor, because there is this comment:

/*
* We keep the partitioned ones open until we're done using the
* information being collected here (for example, see
* ExecEndModifyTable).
*/

Yeah, we need some refactoring work. Is anyone working on that?

Best regards,
Etsuro Fujita


#120Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Etsuro Fujita (#119)
Re: UPDATE of partition key

On 2017/07/04 17:25, Etsuro Fujita wrote:

On 2017/07/03 18:54, Amit Langote wrote:

On 2017/07/02 20:10, Robert Haas wrote:

That
seems pretty easy to do - just have expand_inherited_rtentry() notice
that it's got a partitioned table and call
RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
produce the list of OIDs.

Seems like a good idea.

Interesting idea.

If we are going to do this, I think we may need to modify
RelationGetPartitionDispatchInfo() a bit or invent an alternative that
does not do as much work. Currently, it assumes that it's only ever
called by ExecSetupPartitionTupleRouting() and hence also generates
PartitionDispatchInfo objects for partitioned child tables. We don't need
that if called from within the planner.

Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled
with its usage within the executor, because there is this comment:

/*
* We keep the partitioned ones open until we're done using the
* information being collected here (for example, see
* ExecEndModifyTable).
*/

Yeah, we need some refactoring work. Is anyone working on that?

I would like to take a shot at that if someone else hasn't already cooked
up a patch. Working on making RelationGetPartitionDispatchInfo() a
routine callable from both within the planner and the executor should be a
worthwhile effort.

Thanks,
Amit


#121Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Langote (#120)
Re: UPDATE of partition key

On 4 July 2017 at 14:38, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/07/04 17:25, Etsuro Fujita wrote:

On 2017/07/03 18:54, Amit Langote wrote:

On 2017/07/02 20:10, Robert Haas wrote:

That
seems pretty easy to do - just have expand_inherited_rtentry() notice
that it's got a partitioned table and call
RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
produce the list of OIDs.

Seems like a good idea.

Interesting idea.

If we are going to do this, I think we may need to modify
RelationGetPartitionDispatchInfo() a bit or invent an alternative that
does not do as much work. Currently, it assumes that it's only ever
called by ExecSetupPartitionTupleRouting() and hence also generates
PartitionDispatchInfo objects for partitioned child tables. We don't need
that if called from within the planner.

Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled
with its usage within the executor, because there is this comment:

/*
* We keep the partitioned ones open until we're done using the
* information being collected here (for example, see
* ExecEndModifyTable).
*/

Yeah, we need some refactoring work. Is anyone working on that?

I would like to take a shot at that if someone else hasn't already cooked
up a patch. Working on making RelationGetPartitionDispatchInfo() a
routine callable from both within the planner and the executor should be a
worthwhile effort.

What I am currently working on is to see if we can call
find_all_inheritors() or find_inheritance_children() instead of
generating the leaf_part_oids using APPEND_REL_PARTITION_OIDS().
Possibly we don't have to refactor it completely.
find_inheritance_children() needs to return the oids in canonical
order. So find_inheritance_children() needs to re-use the part of
RelationBuildPartitionDesc() where it generates those oids in that
order. I am checking this part, and am going to come up with an
approach based on findings.

Also, need to investigate whether *always* sorting the oids in
canonical order is going to be much more expensive than the current sorting
using oids. But I guess it won't be that expensive.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


#122Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#121)
Re: UPDATE of partition key

On 4 July 2017 at 14:48, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 4 July 2017 at 14:38, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/07/04 17:25, Etsuro Fujita wrote:

On 2017/07/03 18:54, Amit Langote wrote:

On 2017/07/02 20:10, Robert Haas wrote:

That
seems pretty easy to do - just have expand_inherited_rtentry() notice
that it's got a partitioned table and call
RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
produce the list of OIDs.

Seems like a good idea.

Interesting idea.

If we are going to do this, I think we may need to modify
RelationGetPartitionDispatchInfo() a bit or invent an alternative that
does not do as much work. Currently, it assumes that it's only ever
called by ExecSetupPartitionTupleRouting() and hence also generates
PartitionDispatchInfo objects for partitioned child tables. We don't need
that if called from within the planner.

Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled
with its usage within the executor, because there is this comment:

/*
* We keep the partitioned ones open until we're done using the
* information being collected here (for example, see
* ExecEndModifyTable).
*/

Yeah, we need some refactoring work. Is anyone working on that?

I would like to take a shot at that if someone else hasn't already cooked
up a patch. Working on making RelationGetPartitionDispatchInfo() a
routine callable from both within the planner and the executor should be a
worthwhile effort.

What I am currently working on is to see if we can call
find_all_inheritors() or find_inheritance_children() instead of
generating the leaf_part_oids using APPEND_REL_PARTITION_OIDS().
Possibly we don't have to refactor it completely.
find_inheritance_children() needs to return the oids in canonical
order. So find_inheritance_children() needs to re-use the part of
RelationBuildPartitionDesc() where it generates those oids in that
order. I am checking this part, and am going to come up with an
approach based on findings.

The other approach is to make canonical ordering only in
find_all_inheritors() by replacing call to find_inheritance_children()
with the refactored/modified RelationGetPartitionDispatchInfo(). But
that would mean that the callers of find_inheritance_children() would
have one ordering, while the callers of find_all_inheritors() would
have a different ordering; that brings up chances of deadlocks. That's
why I think, we need to think about modifying the common function
find_inheritance_children(), so that we would be consistent with the
ordering. And then use find_inheritance_children() or
find_all_inheritors() in RelationGetPartitionDispatchInfo(). So yes,
there would be some modifications to
RelationGetPartitionDispatchInfo().

Also, need to investigate whether *always* sorting the oids in
canonical order is going to be much more expensive than the current sorting
using oids. But I guess it won't be that expensive.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


#123Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#122)
Re: UPDATE of partition key

On 4 July 2017 at 15:23, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 4 July 2017 at 14:48, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 4 July 2017 at 14:38, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/07/04 17:25, Etsuro Fujita wrote:

On 2017/07/03 18:54, Amit Langote wrote:

On 2017/07/02 20:10, Robert Haas wrote:

That
seems pretty easy to do - just have expand_inherited_rtentry() notice
that it's got a partitioned table and call
RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
produce the list of OIDs.

Seems like a good idea.

Interesting idea.

If we are going to do this, I think we may need to modify
RelationGetPartitionDispatchInfo() a bit or invent an alternative that
does not do as much work. Currently, it assumes that it's only ever
called by ExecSetupPartitionTupleRouting() and hence also generates
PartitionDispatchInfo objects for partitioned child tables. We don't need
that if called from within the planner.

Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled
with its usage within the executor, because there is this comment:

/*
* We keep the partitioned ones open until we're done using the
* information being collected here (for example, see
* ExecEndModifyTable).
*/

Yeah, we need some refactoring work. Is anyone working on that?

I would like to take a shot at that if someone else hasn't already cooked
up a patch. Working on making RelationGetPartitionDispatchInfo() a
routine callable from both within the planner and the executor should be a
worthwhile effort.

What I am currently working on is to see if we can call
find_all_inheritors() or find_inheritance_children() instead of
generating the leaf_part_oids using APPEND_REL_PARTITION_OIDS().
Possibly we don't have to refactor it completely.
find_inheritance_children() needs to return the oids in canonical
order. So find_inheritance_children() needs to re-use the part of
RelationBuildPartitionDesc() where it generates those oids in that
order. I am checking this part, and am going to come up with an
approach based on findings.

The other approach is to make canonical ordering only in
find_all_inheritors() by replacing call to find_inheritance_children()
with the refactored/modified RelationGetPartitionDispatchInfo(). But
that would mean that the callers of find_inheritance_children() would
have one ordering, while the callers of find_all_inheritors() would
have a different ordering; that brings up chances of deadlocks. That's
why I think, we need to think about modifying the common function
find_inheritance_children(), so that we would be consistent with the
ordering. And then use find_inheritance_children() or
find_all_inheritors() in RelationGetPartitionDispatchInfo(). So yes,
there would be some modifications to
RelationGetPartitionDispatchInfo().

Also, need to investigate whether *always* sorting the oids in
canonical order is going to be much more expensive than the current sorting
using oids. But I guess it won't be that expensive.

Like I mentioned upthread... in expand_inherited_rtentry(), if we
replace find_all_inheritors() with something else that returns oids in
canonical order, that will change the order in which children tables
get locked, which increases the chance of deadlock. Because, then the
callers of find_all_inheritors() will lock them in one order, while
callers of expand_inherited_rtentry() will lock them in a different
order. Even in the current code, I think there is a chance of
deadlocks because RelationGetPartitionDispatchInfo() and
find_all_inheritors() have different lock ordering.

Now, to get the oids of a partitioned table children sorted by
canonical ordering, (i.e. using the partition bound values) we need to
either use the partition bounds to sort the oids like the way it is
done in RelationBuildPartitionDesc() or, open the parent table and get
its Relation->rd_partdesc->oids[] which are already sorted in
canonical order. So if we generate oids using this way in
find_all_inheritors() and find_inheritance_children(), that will
generate consistent ordering everywhere. But this method is quite
expensive as compared to the way oids are generated and sorted using
oid values in find_inheritance_children().
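[Editor's note: the difference between the two orderings can be seen with plain qsort() comparators — one comparing raw OID values, one comparing partition bounds. A standalone illustration in which the bound is reduced to a single integer lower bound; not the catalog code:]

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned int Oid;

/* A partition with its OID and a (simplified) lower bound. */
typedef struct Part
{
    Oid     oid;
    int     lower_bound;
} Part;

/* Order used today by find_inheritance_children(): raw OID value. */
int
cmp_by_oid(const void *a, const void *b)
{
    const Part *pa = a, *pb = b;

    return (pa->oid > pb->oid) - (pa->oid < pb->oid);
}

/* "Canonical" order: by partition bound, as in the partition descriptor. */
int
cmp_by_bound(const void *a, const void *b)
{
    const Part *pa = a, *pb = b;

    return (pa->lower_bound > pb->lower_bound) -
           (pa->lower_bound < pb->lower_bound);
}
```

Because partitions are created in arbitrary order, the two comparators generally produce different sequences, which is why a single shared ordering matters for lock-order consistency.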

In both expand_inherited_rtentry() and
RelationGetPartitionDispatchInfo(), each of the child tables are
opened.

So, in both of these functions, what we can do is : call a new
function partition_tree_walker() which does following :
1. Lock the children using the existing order (i.e. sorted by oid
values) using the same function find_all_inheritors(). Rename
find_all_inheritors() to lock_all_inheritors(... , bool return_oids)
which returns the oid list only if requested.
2. And then scan through each of the partitions in canonical order, by
opening the parent table, then opening the partition descriptor oids,
and then doing whatever needs to be done with that partition rel.

partition_tree_walker() will look something like this :

void partition_tree_walker(Oid parentOid, LOCKMODE lockmode,
                           void (*walker_func) (), void *context)
{
    Relation    parentrel;
    List       *rels_list;
    ListCell   *cell;

    (void) lock_all_inheritors(parentOid, lockmode,
                               false /* don't generate oids */);

    parentrel = heap_open(parentOid, NoLock);
    rels_list = append_rel_partition_oids(NIL, parentrel);

    /*
     * Scan through all partitioned rels, and at the same time append
     * their children.
     */
    foreach(cell, rels_list)
    {
        /* Open partrel without locking; lock_all_inheritors() has locked it */
        Relation    partrel = heap_open(lfirst_oid(cell), NoLock);

        /*
         * Append the children of a partitioned rel to the same list
         * that we are iterating on.
         */
        if (RelationGetPartitionDesc(partrel))
            rels_list = append_rel_partition_oids(rels_list, partrel);

        /*
         * Do whatever processing needs to be done on this partrel.
         * The walker function is free to either close the partrel
         * or keep it opened, but it needs to make sure the opened
         * ones are closed later.
         */
        walker_func(partrel, context);
    }
}

List *append_rel_partition_oids(List *rel_list, Relation rel)
{
    int         i;

    for (i = 0; i < rel->rd_partdesc->nparts; i++)
        rel_list = lappend_oid(rel_list, rel->rd_partdesc->oids[i]);

    return rel_list;
}

So, in expand_inherited_rtentry() the foreach(l, inhOIDs) loop will be
replaced by partition_tree_walker(parentOid, expand_rte_walker_func)
where expand_rte_walker_func() will do all the work done in the for
loop for each of the partition rels.

Similarly, in RelationGetPartitionDispatchInfo() the initial part
where it uses APPEND_REL_PARTITION_OIDS() can be replaced by
partition_tree_walker(rel, dispatch_info_walkerfunc) where
dispatch_info_walkerfunc() will generate the oids, or may be populate
the complete PartitionDispatchData structure. 'pd' variable can be
passed as context to the partition_tree_walker(..., context)

Generating the resultrels in canonical order by opening the tables
using the above way wouldn't be more expensive than the existing code,
because even currently we anyway have to open all the tables in both
of these functions.

-Amit Khandekar


#124Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#123)
1 attachment(s)
Re: UPDATE of partition key

On 5 July 2017 at 15:12, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Like I mentioned upthread... in expand_inherited_rtentry(), if we
replace find_all_inheritors() with something else that returns oids in
canonical order, that will change the order in which children tables
get locked, which increases the chance of deadlock. Because, then the
callers of find_all_inheritors() will lock them in one order, while
callers of expand_inherited_rtentry() will lock them in a different
order. Even in the current code, I think there is a chance of
deadlocks because RelationGetPartitionDispatchInfo() and
find_all_inheritors() have different lock ordering.

Now, to get the oids of a partitioned table children sorted by
canonical ordering, (i.e. using the partition bound values) we need to
either use the partition bounds to sort the oids like the way it is
done in RelationBuildPartitionDesc() or, open the parent table and get
its Relation->rd_partdesc->oids[] which are already sorted in
canonical order. So if we generate oids using this way in
find_all_inheritors() and find_inheritance_children(), that will
generate consistent ordering everywhere. But this method is quite
expensive as compared to the way oids are generated and sorted using
oid values in find_inheritance_children().

In both expand_inherited_rtentry() and
RelationGetPartitionDispatchInfo(), each of the child tables are
opened.

So, in both of these functions, what we can do is : call a new
function partition_tree_walker() which does following :
1. Lock the children using the existing order (i.e. sorted by oid
values) using the same function find_all_inheritors(). Rename
find_all_inheritors() to lock_all_inheritors(... , bool return_oids)
which returns the oid list only if requested.
2. And then scan through each of the partitions in canonical order, by
opening the parent table, then opening the partition descriptor oids,
and then doing whatever needs to be done with that partition rel.

partition_tree_walker() will look something like this :

void partition_tree_walker(Oid parentOid, LOCKMODE lockmode,
                           void (*walker_func) (), void *context)
{
    Relation    parentrel;
    List       *rels_list;
    ListCell   *cell;

    (void) lock_all_inheritors(parentOid, lockmode,
                               false /* don't generate oids */);

    parentrel = heap_open(parentOid, NoLock);
    rels_list = append_rel_partition_oids(NIL, parentrel);

    /*
     * Scan through all partitioned rels, and at the same time append
     * their children.
     */
    foreach(cell, rels_list)
    {
        /* Open partrel without locking; lock_all_inheritors() has locked it */
        Relation    partrel = heap_open(lfirst_oid(cell), NoLock);

        /*
         * Append the children of a partitioned rel to the same list
         * that we are iterating on.
         */
        if (RelationGetPartitionDesc(partrel))
            rels_list = append_rel_partition_oids(rels_list, partrel);

        /*
         * Do whatever processing needs to be done on this partrel.
         * The walker function is free to either close the partrel
         * or keep it opened, but it needs to make sure the opened
         * ones are closed later.
         */
        walker_func(partrel, context);
    }
}

List *append_rel_partition_oids(List *rel_list, Relation rel)
{
    int         i;

    for (i = 0; i < rel->rd_partdesc->nparts; i++)
        rel_list = lappend_oid(rel_list, rel->rd_partdesc->oids[i]);

    return rel_list;
}

So, in expand_inherited_rtentry() the foreach(l, inhOIDs) loop will be
replaced by partition_tree_walker(parentOid, expand_rte_walker_func)
where expand_rte_walker_func() will do all the work done in the for
loop for each of the partition rels.

Similarly, in RelationGetPartitionDispatchInfo() the initial part
where it uses APPEND_REL_PARTITION_OIDS() can be replaced by
partition_tree_walker(rel, dispatch_info_walkerfunc) where
dispatch_info_walkerfunc() will generate the oids, or may be populate
the complete PartitionDispatchData structure. 'pd' variable can be
passed as context to the partition_tree_walker(..., context)

Generating the resultrels in canonical order by opening the tables
using the above way wouldn't be more expensive than the existing code,
because even currently we anyway have to open all the tables in both
of these functions.

Attached is a WIP patch (make_resultrels_ordered.patch) that generates
the result rels in canonical order. This patch is kept separate from
the update-partition-key patch, and can be applied on master branch.

In this patch, rather than partition_tree_walker() called with a
context, I have provided a function partition_walker_next() using
which we iterate over all the partitions in canonical order.
partition_walker_next() will take care of appending oids from
partition descriptors.

Now, to generate a consistent OID ordering in
RelationGetPartitionDispatchInfo() and expand_inherited_rtentry(), we
could very well have skipped using the partition-walker API in
expand_inherited_rtentry() and just had it iterate over the partition
descriptors the way it is done in RelationGetPartitionDispatchInfo().
But I think it's better to have a common function for traversing the
partition tree in a consistent order, hence the use of
partition_walker_next() in both expand_inherited_rtentry() and
RelationGetPartitionDispatchInfo(). RelationGetPartitionDispatchInfo()
still uses this function only to generate the partitioned-table list,
but even for that it is better to use partition_walker_next(), so that
we are sure to finally generate a consistent order of leaf OIDs.

I considered the option where RelationGetPartitionDispatchInfo() would
directly build the pd[] array over each iteration of
partition_walker_next(). But that was turning out to be clumsy,
because then we need to keep track of which pd[] element each of the
oids would go into by having a current position of pd[]. Rather than
this, it is best to keep building of pd array separate, as done in the
existing code.

I didn't do any renaming for find_all_inheritors(); I just called it
in both functions and ignored the returned list. As mentioned
upthread, it is important to lock in this order so as to be
consistent with the lock ordering in the other places where
find_inheritance_children() is called. Hence, find_all_inheritors()
is called in RelationGetPartitionDispatchInfo() as well.

Note that this patch does not attempt to make
RelationGetPartitionDispatchInfo() work in the planner. I think that
should be done once we finalise how to generate a common OID
ordering, and it is not in the scope of this project.

Once I merge this into the update-partition-key patch, I will be able
to search for the leaf partitions in this ordered result-rel list in
ExecSetupPartitionTupleRouting(), without having to build a hash
table of result rels the way the update-partition-key patch currently
does.

Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

make_resultrels_ordered.patch (application/octet-stream)
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 7da2058..0d64adf 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -113,6 +113,16 @@ typedef struct PartitionRangeBound
 	bool		lower;			/* this is the lower (vs upper) bound */
 } PartitionRangeBound;
 
+/*
+ * List of these elements is prepared while traversing a partition tree,
+ * so as to get a consistent order of partitions.
+ */
+typedef struct ParentChild
+{
+	Oid         reloid;
+	Relation    parent;			/* Parent relation of reloid */
+} ParentChild;
+
 static int32 qsort_partition_list_value_cmp(const void *a, const void *b,
 							   void *arg);
 static int32 qsort_partition_rbound_cmp(const void *a, const void *b,
@@ -148,6 +158,8 @@ static int partition_bound_bsearch(PartitionKey key,
 						PartitionBoundInfo boundinfo,
 						void *probe, bool probe_is_bound, bool *is_equal);
 
+static List *append_rel_partition_oids(List *rel_list, Relation rel);
+
 /*
  * RelationBuildPartitionDesc
  *		Form rel's partition descriptor
@@ -999,20 +1011,9 @@ get_partition_qual_relid(Oid relid)
 	return result;
 }
 
-/*
- * Append OIDs of rel's partitions to the list 'partoids' and for each OID,
- * append pointer rel to the list 'parents'.
- */
-#define APPEND_REL_PARTITION_OIDS(rel, partoids, parents) \
-	do\
-	{\
-		int		i;\
-		for (i = 0; i < (rel)->rd_partdesc->nparts; i++)\
-		{\
-			(partoids) = lappend_oid((partoids), (rel)->rd_partdesc->oids[i]);\
-			(parents) = lappend((parents), (rel));\
-		}\
-	} while(0)
+#ifdef DEBUG_PRINT_OIDS
+static void print_oids(List *oid_list);
+#endif
 
 /*
  * RelationGetPartitionDispatchInfo
@@ -1026,11 +1027,13 @@ PartitionDispatch *
 RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
 								 int *num_parted, List **leaf_part_oids)
 {
+	PartitionWalker walker;
 	PartitionDispatchData **pd;
-	List	   *all_parts = NIL,
-			   *all_parents = NIL,
-			   *parted_rels,
+	Relation	partrel;
+	Relation	parent;
+	List	   *parted_rels,
 			   *parted_rel_parents;
+	List	   *inhOIDs;
 	ListCell   *lc1,
 			   *lc2;
 	int			i,
@@ -1041,21 +1044,28 @@ RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
 	 * Lock partitions and make a list of the partitioned ones to prepare
 	 * their PartitionDispatch objects below.
 	 *
-	 * Cannot use find_all_inheritors() here, because then the order of OIDs
-	 * in parted_rels list would be unknown, which does not help, because we
-	 * assign indexes within individual PartitionDispatch in an order that is
-	 * predetermined (determined by the order of OIDs in individual partition
-	 * descriptors).
+	 * Must call find_all_inheritors() here so as to lock the partitions in a
+	 * consistent order (by oid values) to prevent deadlocks. But we assign
+	 * indexes within individual PartitionDispatch in a different order
+	 * (determined by the order of OIDs in individual partition descriptors).
+	 * So, rather than using the oids returned by find_all_inheritors(), we
+	 * generate canonically ordered oids using partition walker.
 	 */
+	inhOIDs = find_all_inheritors(RelationGetRelid(rel), lockmode, NULL);
+	list_free(inhOIDs);
+
+	partition_walker_init(&walker, rel);
+	parent = NULL;
 	*num_parted = 1;
 	parted_rels = list_make1(rel);
 	/* Root partitioned table has no parent, so NULL for parent */
 	parted_rel_parents = list_make1(NULL);
-	APPEND_REL_PARTITION_OIDS(rel, all_parts, all_parents);
-	forboth(lc1, all_parts, lc2, all_parents)
+
+	/* Go to the next partition */
+	partrel = partition_walker_next(&walker, &parent);
+
+	for (; partrel != NULL; partrel = partition_walker_next(&walker, &parent))
 	{
-		Relation	partrel = heap_open(lfirst_oid(lc1), lockmode);
-		Relation	parent = lfirst(lc2);
 		PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
 
 		/*
@@ -1067,7 +1077,6 @@ RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
 			(*num_parted)++;
 			parted_rels = lappend(parted_rels, partrel);
 			parted_rel_parents = lappend(parted_rel_parents, parent);
-			APPEND_REL_PARTITION_OIDS(partrel, all_parts, all_parents);
 		}
 		else
 			heap_close(partrel, NoLock);
@@ -1171,6 +1180,10 @@ RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
 		offset += m;
 	}
 
+#ifdef DEBUG_PRINT_OIDS
+	print_oids(*leaf_part_oids);
+#endif
+
 	return pd;
 }
 
@@ -2331,3 +2344,100 @@ partition_bound_bsearch(PartitionKey key, PartitionBoundInfo boundinfo,
 
 	return lo;
 }
+
+/*
+ * partition_walker_init
+ *
+ * Using the passed partitioned relation, expand it into its partitions using
+ * its partition descriptor, and make a partition rel list out of those. The
+ * rel passed in itself is not kept part of the partition list. The caller
+ * should handle the first rel separately before calling this function.
+ */
+void
+partition_walker_init(PartitionWalker *walker, Relation rel)
+{
+	memset(walker, 0, sizeof(PartitionWalker));
+
+	if (RelationGetPartitionDesc(rel))
+		walker->rels_list = append_rel_partition_oids(walker->rels_list, rel);
+
+	/* Assign the first one as the current partition cell */
+	walker->cur_cell = list_head(walker->rels_list);
+}
+
+/*
+ * partition_walker_next
+ *
+ * Get the next partition in the partition tree.
+ * At the same time, if the partition is a partitioned table, append its
+ * children at the end, so that the next time we can traverse through these.
+ */
+Relation
+partition_walker_next(PartitionWalker *walker, Relation *parent)
+{
+	ParentChild	   *pc;
+	Relation    partrel;
+
+	if (walker->cur_cell == NULL)
+		return NULL;
+
+	pc = (ParentChild *) lfirst(walker->cur_cell);
+	if (parent)
+		*parent = pc->parent;
+
+	/* Open partrel without locking; find_all_inheritors() has locked it */
+	partrel = heap_open(pc->reloid, NoLock);
+
+	/*
+	 * Append the children of partrel to the same list that we are
+	 * iterating on.
+	 */
+	if (RelationGetPartitionDesc(partrel))
+		walker->rels_list = append_rel_partition_oids(walker->rels_list,
+													  partrel);
+
+	/* Bump the cur_cell here at the end, because above, we modify the list */
+	walker->cur_cell = lnext(walker->cur_cell);
+
+	return partrel;
+}
+
+/*
+ * append_rel_partition_oids
+ *
+ * Append OIDs of rel's partitions to the list 'rel_list' and for each OID,
+ * also store parent rel.
+ */
+static
+List *append_rel_partition_oids(List *rel_list, Relation rel)
+{
+	int		i;
+	PartitionDescData *partdesc = RelationGetPartitionDesc(rel);
+
+	Assert(partdesc);
+
+	for (i = 0; i < partdesc->nparts; i++)
+	{
+		ParentChild *pc = palloc(sizeof(ParentChild));
+		pc->parent = rel;
+		pc->reloid = rel->rd_partdesc->oids[i];
+		rel_list = lappend(rel_list, pc);
+	}
+	return rel_list;
+}
+
+#ifdef DEBUG_PRINT_OIDS
+static void
+print_oids(List *oid_list)
+{
+	ListCell   *cell;
+	StringInfoData oids_str;
+
+	initStringInfo(&oids_str);
+	foreach(cell, oid_list)
+	{
+		appendStringInfo(&oids_str, "%s ", get_rel_name(lfirst_oid(cell)));
+	}
+	elog(NOTICE, "leaf oids: %s", oids_str.data);
+}
+#endif
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index cf46b74..e9856c4 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -33,6 +33,7 @@
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
+#include "catalog/partition.h"
 #include "catalog/pg_inherits_fn.h"
 #include "catalog/pg_type.h"
 #include "miscadmin.h"
@@ -99,6 +100,8 @@ static List *generate_append_tlist(List *colTypes, List *colCollations,
 static List *generate_setop_grouplist(SetOperationStmt *op, List *targetlist);
 static void expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte,
 						 Index rti);
+static Relation get_next_child(Relation oldrelation, ListCell **cell,
+						PartitionWalker *walker, bool is_partitioned);
 static void make_inh_translation_list(Relation oldrelation,
 						  Relation newrelation,
 						  Index newvarno,
@@ -1370,12 +1373,15 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	Oid			parentOID;
 	PlanRowMark *oldrc;
 	Relation	oldrelation;
+	Relation	newrelation;
 	LOCKMODE	lockmode;
 	List	   *inhOIDs;
 	List	   *appinfos;
-	ListCell   *l;
+	ListCell   *oids_cell;
 	bool		need_append;
+	bool		is_partitioned_resultrel;
 	PartitionedChildRelInfo *pcinfo;
+	PartitionWalker walker;
 	List	   *partitioned_child_rels = NIL;
 
 	/* Does RT entry allow inheritance? */
@@ -1446,23 +1452,54 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	 */
 	oldrelation = heap_open(parentOID, NoLock);
 
+	/*
+	 * Remember whether it is a result relation and it is partitioned. We need
+	 * to decide the ordering of result rels based on this.
+	 */
+	is_partitioned_resultrel =
+		(oldrelation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE
+		 && rti == parse->resultRelation);
+
 	/* Scan the inheritance set and expand it */
 	appinfos = NIL;
 	need_append = false;
-	foreach(l, inhOIDs)
+	newrelation = oldrelation;
+
+	/* For non-partitioned result-rels, open the first child from inhOIDs */
+	if (!is_partitioned_resultrel)
+	{
+		oids_cell = list_head(inhOIDs);
+		newrelation = get_next_child(oldrelation, &oids_cell, &walker,
+									 is_partitioned_resultrel);
+	}
+	else
 	{
-		Oid			childOID = lfirst_oid(l);
-		Relation	newrelation;
+		/*
+		 * For partitioned resultrels, we don't need the inhOIDs list itself,
+		 * because we anyways traverse the tree in canonical order; but we do
+		 * want to lock all the children in a consistent order (see
+		 * find_inheritance_children), so as to avoid unnecessary deadlocks.
+		 * Hence, the call to find_all_inheritors() above. The aim is to
+		 * generate the appinfos in canonical order so that the result rels,
+		 * if generated later, are in the same order as those of the leaf
+		 * partitions that are maintained during insert/update tuple routing.
+		 * Maintaining same order would speed up searching for a given leaf
+		 * partition in these result rels.
+		 */
+		list_free(inhOIDs);
+		inhOIDs = NIL;
+		partition_walker_init(&walker, oldrelation);
+	}
+
+	for (; newrelation != NULL;
+		 newrelation = get_next_child(oldrelation, &oids_cell, &walker,
+									  is_partitioned_resultrel))
+	{
+		Oid			childOID = RelationGetRelid(newrelation);
 		RangeTblEntry *childrte;
 		Index		childRTindex;
 		AppendRelInfo *appinfo;
 
-		/* Open rel if needed; we already have required locks */
-		if (childOID != parentOID)
-			newrelation = heap_open(childOID, NoLock);
-		else
-			newrelation = oldrelation;
-
 		/*
 		 * It is possible that the parent table has children that are temp
 		 * tables of other backends.  We cannot safely access such tables
@@ -1575,6 +1612,10 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 
 	heap_close(oldrelation, NoLock);
 
+#ifdef DEBUG_PRINT_OIDS
+	print_oids(appinfos, parse->rtable);
+#endif
+
 	/*
 	 * If all the children were temp tables or a partitioned parent did not
 	 * have any leaf partitions, pretend it's a non-inheritance situation; we
@@ -1612,6 +1653,45 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 }
 
 /*
+ * Get the next child in an inheritance tree.
+ *
+ * This function is called to traverse two different types of lists. If it's a
+ * list containing partitions, is_partitioned is true, and 'walker' is valid.
+ * Otherwise, 'cell' points to a position in the list of inheritance children.
+ * For the partition walker, the partition traversal is done in canonical
+ * order. Whereas, for inheritance children, the list is already prepared,
+ * and is ordered depending upon the pg_inherits scan.
+ *
+ * oldrelation is the root relation in the inheritance tree. This is unused
+ * in case of is_partitioned=true.
+ */
+static Relation
+get_next_child(Relation oldrelation, ListCell **cell, PartitionWalker *walker,
+			   bool is_partitioned)
+{
+	if (is_partitioned)
+		return partition_walker_next(walker, NULL);
+	else
+	{
+		Oid		childOID;
+
+		if (!*cell)
+			return NULL; /* We are done with the list */
+
+		childOID = lfirst_oid(*cell);
+
+		/* Prepare to get the next child. */
+		*cell = lnext(*cell);
+
+		/* If it's the root relation, it is already open */
+		if (childOID != RelationGetRelid(oldrelation))
+			return heap_open(childOID, NoLock);
+		else
+			return oldrelation;
+	}
+}
+
+/*
  * make_inh_translation_list
  *	  Build the list of translations from parent Vars to child Vars for
  *	  an inheritance child.
@@ -2161,3 +2241,21 @@ adjust_appendrel_attrs_multilevel(PlannerInfo *root, Node *node,
 	/* Now translate for this child */
 	return adjust_appendrel_attrs(root, node, appinfo);
 }
+
+#ifdef DEBUG_PRINT_OIDS
+static void
+print_oids(List *oid_list, List *rtable)
+{
+	ListCell   *cell;
+	StringInfoData oids_str;
+
+	initStringInfo(&oids_str);
+	foreach(cell, oid_list)
+	{
+		AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(cell);
+		RangeTblEntry *childrte = (RangeTblEntry *) list_nth(rtable, appinfo->child_relid-1);
+		appendStringInfo(&oids_str, "%s ", get_rel_name(childrte->relid));
+	}
+	elog(NOTICE, "expanded oids: %s", oids_str.data);
+}
+#endif
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index f10879a..2662850 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -68,6 +68,12 @@ typedef struct PartitionDispatchData
 	int		   *indexes;
 } PartitionDispatchData;
 
+typedef struct PartitionWalker
+{
+	List	   *rels_list;
+	ListCell   *cur_cell;
+} PartitionWalker;
+
 typedef struct PartitionDispatchData *PartitionDispatch;
 
 extern void RelationBuildPartitionDesc(Relation relation);
@@ -84,6 +90,10 @@ extern List *map_partition_varattnos(List *expr, int target_varno,
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
 
+extern void partition_walker_init(PartitionWalker *walker, Relation rel);
+extern Relation partition_walker_next(PartitionWalker *walker,
+									  Relation *parent);
+
 /* For tuple routing */
 extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
 								 int lockmode, int *num_parted,
#125Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#124)
1 attachment(s)
Re: UPDATE of partition key

On 13 July 2017 at 22:39, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Attached is a WIP patch (make_resultrels_ordered.patch) that generates
the result rels in canonical order. This patch is kept separate from
the update-partition-key patch, and can be applied on master branch.

Attached update-partition-key_v13.patch now contains this
make_resultrels_ordered.patch changes.

So now that the per-subplan result rels and the leaf partition OIDs
generated for tuple routing are both known to be in the same
(canonical) order, in ExecSetupPartitionTupleRouting() we can look up
the per-subplan result rels without a hash table. Instead of the hash
table, we iterate over the leaf partition OIDs and at the same time
keep shifting a position over the per-subplan result rels, advancing
it whenever the result rel at that position is found among the leaf
partitions. Since the two lists are in the same order, we never have
to re-scan the portion of the lists that has already been scanned.

I considered whether the issue behind this recent commit might be
relevant for update tuple routing as well:
commit f81a91db4d1c2032632aa5df9fc14be24f5fe5ec
Author: Robert Haas <rhaas@postgresql.org>
Date: Mon Jul 17 21:29:45 2017 -0400
Use a real RT index when setting up partition tuple routing.

Since we know that using a dummy RT index of 1 for tuple-routing
result rels is not correct, I am checking another possibility: in the
latest patch, the tuple-routing partitions are a mix of (a) existing
update result rels and (b) new partition result rels. The 'b' result
rels have the RT index of nominalRelation, but the existing 'a'
result rels have their own, different RT indexes. I suspect this
might surface an issue similar to the one fixed by the above commit,
e.g. with a WITH query whose UPDATE subqueries do tuple routing. Will
check that.

This patch also has Robert's changes in the planner to decide whether
to do update tuple routing.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

update-partition-key_v13.patch (application/octet-stream)
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index b05a9c2..5a436a1 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -2993,6 +2993,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition where this row satisfies its partition constraint.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3285,9 +3290,20 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</> causes a row to move from one partition to
+       another, there is a chance that another concurrent <command>UPDATE</> or
+       <command>DELETE</> misses this row. Suppose, during the row movement,
+       the row is still visible for the concurrent session, and it is about to
+       do an <command>UPDATE</> or <command>DELETE</> operation on the same
+       row. This DML operation can silently miss this row if the row now gets
+       deleted from the partition by the first session as part of its
+       <command>UPDATE</> row movement. In such case, the concurrent
+       <command>UPDATE</>/<command>DELETE</>, being unaware of the row
+       movement, interprets that the row has just been deleted so there is
+       nothing to be done for this row. Whereas, in the usual case where the
+       table is not partitioned, or where there is no row movement, the second
+       session would have identified the newly updated row and carried
+       <command>UPDATE</>/<command>DELETE</> on this new row version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 8a1619f..28cfc1a 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</> and
+   <command>INSERT</> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</> or <command>DELETE</> on the same row may miss
+   this row. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index 950245d..72300a0 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -160,6 +160,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it will be performed as a
+    <command>DELETE</command> from the original partition followed by
+    <command>INSERT</command> into the new partition. In this case, all
+    row-level <literal>BEFORE</> <command>UPDATE</command> triggers and all
+    row-level <literal>BEFORE</> <command>DELETE</command> triggers are fired
+    on the original partition. Then all row-level <literal>BEFORE</>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</>
+    triggers are concerned, <literal>AFTER</> <command>DELETE</command> and
+    <literal>AFTER</> <command>INSERT</command> triggers are applied; but
+    <literal>AFTER</> <command>UPDATE</command> triggers are not applied
+    because the <command>UPDATE</command> has been converted to a
+    <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 74736e0..4bd8485 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -105,6 +105,16 @@ typedef struct PartitionRangeBound
 	bool		lower;			/* this is the lower (vs upper) bound */
 } PartitionRangeBound;
 
+/*
+ * List of these elements is prepared while traversing a partition tree,
+ * so as to get a consistent order of partitions.
+ */
+typedef struct ParentChild
+{
+	Oid         reloid;
+	Relation    parent;			/* Parent relation of reloid */
+} ParentChild;
+
 static int32 qsort_partition_list_value_cmp(const void *a, const void *b,
 							   void *arg);
 static int32 qsort_partition_rbound_cmp(const void *a, const void *b,
@@ -140,6 +150,8 @@ static int partition_bound_bsearch(PartitionKey key,
 						PartitionBoundInfo boundinfo,
 						void *probe, bool probe_is_bound, bool *is_equal);
 
+static List *append_rel_partition_oids(List *rel_list, Relation rel);
+
 /*
  * RelationBuildPartitionDesc
  *		Form rel's partition descriptor
@@ -893,7 +905,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
 
 /*
  * map_partition_varattnos - maps varattno of any Vars in expr from the
- * parent attno to partition attno.
+ * attno's of 'from_rel' partition to the attno's of 'to_rel' partition.
+ * The rels can be both leaf partition or a partitioned table.
  *
  * We must allow for cases where physical attnos of a partition can be
  * different from the parent's.
@@ -903,8 +916,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
  * are working on Lists, so it's less messy to do the casts internally.
  */
 List *
-map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent)
+map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel)
 {
 	AttrNumber *part_attnos;
 	bool		found_whole_row;
@@ -912,13 +925,13 @@ map_partition_varattnos(List *expr, int target_varno,
 	if (expr == NIL)
 		return NIL;
 
-	part_attnos = convert_tuples_by_name_map(RelationGetDescr(partrel),
-											 RelationGetDescr(parent),
+	part_attnos = convert_tuples_by_name_map(RelationGetDescr(to_rel),
+											 RelationGetDescr(from_rel),
 											 gettext_noop("could not convert row type"));
 	expr = (List *) map_variable_attnos((Node *) expr,
-										target_varno, 0,
+										fromrel_varno, 0,
 										part_attnos,
-										RelationGetDescr(parent)->natts,
+										RelationGetDescr(from_rel)->natts,
 										&found_whole_row);
 	/* There can never be a whole-row reference here */
 	if (found_whole_row)
@@ -971,20 +984,9 @@ get_partition_qual_relid(Oid relid)
 	return result;
 }
 
-/*
- * Append OIDs of rel's partitions to the list 'partoids' and for each OID,
- * append pointer rel to the list 'parents'.
- */
-#define APPEND_REL_PARTITION_OIDS(rel, partoids, parents) \
-	do\
-	{\
-		int		i;\
-		for (i = 0; i < (rel)->rd_partdesc->nparts; i++)\
-		{\
-			(partoids) = lappend_oid((partoids), (rel)->rd_partdesc->oids[i]);\
-			(parents) = lappend((parents), (rel));\
-		}\
-	} while(0)
+#ifdef DEBUG_PRINT_OIDS
+static void print_oids(List *oid_list);
+#endif
 
 /*
  * RelationGetPartitionDispatchInfo
@@ -998,11 +1000,13 @@ PartitionDispatch *
 RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
 								 int *num_parted, List **leaf_part_oids)
 {
+	PartitionWalker walker;
 	PartitionDispatchData **pd;
-	List	   *all_parts = NIL,
-			   *all_parents = NIL,
-			   *parted_rels,
+	Relation	partrel;
+	Relation	parent;
+	List	   *parted_rels,
 			   *parted_rel_parents;
+	List	   *inhOIDs;
 	ListCell   *lc1,
 			   *lc2;
 	int			i,
@@ -1013,21 +1017,28 @@ RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
 	 * Lock partitions and make a list of the partitioned ones to prepare
 	 * their PartitionDispatch objects below.
 	 *
-	 * Cannot use find_all_inheritors() here, because then the order of OIDs
-	 * in parted_rels list would be unknown, which does not help, because we
-	 * assign indexes within individual PartitionDispatch in an order that is
-	 * predetermined (determined by the order of OIDs in individual partition
-	 * descriptors).
+	 * Must call find_all_inheritors() here so as to lock the partitions in a
+	 * consistent order (by oid values) to prevent deadlocks. But we assign
+	 * indexes within individual PartitionDispatch in a different order
+	 * (determined by the order of OIDs in individual partition descriptors).
+	 * So, rather than using the oids returned by find_all_inheritors(), we
+	 * generate canonically ordered oids using partition walker.
 	 */
+	inhOIDs = find_all_inheritors(RelationGetRelid(rel), lockmode, NULL);
+	list_free(inhOIDs);
+
+	partition_walker_init(&walker, rel);
+	parent = NULL;
 	*num_parted = 1;
 	parted_rels = list_make1(rel);
 	/* Root partitioned table has no parent, so NULL for parent */
 	parted_rel_parents = list_make1(NULL);
-	APPEND_REL_PARTITION_OIDS(rel, all_parts, all_parents);
-	forboth(lc1, all_parts, lc2, all_parents)
+
+	/* Go to the next partition */
+	partrel = partition_walker_next(&walker, &parent);
+
+	for (; partrel != NULL; partrel = partition_walker_next(&walker, &parent))
 	{
-		Relation	partrel = heap_open(lfirst_oid(lc1), lockmode);
-		Relation	parent = lfirst(lc2);
 		PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
 
 		/*
@@ -1039,7 +1050,6 @@ RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
 			(*num_parted)++;
 			parted_rels = lappend(parted_rels, partrel);
 			parted_rel_parents = lappend(parted_rel_parents, parent);
-			APPEND_REL_PARTITION_OIDS(partrel, all_parts, all_parents);
 		}
 		else
 			heap_close(partrel, NoLock);
@@ -1143,6 +1153,10 @@ RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
 		offset += m;
 	}
 
+#ifdef DEBUG_PRINT_OIDS
+	print_oids(*leaf_part_oids);
+#endif
+
 	return pd;
 }
 
@@ -2052,6 +2066,77 @@ error_exit:
 }
 
 /*
+ * For each column of rel which is in the partition key or which appears
+ * in an expression which is in the partition key, translate the attribute
+ * number of that column according to the given parent, and add the resulting
+ * column number to the bitmapset, offset as we frequently do by
+ * FirstLowInvalidHeapAttributeNumber.
+ */
+void
+pull_child_partition_columns(Bitmapset **bitmapset,
+							 Relation rel,
+							 Relation parent)
+{
+	PartitionKey key = RelationGetPartitionKey(rel);
+	int16		partnatts = get_partition_natts(key);
+	List	   *partexprs = get_partition_exprs(key);
+	ListCell   *lc;
+	Bitmapset  *child_keycols = NULL;
+	int			i;
+	AttrNumber *map;
+	int			child_keycol = -1;
+
+	/*
+	 * First, compute the complete set of partition columns for this rel. For
+	 * compatibility with the API exposed by pull_varattnos, we offset the
+	 * column numbers by FirstLowInvalidHeapAttributeNumber.
+	 */
+	for (i = 0; i < partnatts; i++)
+	{
+		AttrNumber	partattno = get_partition_col_attnum(key, i);
+
+		if (partattno != 0)
+			child_keycols =
+				bms_add_member(child_keycols,
+							   partattno - FirstLowInvalidHeapAttributeNumber);
+	}
+	foreach(lc, partexprs)
+	{
+		Node	   *expr = (Node *) lfirst(lc);
+
+		pull_varattnos(expr, 1, &child_keycols);
+	}
+
+	/*
+	 * Next, work out how to convert from the attribute numbers for the child
+	 * to the attribute numbers for the parent.
+	 */
+	map =
+		convert_tuples_by_name_map(RelationGetDescr(parent),
+								   RelationGetDescr(rel),
+								   gettext_noop("could not convert row type"));
+
+	/*
+	 * For each child key column we have identified, translate to the
+	 * corresponding parent key column.  Entry 0 in the map array corresponds
+	 * to attribute number 1, which corresponds to a bitmapset entry for 1 -
+	 * FirstLowInvalidHeapAttributeNumber.
+	 */
+	while ((child_keycol = bms_next_member(child_keycols, child_keycol)) >= 0)
+	{
+		int			kc = child_keycol + FirstLowInvalidHeapAttributeNumber;
+
+		Assert(kc > 0 && kc <= RelationGetNumberOfAttributes(rel));
+		*bitmapset =
+			bms_add_member(*bitmapset,
+						   map[kc - 1] - FirstLowInvalidHeapAttributeNumber);
+	}
+
+	/* Release memory. */
+	pfree(map);
+}
+
+/*
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums
@@ -2318,3 +2403,100 @@ partition_bound_bsearch(PartitionKey key, PartitionBoundInfo boundinfo,
 
 	return lo;
 }
+
+/*
+ * partition_walker_init
+ *
+ * Using the passed partitioned relation, expand it into its partitions using
+ * its partition descriptor, and make a partition rel list out of those. The
+ * rel passed in itself is not kept part of the partition list. The caller
+ * should handle the first rel separately before calling this function.
+ */
+void
+partition_walker_init(PartitionWalker *walker, Relation rel)
+{
+	memset(walker, 0, sizeof(PartitionWalker));
+
+	if (RelationGetPartitionDesc(rel))
+		walker->rels_list = append_rel_partition_oids(walker->rels_list, rel);
+
+	/* Assign the first one as the current partition cell */
+	walker->cur_cell = list_head(walker->rels_list);
+}
+
+/*
+ * partition_walker_next
+ *
+ * Get the next partition in the partition tree.  If that partition is itself
+ * a partitioned table, also append its children to the end of the list, so
+ * that they are traversed on subsequent calls.
+ */
+Relation
+partition_walker_next(PartitionWalker *walker, Relation *parent)
+{
+	ParentChild	   *pc;
+	Relation    partrel;
+
+	if (walker->cur_cell == NULL)
+		return NULL;
+
+	pc = (ParentChild *) lfirst(walker->cur_cell);
+	if (parent)
+		*parent = pc->parent;
+
+	/* Open partrel without locking; find_all_inheritors() has locked it */
+	partrel = heap_open(pc->reloid, NoLock);
+
+	/*
+	 * Append the children of partrel to the same list that we are
+	 * iterating on.
+	 */
+	if (RelationGetPartitionDesc(partrel))
+		walker->rels_list = append_rel_partition_oids(walker->rels_list,
+													  partrel);
+
+	/* Advance cur_cell only now, because we modified the list above */
+	walker->cur_cell = lnext(walker->cur_cell);
+
+	return partrel;
+}
+
+/*
+ * append_rel_partition_oids
+ *
+ * Append the OIDs of rel's partitions to 'rel_list', storing the parent rel
+ * alongside each OID.
+ */
+static List *
+append_rel_partition_oids(List *rel_list, Relation rel)
+{
+	int		i;
+	PartitionDescData *partdesc = RelationGetPartitionDesc(rel);
+
+	Assert(partdesc);
+
+	for (i = 0; i < partdesc->nparts; i++)
+	{
+		ParentChild *pc = palloc(sizeof(ParentChild));
+		pc->parent = rel;
+		pc->reloid = partdesc->oids[i];
+		rel_list = lappend(rel_list, pc);
+	}
+	return rel_list;
+}
+
+#ifdef DEBUG_PRINT_OIDS
+static void
+print_oids(List *oid_list)
+{
+	ListCell   *cell;
+	StringInfoData oids_str;
+
+	initStringInfo(&oids_str);
+	foreach(cell, oid_list)
+	{
+		appendStringInfo(&oids_str, "%s ", get_rel_name(lfirst_oid(cell)));
+	}
+	elog(NOTICE, "leaf oids: %s", oids_str.data);
+}
+#endif
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 53e2965..6fb3ed6 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -168,7 +168,7 @@ typedef struct CopyStateData
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;	/* Number of entries in the above array */
 	int			num_partitions; /* Number of members in the following arrays */
-	ResultRelInfo *partitions;	/* Per partition result relation */
+	ResultRelInfo **partitions;	/* Per partition result relation pointers */
 	TupleConversionMap **partition_tupconv_maps;
 	TupleTableSlot *partition_tuple_slot;
 	TransitionCaptureState *transition_capture;
@@ -1426,13 +1426,15 @@ BeginCopy(ParseState *pstate,
 		if (is_from && rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 		{
 			PartitionDispatch *partition_dispatch_info;
-			ResultRelInfo *partitions;
+			ResultRelInfo **partitions;
 			TupleConversionMap **partition_tupconv_maps;
 			TupleTableSlot *partition_tuple_slot;
 			int			num_parted,
 						num_partitions;
 
 			ExecSetupPartitionTupleRouting(rel,
+										   NULL,
+										   0,
 										   1,
 										   &partition_dispatch_info,
 										   &partitions,
@@ -1462,7 +1464,7 @@ BeginCopy(ParseState *pstate,
 				for (i = 0; i < cstate->num_partitions; ++i)
 				{
 					cstate->transition_tupconv_maps[i] =
-						convert_tuples_by_name(RelationGetDescr(cstate->partitions[i].ri_RelationDesc),
+						convert_tuples_by_name(RelationGetDescr(cstate->partitions[i]->ri_RelationDesc),
 											   RelationGetDescr(rel),
 											   gettext_noop("could not convert row type"));
 				}
@@ -2609,7 +2611,7 @@ CopyFrom(CopyState cstate)
 			 * to the selected partition.
 			 */
 			saved_resultRelInfo = resultRelInfo;
-			resultRelInfo = cstate->partitions + leaf_part_index;
+			resultRelInfo = cstate->partitions[leaf_part_index];
 
 			/* We do not yet have a way to insert into a foreign partition */
 			if (resultRelInfo->ri_FdwRoutine)
@@ -2718,7 +2720,7 @@ CopyFrom(CopyState cstate)
 
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr || check_partition_constr)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
@@ -2838,7 +2840,7 @@ CopyFrom(CopyState cstate)
 		}
 		for (i = 0; i < cstate->num_partitions; i++)
 		{
-			ResultRelInfo *resultRelInfo = cstate->partitions + i;
+			ResultRelInfo *resultRelInfo = cstate->partitions[i];
 
 			ExecCloseIndices(resultRelInfo);
 			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index b22de78..7b22baf 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -64,6 +64,18 @@
 #include "utils/snapmgr.h"
 #include "utils/tqual.h"
 
+/*
+ * Entry of a temporary hash table. During UPDATE tuple routing, we want to
+ * know which of the leaf partitions are present in the UPDATE per-subplan
+ * resultRelInfo array (ModifyTableState->resultRelInfo[]). This hash table
+ * is searchable by the oids of the subplan result rels.
+ */
+typedef struct ResultRelOidsEntry
+{
+	Oid			rel_oid;
+	ResultRelInfo *resultRelInfo;
+} ResultRelOidsEntry;
+
 
 /* Hooks for plugins to get control in ExecutorStart/Run/Finish/End */
 ExecutorStart_hook_type ExecutorStart_hook = NULL;
@@ -103,8 +115,6 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
 									 int maxfieldlen);
 static void EvalPlanQualStart(EPQState *epqstate, EState *parentestate,
 				  Plan *planTree);
-static void ExecPartitionCheck(ResultRelInfo *resultRelInfo,
-				   TupleTableSlot *slot, EState *estate);
 
 /*
  * Note that GetUpdatedColumns() also exists in commands/trigger.c.  There does
@@ -1823,15 +1833,10 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
 /*
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  */
-static void
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
-	Relation	rel = resultRelInfo->ri_RelationDesc;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	Bitmapset  *modifiedCols;
-	Bitmapset  *insertedCols;
-	Bitmapset  *updatedCols;
 	ExprContext *econtext;
 
 	/*
@@ -1859,51 +1864,65 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
 	 */
-	if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+	return ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext);
+}
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
+
+	/* See the comments in ExecConstraints. */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
+									 gettext_noop("could not convert row type"));
+		if (map != NULL)
 		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
-										 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+			tuple = do_convert_tuple(tuple, map);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("new row for relation \"%s\" violates partition constraint",
-						RelationGetRelationName(orig_rel)),
-				 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+			 errmsg("new row for relation \"%s\" violates partition constraint",
+					RelationGetRelationName(orig_rel)),
+			 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if requested,
+ * checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1911,7 +1930,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2024,8 +2044,9 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 }
 
 
@@ -2112,6 +2133,7 @@ ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 						if (map != NULL)
 						{
 							tuple = do_convert_tuple(tuple, map);
+							ExecSetSlotDescriptor(slot, tupdesc);
 							ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 						}
 					}
@@ -3209,10 +3231,14 @@ EvalPlanQualEnd(EPQState *epqstate)
  * ExecSetupPartitionTupleRouting - set up information needed during
  * tuple routing for partitioned tables
  *
+ * 'update_rri' contains the UPDATE per-subplan result rels.
+ * 'num_update_rri' is the number of UPDATE per-subplan result rels; for
+ *      INSERT, this is 0.
+ *
  * Output arguments:
  * 'pd' receives an array of PartitionDispatch objects with one entry for
  *		every partitioned table in the partition tree
- * 'partitions' receives an array of ResultRelInfo objects with one entry for
+ * 'partitions' receives an array of ResultRelInfo* objects with one entry for
  *		every leaf partition in the partition tree
  * 'tup_conv_maps' receives an array of TupleConversionMap objects with one
  *		entry for every leaf partition (required to convert input tuple based
@@ -3232,9 +3258,11 @@ EvalPlanQualEnd(EPQState *epqstate)
  */
 void
 ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   PartitionDispatch **pd,
-							   ResultRelInfo **partitions,
+							   ResultRelInfo ***partitions,
 							   TupleConversionMap ***tup_conv_maps,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions)
@@ -3243,17 +3271,45 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
+	ResultRelInfo *leaf_part_arr;
+	ResultRelInfo *cur_update_rri;
+	Oid			cur_reloid = InvalidOid;
 
 	/* Get the tuple-routing information and lock partitions */
 	*pd = RelationGetPartitionDispatchInfo(rel, RowExclusiveLock, num_parted,
 										   &leaf_parts);
 	*num_partitions = list_length(leaf_parts);
-	*partitions = (ResultRelInfo *) palloc(*num_partitions *
-										   sizeof(ResultRelInfo));
+	*partitions = (ResultRelInfo **) palloc(*num_partitions *
+											sizeof(ResultRelInfo *));
 	*tup_conv_maps = (TupleConversionMap **) palloc0(*num_partitions *
 													 sizeof(TupleConversionMap *));
 
+	if (num_update_rri != 0)
+	{
+		/*
+		 * For UPDATEs, if a leaf partition is already present in the
+		 * per-subplan result rels, we reuse it rather than initializing a new
+		 * result rel.  The per-subplan result rels and the leaf partition
+		 * result rels are in the same canonical order, so while scanning the
+		 * leaf partition OIDs we keep track of the next per-subplan result
+		 * rel to look for: position cur_update_rri at the first per-subplan
+		 * result rel, and advance it as we find matches one by one.
+		 */
+		cur_update_rri = update_rri;
+		cur_reloid = RelationGetRelid(cur_update_rri->ri_RelationDesc);
+	}
+	else
+	{
+		/*
+		 * For inserts, we need to create all new result rels, so avoid repeated
+		 * pallocs by allocating memory for all the result rels in bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(*num_partitions *
+												  sizeof(ResultRelInfo));
+	}
+
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
 	 * given partition's rowtype.  It is attached to the caller-specified node
@@ -3262,23 +3318,75 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	 */
 	*partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = *partitions;
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (num_update_rri != 0)
+		{
+			/* Is this leaf partition present in the update result rels? */
+			if (cur_reloid == leaf_oid)
+			{
+				Assert(cur_update_rri <= update_rri + num_update_rri - 1);
+
+				leaf_part_rri = cur_update_rri;
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required for converting the tuple to the root
+				 * partition's tuple descriptor; it was not set when the
+				 * update plans were generated.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+
+				cur_update_rri++;
+				/*
+				 * If this was the last UPDATE resultrel, indicate that by
+				 * invalidating the cur_reloid.
+				 */
+				if (cur_update_rri == update_rri + num_update_rri)
+					cur_reloid = InvalidOid;
+				else
+					cur_reloid = RelationGetRelid(cur_update_rri->ri_RelationDesc);
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = leaf_part_arr + i;
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in *partitions are eventually
-		 * closed by the caller.
+		 * If we didn't open the partition rel above, we haven't initialized
+		 * its result rel either.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above including the leaf partitions.
+			 * Note that each of the newly opened relations in *partitions are
+			 * eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri,
+							  partrel,
+							  resultRTindex,
+							  rel,
+							  0);
+		}
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
-		 * Verify result relation is a valid target for the current operation.
+		 * Verify that the result relation is a valid target for INSERT.  Even
+		 * for updates, tuple routing performs an INSERT into the chosen
+		 * partition, so we check validity for INSERT here as well.
 		 */
 		CheckValidResultRel(partrel, CMD_INSERT);
 
@@ -3289,12 +3397,6 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
 													 gettext_noop("could not convert row type"));
 
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  resultRTindex,
-						  rel,
-						  0);
-
 		/*
 		 * Open partition indices (remember we do not support ON CONFLICT in
 		 * case of partitioned tables, so we do not need support information
@@ -3304,9 +3406,18 @@ ExecSetupPartitionTupleRouting(Relation rel,
 			leaf_part_rri->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(leaf_part_rri, false);
 
-		leaf_part_rri++;
+		(*partitions)[i] = leaf_part_rri;
 		i++;
 	}
+
+	/*
+	 * For UPDATE, we should have found all the per-subplan result rels among
+	 * the leaf partitions, so cur_update_rri should now point just past the
+	 * last per-subplan result rel.
+	 */
+	Assert(num_update_rri == 0 ||
+		   (cur_reloid == InvalidOid &&
+			cur_update_rri == update_rri + num_update_rri));
 }
 
 /*
@@ -3332,8 +3443,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple it if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 	result = get_partition_for_tuple(pd, slot, estate,
 									 &failed_at, &failed_slot);
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index bc53d07..eca60f2 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -402,7 +402,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -467,7 +467,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 77ba15d..9f660e7 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -45,6 +45,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -53,6 +54,8 @@
 #include "utils/rel.h"
 #include "utils/tqual.h"
 
+#define GetUpdatedColumns(relinfo, estate) \
+	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)
 
 static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 ResultRelInfo *resultRelInfo,
@@ -239,6 +242,34 @@ ExecCheckTIDVisible(EState *estate,
 	ReleaseBuffer(buffer);
 }
 
+/*
+ * ConvertPartitionTupleSlot -- convenience function for converting a tuple
+ * and storing it in the dedicated partition tuple slot.  The partition tuple
+ * slot is passed back through the output parameter p_slot; if no mapping is
+ * present, p_slot is left unchanged.
+ *
+ * Returns the converted tuple.
+ */
+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate, TupleConversionMap *map,
+						  HeapTuple tuple, TupleTableSlot **p_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor, as per converted tuple.
+	 */
+	*p_slot = mtstate->mt_partition_tuple_slot;
+	Assert(*p_slot != NULL);
+	ExecSetSlotDescriptor(*p_slot, map->outdesc);
+	ExecStoreTuple(tuple, *p_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
 /* ----------------------------------------------------------------
  *		ExecInsert
  *
@@ -280,7 +311,38 @@ ExecInsert(ModifyTableState *mtstate,
 	if (mtstate->mt_partition_dispatch_info)
 	{
 		int			leaf_part_index;
-		TupleConversionMap *map;
+		ResultRelInfo *rootResultRelInfo;
+
+		/*
+		 * If the original operation is UPDATE, the root partition rel needs to
+		 * be fetched from mtstate->rootResultRelInfo.
+		 */
+		rootResultRelInfo = (mtstate->rootResultRelInfo ?
+							 mtstate->rootResultRelInfo : resultRelInfo);
+
+		/*
+		 * If the resultRelInfo is not the root partition (which happens for
+		 * UPDATE), we should convert the tuple into the root partition's
+		 * tuple descriptor, since ExecFindPartition() starts the search from
+		 * the root.  The tuple conversion map list is in the order of
+		 * mtstate->resultRelInfo[], so to retrieve the map for this
+		 * resultRel, we need to know its position in
+		 * mtstate->resultRelInfo[].  Note: we assume that if the
+		 * resultRelInfo does not belong to the subplans, then it already
+		 * matches the root tuple descriptor, although no such scenario is
+		 * currently known.
+		 */
+		if (rootResultRelInfo != resultRelInfo &&
+			mtstate->mt_resultrel_maps != NULL &&
+			resultRelInfo >= mtstate->resultRelInfo &&
+			resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1)
+		{
+			int		map_index = resultRelInfo - mtstate->resultRelInfo;
+
+			tuple = ConvertPartitionTupleSlot(mtstate,
+									  mtstate->mt_resultrel_maps[map_index],
+									  tuple, &slot);
+		}
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
@@ -290,7 +352,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
-		leaf_part_index = ExecFindPartition(resultRelInfo,
+		leaf_part_index = ExecFindPartition(rootResultRelInfo,
 											mtstate->mt_partition_dispatch_info,
 											slot,
 											estate);
@@ -302,7 +364,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * the selected partition.
 		 */
 		saved_resultRelInfo = resultRelInfo;
-		resultRelInfo = mtstate->mt_partitions + leaf_part_index;
+		resultRelInfo = mtstate->mt_partitions[leaf_part_index];
 
 		/* We do not yet have a way to insert into a foreign partition */
 		if (resultRelInfo->ri_FdwRoutine)
@@ -347,23 +409,9 @@ ExecInsert(ModifyTableState *mtstate,
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = mtstate->mt_partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = mtstate->mt_partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(mtstate,
+						mtstate->mt_partition_tupconv_maps[leaf_part_index],
+						tuple, &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -481,7 +529,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 		/* Check the constraints of the tuple */
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -673,6 +721,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool   *concurrently_deleted,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -681,6 +731,9 @@ ExecDelete(ModifyTableState *mtstate,
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
 
+	if (concurrently_deleted)
+		*concurrently_deleted = false;
+
 	/*
 	 * get information on the (current) result relation
 	 */
@@ -824,6 +877,8 @@ ldelete:;
 					}
 				}
 				/* tuple already deleted; nothing to do */
+				if (concurrently_deleted)
+					*concurrently_deleted = true;
 				return NULL;
 
 			default:
@@ -848,8 +903,8 @@ ldelete:;
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
 						 mtstate->mt_transition_capture);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -1038,12 +1093,51 @@ lreplace:;
 								 resultRelInfo, slot, estate);
 
 		/*
+		 * If a partition check fails, try to move the row into the right
+		 * partition.
+		 */
+		if (resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate))
+		{
+			bool	concurrently_deleted;
+
+			/*
+			 * When an UPDATE is run directly on a leaf partition, partition
+			 * tuple routing is not set up.  In that case, fail with a
+			 * partition constraint violation error.
+			 */
+			if (mtstate->mt_partition_dispatch_info == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for the DELETE; the rows to return
+			 * come from the subsequent INSERT.
+			 */
+			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
+					   &concurrently_deleted, false, false);
+
+			/*
+			 * The row was already deleted by a concurrent DELETE, so we have
+			 * nothing to update.
+			 */
+			if (concurrently_deleted)
+				return NULL;
+
+			return ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+		}
+
+		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We have already checked partition constraints above, so skip them
+		 * below.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1482,23 +1576,22 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	 */
 	if (mtstate->mt_transition_capture != NULL)
 	{
-		ResultRelInfo *resultRelInfos;
+		ResultRelInfo *resultRelInfo;
 		int		numResultRelInfos;
+		bool	tuple_routing = (mtstate->mt_partition_dispatch_info != NULL);
 
 		/* Find the set of partitions so that we can find their TupleDescs. */
-		if (mtstate->mt_partition_dispatch_info != NULL)
+		if (tuple_routing)
 		{
 			/*
 			 * For INSERT via partitioned table, so we need TupleDescs based
 			 * on the partition routing table.
 			 */
-			resultRelInfos = mtstate->mt_partitions;
 			numResultRelInfos = mtstate->mt_num_partitions;
 		}
 		else
 		{
 			/* Otherwise we need the ResultRelInfo for each subplan. */
-			resultRelInfos = mtstate->resultRelInfo;
 			numResultRelInfos = mtstate->mt_nplans;
 		}
 
@@ -1512,8 +1605,15 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
 		for (i = 0; i < numResultRelInfos; ++i)
 		{
+			/*
+			 * As noted above, the mapping source differs depending on whether
+			 * tuple routing is in use.
+			 */
+			resultRelInfo = (tuple_routing ?
+					mtstate->mt_partitions[i] : &mtstate->resultRelInfo[i]);
+
 			mtstate->mt_transition_tupconv_maps[i] =
-				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
+				convert_tuples_by_name(RelationGetDescr(resultRelInfo->ri_RelationDesc),
 									   RelationGetDescr(targetRelInfo->ri_RelationDesc),
 									   gettext_noop("could not convert row type"));
 		}
@@ -1746,7 +1846,8 @@ ExecModifyTable(ModifyTableState *node)
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1791,9 +1892,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ResultRelInfo *resultRelInfo;
 	TupleDesc	tupDesc;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	bool		update_tuple_routing_needed = node->part_cols_updated;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -1865,6 +1969,15 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(resultRelInfo, mtstate->mt_onconflict != ONCONFLICT_NONE);
 
+		/*
+		 * If this is an UPDATE and a BEFORE UPDATE trigger is present, we may
+		 * need to do update tuple routing.
+		 */
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+			operation == CMD_UPDATE)
+			update_tuple_routing_needed = true;
+
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
 		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
@@ -1902,18 +2015,28 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/* Decide whether we need to perform update tuple routing. */
+	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		update_tuple_routing_needed = false;
+
+	/*
+	 * Build state for tuple routing if it's an INSERT or if it's an UPDATE of
+	 * partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || update_tuple_routing_needed))
 	{
 		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo *partitions;
+		ResultRelInfo **partitions;
 		TupleConversionMap **partition_tupconv_maps;
 		TupleTableSlot *partition_tuple_slot;
 		int			num_parted,
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(rel,
+									   (operation == CMD_UPDATE ?
+											mtstate->resultRelInfo : NULL),
+									   (operation == CMD_UPDATE ? nplans : 0),
 									   node->nominalRelation,
 									   &partition_dispatch_info,
 									   &partitions,
@@ -1926,6 +2049,43 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		mtstate->mt_num_partitions = num_partitions;
 		mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
 		mtstate->mt_partition_tuple_slot = partition_tuple_slot;
+
+		/*
+		 * The following are needed as reference objects for mapping partition
+		 * attno's in expressions such as WCO and RETURNING.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
+	}
+
+	/*
+	 * Construct a mapping from each of the resultRelInfo attnos to the root
+	 * attno. This is required during update row movement, when the tuple
+	 * descriptor of a source partition does not match the root partition's
+	 * descriptor; in that case we must convert tuples to the root partition's
+	 * tuple descriptor, because the search for the destination partition
+	 * starts at the root. Skip this setup if it's not a partition key update
+	 * or if there are no partitions below this partitioned table.
+	 */
+	if (update_tuple_routing_needed && mtstate->mt_num_partitions > 0)
+	{
+		TupleConversionMap **tup_conv_maps;
+		TupleDesc		outdesc;
+
+		mtstate->mt_resultrel_maps = (TupleConversionMap **)
+			palloc0(sizeof(TupleConversionMap *) * nplans);
+
+		/* Get tuple descriptor of the root partition. */
+		outdesc = RelationGetDescr(mtstate->mt_partition_dispatch_info[0]->reldesc);
+
+		resultRelInfo = mtstate->resultRelInfo;
+		tup_conv_maps = mtstate->mt_resultrel_maps;
+		for (i = 0; i < nplans; i++)
+		{
+			TupleDesc indesc = RelationGetDescr(resultRelInfo[i].ri_RelationDesc);
+			tup_conv_maps[i] = convert_tuples_by_name(indesc, outdesc,
+								 gettext_noop("could not convert row type"));
+		}
 	}
 
 	/* Build state for collecting transition tuples */
@@ -1961,50 +2121,52 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE
+	 * row movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *firstWco;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition. For UPDATE, however, there are as many WCO lists as
+		 * there are plans. So in either case, use the WCO expression of the
+		 * first resultRelInfo as a reference to calculate attno's for the
+		 * WCO expression of each of the partitions. We make a copy of the
+		 * WCO qual for each partition. Note that, if there are SubPlans in
+		 * there, they all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
+		Assert(update_tuple_routing_needed ||
+			   (operation == CMD_INSERT &&
 			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
-		resultRelInfo = mtstate->mt_partitions;
+			   mtstate->mt_nplans == 1));
+
+		firstWco = linitial(node->withCheckOptionLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-			List	   *mapped_wcoList;
+			Relation	partrel;
+			List	   *mappedWco;
 			List	   *wcoExprs = NIL;
 			ListCell   *ll;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel);
-			foreach(ll, mapped_wcoList)
+			resultRelInfo = mtstate->mt_partitions[i];
+
+			partrel = resultRelInfo->ri_RelationDesc;
+			mappedWco = map_partition_varattnos(firstWco, firstVarno,
+												partrel, firstResultRel);
+			foreach(ll, mappedWco)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
 
-			resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
+			resultRelInfo->ri_WithCheckOptions = mappedWco;
 			resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
-			resultRelInfo++;
 		}
 	}
 
@@ -2015,7 +2177,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -2052,20 +2214,25 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		resultRelInfo = mtstate->mt_partitions;
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
+			Relation	partrel;
 			List	   *rlist;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel);
+			resultRelInfo = mtstate->mt_partitions[i];
+			partrel = resultRelInfo->ri_RelationDesc;
+
+			/*
+			 * Use the returning expression of the first resultRelInfo as a
+			 * reference to calculate attno's for the returning expression of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList, firstVarno,
+											partrel, firstResultRel);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
@@ -2308,6 +2475,7 @@ void
 ExecEndModifyTable(ModifyTableState *node)
 {
 	int			i;
+	CmdType		operation = node->operation;
 
 	/* Free transition tables */
 	if (node->mt_transition_capture != NULL)
@@ -2344,7 +2512,17 @@ ExecEndModifyTable(ModifyTableState *node)
 	}
 	for (i = 0; i < node->mt_num_partitions; i++)
 	{
-		ResultRelInfo *resultRelInfo = node->mt_partitions + i;
+		ResultRelInfo *resultRelInfo = node->mt_partitions[i];
+
+		/*
+		 * If this result rel is one of the subplan result rels, let
+		 * ExecEndPlan() close it. For INSERT, this does not apply, because
+		 * all leaf partition result rels are newly allocated anyway.
+		 */
+		if (operation == CMD_UPDATE &&
+			resultRelInfo >= node->resultRelInfo &&
+			resultRelInfo < node->resultRelInfo + node->mt_nplans)
+			continue;
 
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 45a04b0..4156e02 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -204,6 +204,7 @@ _copyModifyTable(const ModifyTable *from)
 	COPY_SCALAR_FIELD(canSetTag);
 	COPY_SCALAR_FIELD(nominalRelation);
 	COPY_NODE_FIELD(partitioned_rels);
+	COPY_SCALAR_FIELD(part_cols_updated);
 	COPY_NODE_FIELD(resultRelations);
 	COPY_SCALAR_FIELD(resultRelIndex);
 	COPY_SCALAR_FIELD(rootResultRelIndex);
@@ -2256,6 +2257,7 @@ _copyPartitionedChildRelInfo(const PartitionedChildRelInfo *from)
 
 	COPY_SCALAR_FIELD(parent_relid);
 	COPY_NODE_FIELD(child_rels);
+	COPY_BITMAPSET_FIELD(all_part_cols);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 8d92c03..f2df72b 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -909,6 +909,7 @@ _equalPartitionedChildRelInfo(const PartitionedChildRelInfo *a, const Partitione
 {
 	COMPARE_SCALAR_FIELD(parent_relid);
 	COMPARE_NODE_FIELD(child_rels);
+	COMPARE_BITMAPSET_FIELD(all_part_cols);
 
 	return true;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 379d92a..2ca8a71 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -367,6 +367,7 @@ _outModifyTable(StringInfo str, const ModifyTable *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_INT_FIELD(resultRelIndex);
 	WRITE_INT_FIELD(rootResultRelIndex);
@@ -2094,6 +2095,7 @@ _outModifyTablePath(StringInfo str, const ModifyTablePath *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_NODE_FIELD(subroots);
@@ -2516,6 +2518,7 @@ _outPartitionedChildRelInfo(StringInfo str, const PartitionedChildRelInfo *node)
 
 	WRITE_UINT_FIELD(parent_relid);
 	WRITE_NODE_FIELD(child_rels);
+	WRITE_BITMAPSET_FIELD(all_part_cols);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 86c811d..949053c 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1562,6 +1562,7 @@ _readModifyTable(void)
 	READ_BOOL_FIELD(canSetTag);
 	READ_UINT_FIELD(nominalRelation);
 	READ_NODE_FIELD(partitioned_rels);
+	READ_BOOL_FIELD(part_cols_updated);
 	READ_NODE_FIELD(resultRelations);
 	READ_INT_FIELD(resultRelIndex);
 	READ_INT_FIELD(rootResultRelIndex);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index f087ddb..064af0f 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1291,7 +1291,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	rte = planner_rt_fetch(rel->relid, root);
 	if (rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		partitioned_rels = get_partitioned_child_rels(root, rel->relid);
+		partitioned_rels = get_partitioned_child_rels(root, rel->relid, NULL);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
 	}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e589d92..7e4f058 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -277,6 +277,7 @@ static ProjectSet *make_project_set(List *tlist, Plan *subplan);
 static ModifyTable *make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
@@ -2357,6 +2358,7 @@ create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 							best_path->canSetTag,
 							best_path->nominalRelation,
 							best_path->partitioned_rels,
+							best_path->part_cols_updated,
 							best_path->resultRelations,
 							subplans,
 							best_path->withCheckOptionLists,
@@ -6398,6 +6400,7 @@ static ModifyTable *
 make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam)
@@ -6424,6 +6427,7 @@ make_modifytable(PlannerInfo *root,
 	node->canSetTag = canSetTag;
 	node->nominalRelation = nominalRelation;
 	node->partitioned_rels = partitioned_rels;
+	node->part_cols_updated = part_cols_updated;
 	node->resultRelations = resultRelations;
 	node->resultRelIndex = -1;	/* will be set correctly in setrefs.c */
 	node->rootResultRelIndex = -1;	/* will be set correctly in setrefs.c */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 2988c11..cf91907 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1042,6 +1042,7 @@ inheritance_planner(PlannerInfo *root)
 	Index		rti;
 	RangeTblEntry *parent_rte;
 	List	   *partitioned_rels = NIL;
+	bool		part_cols_updated = false;
 
 	Assert(parse->commandType != CMD_INSERT);
 
@@ -1356,9 +1357,15 @@ inheritance_planner(PlannerInfo *root)
 
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		partitioned_rels = get_partitioned_child_rels(root, parentRTindex);
+		Bitmapset  *all_part_cols = NULL;
+
+		partitioned_rels = get_partitioned_child_rels(root, parentRTindex,
+													  &all_part_cols);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
+
+		if (bms_overlap(all_part_cols, parent_rte->updatedCols))
+			part_cols_updated = true;
 	}
 
 	/* Result path must go into outer query's FINAL upperrel */
@@ -1415,6 +1422,7 @@ inheritance_planner(PlannerInfo *root)
 									 parse->canSetTag,
 									 nominalRelation,
 									 partitioned_rels,
+									 part_cols_updated,
 									 resultRelations,
 									 subpaths,
 									 subroots,
@@ -2032,6 +2040,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 										parse->canSetTag,
 										parse->resultRelation,
 										NIL,
+										false,
 										list_make1_int(parse->resultRelation),
 										list_make1(path),
 										list_make1(root),
@@ -6062,10 +6071,15 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
  *		Returns a list of the RT indexes of the partitioned child relations
  *		with rti as the root parent RT index.
  *
+ * If all_part_cols_p is non-NULL, *all_part_cols_p is set to a bitmapset
+ * of all partitioning columns used by the partitioned table or any
+ * descendant.
+ *
  * Note: Only call this function on RTEs known to be partitioned tables.
  */
 List *
-get_partitioned_child_rels(PlannerInfo *root, Index rti)
+get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   Bitmapset **all_part_cols_p)
 {
 	List	   *result = NIL;
 	ListCell   *l;
@@ -6077,6 +6091,8 @@ get_partitioned_child_rels(PlannerInfo *root, Index rti)
 		if (pc->parent_relid == rti)
 		{
 			result = pc->child_rels;
+			if (all_part_cols_p)
+				*all_part_cols_p = pc->all_part_cols;
 			break;
 		}
 	}
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index cf46b74..b854d6c 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -33,6 +33,7 @@
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
+#include "catalog/partition.h"
 #include "catalog/pg_inherits_fn.h"
 #include "catalog/pg_type.h"
 #include "miscadmin.h"
@@ -99,6 +100,8 @@ static List *generate_append_tlist(List *colTypes, List *colCollations,
 static List *generate_setop_grouplist(SetOperationStmt *op, List *targetlist);
 static void expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte,
 						 Index rti);
+static Relation get_next_child(Relation oldrelation, ListCell **cell,
+						PartitionWalker *walker, bool is_partitioned);
 static void make_inh_translation_list(Relation oldrelation,
 						  Relation newrelation,
 						  Index newvarno,
@@ -1370,13 +1373,17 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	Oid			parentOID;
 	PlanRowMark *oldrc;
 	Relation	oldrelation;
+	Relation	newrelation;
 	LOCKMODE	lockmode;
 	List	   *inhOIDs;
 	List	   *appinfos;
-	ListCell   *l;
+	ListCell   *oids_cell;
 	bool		need_append;
+	bool		is_partitioned_resultrel;
 	PartitionedChildRelInfo *pcinfo;
+	PartitionWalker walker;
 	List	   *partitioned_child_rels = NIL;
+	Bitmapset  *all_part_cols = NULL;
 
 	/* Does RT entry allow inheritance? */
 	if (!rte->inh)
@@ -1446,23 +1453,54 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	 */
 	oldrelation = heap_open(parentOID, NoLock);
 
+	/*
+	 * Remember whether this is a partitioned result relation; we need this to
+	 * decide the ordering of the result rels.
+	 */
+	is_partitioned_resultrel =
+		(oldrelation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE
+		 && rti == parse->resultRelation);
+
 	/* Scan the inheritance set and expand it */
 	appinfos = NIL;
 	need_append = false;
-	foreach(l, inhOIDs)
+	newrelation = oldrelation;
+
+	/* For non-partitioned result-rels, open the first child from inhOIDs */
+	if (!is_partitioned_resultrel)
+	{
+		oids_cell = list_head(inhOIDs);
+		newrelation = get_next_child(oldrelation, &oids_cell, &walker,
+									 is_partitioned_resultrel);
+	}
+	else
+	{
+		/*
+		 * For partitioned resultrels, we don't need the inhOIDs list itself,
+		 * because we traverse the tree in canonical order anyway; but we do
+		 * want to lock all the children in a consistent order (see
+		 * find_inheritance_children) so as to avoid unnecessary deadlocks;
+		 * hence the call to find_all_inheritors() above. The aim is to
+		 * generate the appinfos in canonical order, so that the result rels,
+		 * if generated later, are in the same order as the leaf partitions
+		 * maintained during insert/update tuple routing. Maintaining the
+		 * same order speeds up searching for a given leaf partition in
+		 * these result rels.
+		 */
+		list_free(inhOIDs);
+		inhOIDs = NIL;
+		partition_walker_init(&walker, oldrelation);
+	}
+
+	for (; newrelation != NULL;
+		 newrelation = get_next_child(oldrelation, &oids_cell, &walker,
+									  is_partitioned_resultrel))
 	{
-		Oid			childOID = lfirst_oid(l);
-		Relation	newrelation;
+		Oid			childOID = RelationGetRelid(newrelation);
 		RangeTblEntry *childrte;
 		Index		childRTindex;
 		AppendRelInfo *appinfo;
 
-		/* Open rel if needed; we already have required locks */
-		if (childOID != parentOID)
-			newrelation = heap_open(childOID, NoLock);
-		else
-			newrelation = oldrelation;
-
 		/*
 		 * It is possible that the parent table has children that are temp
 		 * tables of other backends.  We cannot safely access such tables
@@ -1535,8 +1573,12 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 			}
 		}
 		else
+		{
 			partitioned_child_rels = lappend_int(partitioned_child_rels,
 												 childRTindex);
+			pull_child_partition_columns(&all_part_cols, newrelation,
+										 oldrelation);
+		}
 
 		/*
 		 * Build a PlanRowMark if parent is marked FOR UPDATE/SHARE.
@@ -1575,6 +1617,10 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 
 	heap_close(oldrelation, NoLock);
 
+#ifdef DEBUG_PRINT_OIDS
+	print_oids(appinfos, parse->rtable);
+#endif
+
 	/*
 	 * If all the children were temp tables or a partitioned parent did not
 	 * have any leaf partitions, pretend it's a non-inheritance situation; we
@@ -1604,6 +1650,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 		Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
 		pcinfo->parent_relid = rti;
 		pcinfo->child_rels = partitioned_child_rels;
+		pcinfo->all_part_cols = all_part_cols;
 		root->pcinfo_list = lappend(root->pcinfo_list, pcinfo);
 	}
 
@@ -1612,6 +1659,45 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 }
 
 /*
+ * Get the next child in an inheritance tree.
+ *
+ * This function traverses two different types of lists. If the children are
+ * partitions, is_partitioned is true and 'walker' is valid. Otherwise,
+ * 'cell' points to a position in the list of inheritance children. For
+ * partitions, the walker traverses the tree in canonical partition order,
+ * whereas for regular inheritance children the list is already prepared and
+ * ordered as per the pg_inherits scan.
+ *
+ * oldrelation is the root relation of the inheritance tree; it is unused
+ * when is_partitioned is true.
+ */
+static Relation
+get_next_child(Relation oldrelation, ListCell **cell, PartitionWalker *walker,
+			   bool is_partitioned)
+{
+	if (is_partitioned)
+		return partition_walker_next(walker, NULL);
+	else
+	{
+		Oid		childOID;
+
+		if (!*cell)
+			return NULL; /* We are done with the list */
+
+		childOID = lfirst_oid(*cell);
+
+		/* Prepare to get the next child. */
+		*cell = lnext(*cell);
+
+		/* If it's the root relation, it is already open */
+		if (childOID != RelationGetRelid(oldrelation))
+			return heap_open(childOID, NoLock);
+		else
+			return oldrelation;
+	}
+}
+
+/*
  * make_inh_translation_list
  *	  Build the list of translations from parent Vars to child Vars for
  *	  an inheritance child.
@@ -2161,3 +2247,21 @@ adjust_appendrel_attrs_multilevel(PlannerInfo *root, Node *node,
 	/* Now translate for this child */
 	return adjust_appendrel_attrs(root, node, appinfo);
 }
+
+#ifdef DEBUG_PRINT_OIDS
+static void
+print_oids(List *oid_list, List *rtable)
+{
+	ListCell   *cell;
+	StringInfoData oids_str;
+
+	initStringInfo(&oids_str);
+	foreach(cell, oid_list)
+	{
+		AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(cell);
+		RangeTblEntry *childrte = (RangeTblEntry *) list_nth(rtable, appinfo->child_relid-1);
+		appendStringInfo(&oids_str, "%s ", get_rel_name(childrte->relid));
+	}
+	elog(NOTICE, "expanded oids: %s", oids_str.data);
+}
+#endif
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index f2d6385..f63edf4 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3161,6 +3161,8 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partitioned_rels' is an integer list of RT indexes of non-leaf tables in
  *		the partition tree, if this is an UPDATE/DELETE to a partitioned table.
  *		Otherwise NIL.
+ * 'part_cols_updated' is true if any partitioning columns are being updated,
+ *		either in the named relation or in a descendant partitioned table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
  * 'subpaths' is a list of Path(s) producing source data (one per rel)
  * 'subroots' is a list of PlannerInfo structs (one per rel)
@@ -3174,6 +3176,7 @@ ModifyTablePath *
 create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
@@ -3241,6 +3244,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->canSetTag = canSetTag;
 	pathnode->nominalRelation = nominalRelation;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
+	pathnode->part_cols_updated = part_cols_updated;
 	pathnode->resultRelations = resultRelations;
 	pathnode->subpaths = subpaths;
 	pathnode->subroots = subroots;
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index f10879a..e6af17d 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -68,6 +68,12 @@ typedef struct PartitionDispatchData
 	int		   *indexes;
 } PartitionDispatchData;
 
+typedef struct PartitionWalker
+{
+	List	   *rels_list;
+	ListCell   *cur_cell;
+} PartitionWalker;
+
 typedef struct PartitionDispatchData *PartitionDispatch;
 
 extern void RelationBuildPartitionDesc(Relation relation);
@@ -79,11 +85,15 @@ extern void check_new_partition_bound(char *relname, Relation parent,
 extern Oid	get_partition_parent(Oid relid);
 extern List *get_qual_from_partbound(Relation rel, Relation parent,
 						PartitionBoundSpec *spec);
-extern List *map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent);
+extern List *map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
 
+extern void partition_walker_init(PartitionWalker *walker, Relation rel);
+extern Relation partition_walker_next(PartitionWalker *walker,
+									  Relation *parent);
+
 /* For tuple routing */
 extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
 								 int lockmode, int *num_parted,
@@ -98,4 +108,8 @@ extern int get_partition_for_tuple(PartitionDispatch *pd,
 						EState *estate,
 						PartitionDispatchData **failed_at,
 						TupleTableSlot **failed_slot);
+extern void pull_child_partition_columns(Bitmapset **bitmapset,
+							 Relation rel,
+							 Relation parent);
+
 #endif							/* PARTITION_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 59c28b7..94f8acf 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,7 +187,10 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
@@ -207,9 +210,11 @@ extern void EvalPlanQualSetTuple(EPQState *epqstate, Index rti,
 					 HeapTuple tuple);
 extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
 extern void ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   PartitionDispatch **pd,
-							   ResultRelInfo **partitions,
+							   ResultRelInfo ***partitions,
 							   TupleConversionMap ***tup_conv_maps,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions);
@@ -217,6 +222,8 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot, EState *estate);
 
 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
 extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 85fac8a..276b65b 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -959,9 +959,13 @@ typedef struct ModifyTableState
 	int			mt_num_dispatch;	/* Number of entries in the above array */
 	int			mt_num_partitions;	/* Number of members in the following
 									 * arrays */
-	ResultRelInfo *mt_partitions;	/* Per partition result relation */
-	TupleConversionMap **mt_partition_tupconv_maps;
+	ResultRelInfo **mt_partitions;	/* Per partition result relation pointers */
+
 	/* Per partition tuple conversion map */
+	TupleConversionMap **mt_partition_tupconv_maps;
+	/* Per resultRelInfo conversion map to convert tuples to root partition */
+	TupleConversionMap **mt_resultrel_maps;
+
 	TupleTableSlot *mt_partition_tuple_slot;
 	struct TransitionCaptureState *mt_transition_capture;
 									/* controls transition table population */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f1a1b24..cd670b9 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -219,6 +219,7 @@ typedef struct ModifyTable
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	int			resultRelIndex; /* index of first resultRel in plan's list */
 	int			rootResultRelIndex; /* index of the partitioned table root */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 9bae3c6..3013964 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1579,6 +1579,7 @@ typedef struct ModifyTablePath
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	List	   *subpaths;		/* Path(s) producing source data */
 	List	   *subroots;		/* per-target-table PlannerInfos */
@@ -2019,6 +2020,10 @@ typedef struct AppendRelInfo
  * The child_rels list must contain at least one element, because the parent
  * partitioned table is itself counted as a child.
  *
+ * all_part_cols contains all attribute numbers from the parent that are
+ * used as partitioning columns by the parent or some descendant that is
+ * itself partitioned.
+ *
  * These structs are kept in the PlannerInfo node's pcinfo_list.
  */
 typedef struct PartitionedChildRelInfo
@@ -2027,6 +2032,7 @@ typedef struct PartitionedChildRelInfo
 
 	Index		parent_relid;
 	List	   *child_rels;
+	Bitmapset  *all_part_cols;
 } PartitionedChildRelInfo;
 
 /*
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 0c0549d..d35f448 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -235,6 +235,7 @@ extern ModifyTablePath *create_modifytable_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index 2a4cf71..c6c15c5 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -57,6 +57,7 @@ extern Expr *preprocess_phv_expression(PlannerInfo *root, Expr *expr);
 
 extern bool plan_cluster_use_sort(Oid tableOid, Oid indexOid);
 
-extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti);
+extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti,
+										Bitmapset **all_part_cols_p);
 
 #endif							/* PLANNER_H */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 9366f04..20d4878 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,25 +198,185 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c int
 ) partition by range (a, b);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
 create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
 create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+create table part_c_1_100 (b int, c int, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+create table part_c_100_200 (c int, a text, b int);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
 insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+insert into part_a_10_a_20 values ('a', 10, 200);
+insert into part_c_1_100 (a, b, c) values ('b', 12, 96);
+insert into part_c_1_100 (a, b, c) values ('b', 13, 97);
+insert into part_c_100_200 (a, b, c) values ('b', 15, 105);
+insert into part_c_100_200 (a, b, c) values ('b', 17, 105);
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96;
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (12, 116, b).
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+select a, b, c from part_c_1_100 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select a, b, c from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 12 | 116
+ b | 13 | 117
+ b | 15 | 125
+ b | 17 | 125
+(4 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (117, b, 7).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+select tableoid::regclass partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_1_a_10  | a |  1 |    
+ part_a_1_a_10  | a |  4 | 200
+ part_b_1_b_10  | b |  7 | 117
+ part_b_1_b_10  | b |  9 | 125
+ part_c_100_200 | b | 11 | 125
+ part_c_100_200 | b | 12 | 116
+(6 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120).
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+select tableoid::regclass partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_1_a_10  | a |  1 |    
+ part_b_1_b_10  | b |  7 | 117
+ part_b_1_b_10  | b |  9 | 125
+ part_c_100_200 | b | 11 | 125
+ part_c_100_200 | b | 12 | 116
+ part_c_100_200 | b | 15 | 199
+(6 rows)
+
 -- cleanup
-drop table range_parted;
+drop view upview;
+drop table mintab, range_parted;
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+-- Setup
+--------
+create table list_parted (a int, b int, c int) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+create table sub_part1(b int, c int, a int);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int, a int);
+alter table sub_parted attach partition sub_part2 for values in (2);
+create table list_part1(a int, b int, c int);
+alter table list_parted attach partition list_part1 for values in (2,3);
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+ERROR:  new row for relation "sub_part2" violates partition constraint
+DETAIL:  Failing row contains (2, 10, 2).
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Triggers can cause UPDATE row movement if it modified partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+select tableoid::regclass , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop trigger parted_mod_b ON sub_part1 ;
+drop function func_parted_mod_b ( ) ;
+drop table list_parted;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 6637119..edaf19a 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,23 +107,126 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c int
 ) partition by range (a, b);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+
 create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
 create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
+
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
+create table part_c_1_100 (b int, c int, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+create table part_c_100_200 (c int, a text, b int);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+
+insert into part_a_1_a_10 values ('a', 1);
+insert into part_a_10_a_20 values ('a', 10, 200);
+insert into part_c_1_100 (a, b, c) values ('b', 12, 96);
+insert into part_c_1_100 (a, b, c) values ('b', 13, 97);
+insert into part_c_100_200 (a, b, c) values ('b', 15, 105);
+insert into part_c_100_200 (a, b, c) values ('b', 17, 105);
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96;
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+select a, b, c from part_c_1_100 order by 1, 2, 3;
+select a, b, c from part_c_100_200 order by 1, 2, 3;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+select tableoid::regclass partname, * from range_parted order by 1, 2, 3, 4;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+
+select tableoid::regclass partname, * from range_parted order by 1, 2, 3, 4;
 
 -- cleanup
-drop table range_parted;
+drop view upview;
+drop table mintab, range_parted;
+
+
+
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+
+-- Setup
+--------
+create table list_parted (a int, b int, c int) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+
+create table sub_part1(b int, c int, a int);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int, a int);
+alter table sub_parted attach partition sub_part2 for values in (2);
+
+create table list_part1(a int, b int, c int);
+alter table list_parted attach partition list_part1 for values in (2,3);
+
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+
+
+-----------
+-- Triggers can cause UPDATE row movement if it modified partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+select tableoid::regclass , * from list_parted order by 1, 2, 3, 4;
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger parted_mod_b ON sub_part1 ;
+drop function func_parted_mod_b ( ) ;
+drop table list_parted;
#126Rajkumar Raghuwanshi
rajkumar.raghuwanshi@enterprisedb.com
In reply to: Amit Khandekar (#125)
Re: UPDATE of partition key

On Mon, Jul 24, 2017 at 11:23 AM, Amit Khandekar <amitdkhan.pg@gmail.com>
wrote:

Attached update-partition-key_v13.patch now contains this
make_resultrels_ordered.patch changes.

I have applied attach patch and got below observation.

Observation: if the join produces multiple output rows for a given row to be
modified, I am seeing that it both updates the row and inserts new rows into
the target table; hence, after the update, the total row count of the table
has increased.

below are steps:
postgres=# create table part_upd (a int, b int) partition by range(a);
CREATE TABLE
postgres=# create table part_upd1 partition of part_upd for values from
(minvalue) to (-10);
CREATE TABLE
postgres=# create table part_upd2 partition of part_upd for values from
(-10) to (0);
CREATE TABLE
postgres=# create table part_upd3 partition of part_upd for values from (0)
to (10);
CREATE TABLE
postgres=# create table part_upd4 partition of part_upd for values from
(10) to (maxvalue);
CREATE TABLE
postgres=# insert into part_upd select i,i from generate_series(-30,30,3)i;
INSERT 0 21

postgres=# select count(*) from part_upd;
 count
-------
    21
(1 row)
postgres=#
postgres=# create table non_part_upd (a int);
CREATE TABLE
postgres=# insert into non_part_upd select i%2 from
generate_series(-30,30,5)i;
INSERT 0 13
postgres=# update part_upd t1 set a = (t2.a+10) from non_part_upd t2 where
t2.a = t1.b;
UPDATE 7

postgres=# select count(*) from part_upd;
 count
-------
    27
(1 row)
postgres=# select tableoid::regclass,* from part_upd;
tableoid | a | b
-----------+-----+-----
part_upd1 | -30 | -30
part_upd1 | -27 | -27
part_upd1 | -24 | -24
part_upd1 | -21 | -21
part_upd1 | -18 | -18
part_upd1 | -15 | -15
part_upd1 | -12 | -12
part_upd2 | -9 | -9
part_upd2 | -6 | -6
part_upd2 | -3 | -3
part_upd3 | 3 | 3
part_upd3 | 6 | 6
part_upd3 | 9 | 9
part_upd4 | 12 | 12
part_upd4 | 15 | 15
part_upd4 | 18 | 18
part_upd4 | 21 | 21
part_upd4 | 24 | 24
part_upd4 | 27 | 27
part_upd4 | 30 | 30

 part_upd4 |  10 |   0
 part_upd4 |  10 |   0
 part_upd4 |  10 |   0
 part_upd4 |  10 |   0
 part_upd4 |  10 |   0
 part_upd4 |  10 |   0
 part_upd4 |  10 |   0
(27 rows)

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

#127Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Rajkumar Raghuwanshi (#126)
Re: UPDATE of partition key

On 25 July 2017 at 15:02, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:

On Mon, Jul 24, 2017 at 11:23 AM, Amit Khandekar <amitdkhan.pg@gmail.com>
wrote:

Attached update-partition-key_v13.patch now contains this
make_resultrels_ordered.patch changes.

I have applied attach patch and got below observation.

Observation: if the join produces multiple output rows for a given row to be
modified, I am seeing that it both updates the row and inserts new rows into
the target table; hence, after the update, the total row count of the table
has increased.

Thanks for catching this Rajkumar.

So after the row to be updated has already been moved to another partition,
the next join output row that corresponds to that same moved row finds the
row already deleted, so ExecDelete()=>heap_delete() gets
HeapTupleSelfUpdated, and this case is not handled. So even though
ExecDelete() finds that the row is already deleted, we still call
ExecInsert(), and a new row is inserted. ExecDelete() should indicate that
the row is already deleted. In the existing patch, there is a parameter
concurrently_deleted for ExecDelete() which indicates that the row was
concurrently deleted. I think we can use this parameter for both of these
purposes, so as to avoid calling ExecInsert() in both scenarios. Will work
on a patch.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#128Rajkumar Raghuwanshi
rajkumar.raghuwanshi@enterprisedb.com
In reply to: Amit Khandekar (#127)
Re: UPDATE of partition key

On Tue, Jul 25, 2017 at 3:54 PM, Amit Khandekar <amitdkhan.pg@gmail.com>
wrote:

On 25 July 2017 at 15:02, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:

On Mon, Jul 24, 2017 at 11:23 AM, Amit Khandekar <amitdkhan.pg@gmail.com>
wrote:

Attached update-partition-key_v13.patch now contains this
make_resultrels_ordered.patch changes.

I have applied attach patch and got below observation.

Observation: if the join produces multiple output rows for a given row to be
modified, I am seeing that it both updates the row and inserts new rows into
the target table; hence, after the update, the total row count of the table
has increased.

Thanks for catching this Rajkumar.

So after the row to be updated has already been moved to another partition,
the next join output row that corresponds to that same moved row finds the
row already deleted, so ExecDelete()=>heap_delete() gets
HeapTupleSelfUpdated, and this case is not handled. So even though
ExecDelete() finds that the row is already deleted, we still call
ExecInsert(), and a new row is inserted. ExecDelete() should indicate that
the row is already deleted. In the existing patch, there is a parameter
concurrently_deleted for ExecDelete() which indicates that the row was
concurrently deleted. I think we can use this parameter for both of these
purposes, so as to avoid calling ExecInsert() in both scenarios. Will work
on a patch.

Thanks Amit.

Got one more observation: UPDATE ... RETURNING is not working with a
whole-row reference. Please take a look.

postgres=# create table part (a int, b int) partition by range(a);
CREATE TABLE
postgres=# create table part_p1 partition of part for values from
(minvalue) to (0);
CREATE TABLE
postgres=# create table part_p2 partition of part for values from (0) to
(maxvalue);
CREATE TABLE
postgres=# insert into part values (10,1);
INSERT 0 1
postgres=# insert into part values (20,2);
INSERT 0 1
postgres=# update part t1 set a = b returning t1;
ERROR: unexpected whole-row reference found in partition key

#129Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#124)
Re: UPDATE of partition key

On Thu, Jul 13, 2017 at 1:09 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Attached is a WIP patch (make_resultrels_ordered.patch) that generates
the result rels in canonical order. This patch is kept separate from
the update-partition-key patch, and can be applied on master branch.

Hmm, I like the approach you've taken here in general, but I think it
needs cleanup.

+typedef struct ParentChild

This is a pretty generic name. Pick something more specific and informative.

+static List *append_rel_partition_oids(List *rel_list, Relation rel);

One could be forgiven for thinking that this function was just going
to append OIDs, but it actually appends ParentChild structures, so I
think the name needs work.

+List *append_rel_partition_oids(List *rel_list, Relation rel)

Style. Please pgindent your patches.

+#ifdef DEBUG_PRINT_OIDS
+    print_oids(*leaf_part_oids);
+#endif

I'd just rip out this debug stuff once you've got this working, but if
we keep it, it certainly can't have a name as generic as print_oids()
when it's actually doing something with a list of ParentChild
structures. Also, it prints names, not OIDs. And DEBUG_PRINT_OIDS is
no good for the same reasons.

+    if (RelationGetPartitionDesc(rel))
+        walker->rels_list = append_rel_partition_oids(walker->rels_list, rel);

Every place that calls append_rel_partition_oids guards that call with
if (RelationGetPartitionDesc(...)). It seems to me that it would be
simpler to remove those tests and instead just replace the
Assert(partdesc) inside that function with if (!partdesc) return;

Is there any real benefit in this "walker" interface? It looks to me
like it might be simpler to just change things around so that it
returns a list of OIDs, like find_all_inheritors, but generated
differently. Then if you want bound-ordering rather than
OID-ordering, you just do this:

list_free(inhOids);
inhOids = get_partition_oids_in_bound_order(rel);

That'd remove the need for some if/then logic as you've currently got
in get_next_child().

+    is_partitioned_resultrel =
+        (oldrelation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE
+         && rti == parse->resultRelation);

I suspect this isn't correct for a table that contains wCTEs, because
there would in that case be multiple result relations.

I think we should always expand in bound order rather than only when
it's a result relation. I think for partition-wise join, we're going
to want to do it this way for all relations in the query, or at least
for all relations in the query that might possibly be able to
participate in a partition-wise join. If there are multiple cases
that are going to need this ordering, it's hard for me to accept the
idea that it's worth the complexity of trying to keep track of when we
expanded things in one order vs. another. There are other
applications of having things in bound order too, like MergeAppend ->
Append strength-reduction (which might not be legal anyway if there
are list partitions with multiple, non-contiguous list bounds or if
any NULL partition doesn't end up in the right place in the order, but
there will be lots of cases where it can work).

On another note, did you do anything about the suggestion Thomas made
in /messages/by-id/CAEepm=3sc_j1zwqDYrbU4DTfX5rHcaMNNuaXRKWZFgt9m23OcA@mail.gmail.com
?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#130Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Robert Haas (#129)
2 attachment(s)
Re: UPDATE of partition key

On 2017/07/26 6:07, Robert Haas wrote:

On Thu, Jul 13, 2017 at 1:09 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Attached is a WIP patch (make_resultrels_ordered.patch) that generates
the result rels in canonical order. This patch is kept separate from
the update-partition-key patch, and can be applied on master branch.

I suspect this isn't correct for a table that contains wCTEs, because
there would in that case be multiple result relations.

I think we should always expand in bound order rather than only when
it's a result relation. I think for partition-wise join, we're going
to want to do it this way for all relations in the query, or at least
for all relations in the query that might possibly be able to
participate in a partition-wise join. If there are multiple cases
that are going to need this ordering, it's hard for me to accept the
idea that it's worth the complexity of trying to keep track of when we
expanded things in one order vs. another. There are other
applications of having things in bound order too, like MergeAppend ->
Append strength-reduction (which might not be legal anyway if there
are list partitions with multiple, non-contiguous list bounds or if
any NULL partition doesn't end up in the right place in the order, but
there will be lots of cases where it can work).

Sorry for responding this late to Amit's make_resultrels_ordered
patch itself, but I agree that we should teach the planner to *always*
expand partitioned tables in the partition bound order.

When working on something else, I ended up writing a prerequisite patch
that refactors RelationGetPartitionDispatchInfo() to not be too tied to
its current usage for tuple-routing, so that it can now be used in the
planner (for example, in expand_inherited_rtentry(), instead of
find_all_inheritors()). If we could adopt that patch, we can focus on the
update partition row movement issues more closely on this thread, rather
than the concerns about the order that planner puts partitions into.

I checked that we get the same result relation order with both the
patches, but I would like to highlight a notable difference here between
the approaches taken by our patches. In my patch, I have now taught
RelationGetPartitionDispatchInfo() to lock *only* the partitioned tables
in the tree, because we need to look at its partition descriptor to
collect partition OIDs and bounds. We can defer locking (and opening the
relation descriptor of) leaf partitions to a point where planner has
determined that the partition will be accessed after all (not pruned),
which will be done in a separate patch of course.

Sorry again that I didn't share this patch sooner.

Thanks,
Amit

Attachments:

0001-Decouple-RelationGetPartitionDispatchInfo-from-execu.patch (text/plain; charset=UTF-8)
From 7a22aedc7c1ae8e1568745c99cf1d11d42cf59d9 Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Mon, 24 Jul 2017 18:59:57 +0900
Subject: [PATCH 1/3] Decouple RelationGetPartitionDispatchInfo() from executor

Currently it and the structure it generates viz. PartitionDispatch
objects are too coupled with the executor's tuple-routing code.  That
include locking considerations and responsibilities for releasing
relcache references, etc.  That makes it useless for usage in other
places such as during planning.
---
 src/backend/catalog/partition.c        | 326 +++++++++++++++++----------------
 src/backend/commands/copy.c            |  35 ++--
 src/backend/executor/execMain.c        | 156 ++++++++++++++--
 src/backend/executor/nodeModifyTable.c |  29 ++-
 src/include/catalog/partition.h        |  53 ++----
 src/include/executor/executor.h        |   4 +-
 src/include/nodes/execnodes.h          |  53 +++++-
 7 files changed, 409 insertions(+), 247 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index e20ddce2db..e07701d5e8 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -105,6 +105,24 @@ typedef struct PartitionRangeBound
 	bool		lower;			/* this is the lower (vs upper) bound */
 } PartitionRangeBound;
 
+/*-----------------------
+ * PartitionDispatchData - information of partitions of one partitioned table
+ *						   in a partition tree
+ *
+ *	partkey		Partition key of the table
+ *	partdesc	Partition descriptor of the table
+ *	indexes		Array with partdesc->nparts members (for details on what the
+ *				individual value represents, see the comments in
+ *				RelationGetPartitionDispatchInfo())
+ *-----------------------
+ */
+typedef struct PartitionDispatchData
+{
+	PartitionKey	partkey;	/* Points into the table's relcache entry */
+	PartitionDesc	partdesc;	/* Ditto */
+	int			   *indexes;
+} PartitionDispatchData;
+
 static int32 qsort_partition_list_value_cmp(const void *a, const void *b,
 							   void *arg);
 static int32 qsort_partition_rbound_cmp(const void *a, const void *b,
@@ -972,178 +990,167 @@ get_partition_qual_relid(Oid relid)
 }
 
 /*
- * Append OIDs of rel's partitions to the list 'partoids' and for each OID,
- * append pointer rel to the list 'parents'.
- */
-#define APPEND_REL_PARTITION_OIDS(rel, partoids, parents) \
-	do\
-	{\
-		int		i;\
-		for (i = 0; i < (rel)->rd_partdesc->nparts; i++)\
-		{\
-			(partoids) = lappend_oid((partoids), (rel)->rd_partdesc->oids[i]);\
-			(parents) = lappend((parents), (rel));\
-		}\
-	} while(0)
-
-/*
  * RelationGetPartitionDispatchInfo
- *		Returns information necessary to route tuples down a partition tree
+ *		Returns necessary information for each partition in the partition
+ *		tree rooted at rel
  *
- * All the partitions will be locked with lockmode, unless it is NoLock.
- * A list of the OIDs of all the leaf partitions of rel is returned in
- * *leaf_part_oids.
+ * Information returned includes the following: *ptinfos contains a list of
+ * PartitionedTableInfo objects, one for each partitioned table (with at least
+ * one member, that is, one for the root partitioned table), *leaf_part_oids
+ * contains a list of the OIDs of all the leaf partitions.
+ *
+ * Note that we lock only those partitions that are partitioned tables, because
+ * we need to look at its relcache entry to get its PartitionKey and its
+ * PartitionDesc. It's the caller's responsibility to lock the leaf partitions
+ * that will actually be accessed during a given query.
  */
-PartitionDispatch *
+void
 RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
-								 int *num_parted, List **leaf_part_oids)
+								 List **ptinfos, List **leaf_part_oids)
 {
-	PartitionDispatchData **pd;
-	List	   *all_parts = NIL,
-			   *all_parents = NIL,
-			   *parted_rels,
-			   *parted_rel_parents;
+	List	   *all_parts,
+			   *all_parents;
 	ListCell   *lc1,
 			   *lc2;
 	int			i,
-				k,
 				offset;
 
 	/*
-	 * Lock partitions and make a list of the partitioned ones to prepare
-	 * their PartitionDispatch objects below.
+	 * We rely on the relcache to traverse the partition tree, building
+	 * both the leaf partition OIDs list and the PartitionedTableInfo list.
+	 * Starting with the root partitioned table for which we already have the
+	 * relcache entry, we look at its partition descriptor to get the
+	 * partition OIDs.  For partitions that are themselves partitioned tables,
+	 * we get their relcache entries after locking them with lockmode and
+	 * queue their partitions to be looked at later.  Leaf partitions are
+	 * added to the result list without locking.  For each partitioned table,
+	 * we build a PartitionedTableInfo object and add it to the other result
+	 * list.
 	 *
-	 * Cannot use find_all_inheritors() here, because then the order of OIDs
-	 * in parted_rels list would be unknown, which does not help, because we
-	 * assign indexes within individual PartitionDispatch in an order that is
-	 * predetermined (determined by the order of OIDs in individual partition
-	 * descriptors).
+	 * Since RelationBuildPartitionDescriptor() puts partitions in a canonical
+	 * order determined by comparing partition bounds, we can rely that
+	 * concurrent backends see the partitions in the same order, ensuring that
+	 * there are no deadlocks when locking the partitions.
 	 */
-	*num_parted = 1;
-	parted_rels = list_make1(rel);
-	/* Root partitioned table has no parent, so NULL for parent */
-	parted_rel_parents = list_make1(NULL);
-	APPEND_REL_PARTITION_OIDS(rel, all_parts, all_parents);
+	i = offset = 0;
+	*ptinfos = *leaf_part_oids = NIL;
+
+	/* Start with the root table. */
+	all_parts = list_make1_oid(RelationGetRelid(rel));
+	all_parents = list_make1_oid(InvalidOid);
 	forboth(lc1, all_parts, lc2, all_parents)
 	{
-		Relation	partrel = heap_open(lfirst_oid(lc1), lockmode);
-		Relation	parent = lfirst(lc2);
-		PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
+		Oid		partrelid = lfirst_oid(lc1);
+		Oid		parentrelid = lfirst_oid(lc2);
 
-		/*
-		 * If this partition is a partitioned table, add its children to the
-		 * end of the list, so that they are processed as well.
-		 */
-		if (partdesc)
+		if (get_rel_relkind(partrelid) == RELKIND_PARTITIONED_TABLE)
 		{
-			(*num_parted)++;
-			parted_rels = lappend(parted_rels, partrel);
-			parted_rel_parents = lappend(parted_rel_parents, parent);
-			APPEND_REL_PARTITION_OIDS(partrel, all_parts, all_parents);
-		}
-		else
-			heap_close(partrel, NoLock);
+			int		j,
+					k;
+			Relation		partrel;
+			PartitionKey	partkey;
+			PartitionDesc	partdesc;
+			PartitionedTableInfo   *ptinfo;
+			PartitionDispatch		pd;
+
+			if (partrelid != RelationGetRelid(rel))
+				partrel = heap_open(partrelid, lockmode);
+			else
+				partrel = rel;
 
-		/*
-		 * We keep the partitioned ones open until we're done using the
-		 * information being collected here (for example, see
-		 * ExecEndModifyTable).
-		 */
-	}
+			partkey = RelationGetPartitionKey(partrel);
+			partdesc = RelationGetPartitionDesc(partrel);
+
+			ptinfo = (PartitionedTableInfo *)
+									palloc0(sizeof(PartitionedTableInfo));
+			ptinfo->relid = partrelid;
+			ptinfo->parentid = parentrelid;
+
+			ptinfo->pd = pd = (PartitionDispatchData *)
+									palloc0(sizeof(PartitionDispatchData));
+			pd->partkey = partkey;
 
-	/*
-	 * We want to create two arrays - one for leaf partitions and another for
-	 * partitioned tables (including the root table and internal partitions).
-	 * While we only create the latter here, leaf partition array of suitable
-	 * objects (such as, ResultRelInfo) is created by the caller using the
-	 * list of OIDs we return.  Indexes into these arrays get assigned in a
-	 * breadth-first manner, whereby partitions of any given level are placed
-	 * consecutively in the respective arrays.
-	 */
-	pd = (PartitionDispatchData **) palloc(*num_parted *
-										   sizeof(PartitionDispatchData *));
-	*leaf_part_oids = NIL;
-	i = k = offset = 0;
-	forboth(lc1, parted_rels, lc2, parted_rel_parents)
-	{
-		Relation	partrel = lfirst(lc1);
-		Relation	parent = lfirst(lc2);
-		PartitionKey partkey = RelationGetPartitionKey(partrel);
-		TupleDesc	tupdesc = RelationGetDescr(partrel);
-		PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
-		int			j,
-					m;
-
-		pd[i] = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
-		pd[i]->reldesc = partrel;
-		pd[i]->key = partkey;
-		pd[i]->keystate = NIL;
-		pd[i]->partdesc = partdesc;
-		if (parent != NULL)
-		{
 			/*
-			 * For every partitioned table other than root, we must store a
-			 * tuple table slot initialized with its tuple descriptor and a
-			 * tuple conversion map to convert a tuple from its parent's
-			 * rowtype to its own. That is to make sure that we are looking at
-			 * the correct row using the correct tuple descriptor when
-			 * computing its partition key for tuple routing.
-			 */
-			pd[i]->tupslot = MakeSingleTupleTableSlot(tupdesc);
-			pd[i]->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
-												   tupdesc,
-												   gettext_noop("could not convert row type"));
-		}
-		else
-		{
-			/* Not required for the root partitioned table */
-			pd[i]->tupslot = NULL;
-			pd[i]->tupmap = NULL;
-		}
-		pd[i]->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
+			 * Pin the partition descriptor before stashing references to
+			 * the information it contains into this PartitionDispatch object.
+			 *
+			PinPartitionDesc(partdesc);*/
+			pd->partdesc = partdesc;
 
-		/*
-		 * Indexes corresponding to the internal partitions are multiplied by
-		 * -1 to distinguish them from those of leaf partitions.  Encountering
-		 * an index >= 0 means we found a leaf partition, which is immediately
-		 * returned as the partition we are looking for.  A negative index
-		 * means we found a partitioned table, whose PartitionDispatch object
-		 * is located at the above index multiplied back by -1.  Using the
-		 * PartitionDispatch object, search is continued further down the
-		 * partition tree.
-		 */
-		m = 0;
-		for (j = 0; j < partdesc->nparts; j++)
-		{
-			Oid			partrelid = partdesc->oids[j];
+			/*
+			 * The values contained in the following array correspond to
+			 * indexes of this table's partitions in the global sequence of
+			 * all the partitions contained in the partition tree rooted at
+			 * rel, traversed in a breadth-first manner.  The values are
+			 * chosen such that leaf partitions can be distinguished from
+			 * the non-leaf partitions, because they are returned to the
+			 * caller in separate structures from where they will be
+			 * accessed.  The way that's done is described below:
+			 *
+			 * Leaf partition OIDs are put into the global leaf_part_oids
+			 * list, and for each one, the value stored is its ordinal
+			 * position in the list minus 1.
+			 *
+			 * PartitionedTableInfo objects corresponding to partitions that
+			 * are partitioned tables are put into the global ptinfos list,
+			 * and for each one, the value stored is its ordinal position in
+			 * the list multiplied by -1.
+			 *
+			 * So, while looking at the values in the indexes array, a zero
+			 * or positive value denotes a leaf partition; otherwise, it's a
+			 * partitioned table.
+			 */
+			pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
 
-			if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
-			{
-				*leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
-				pd[i]->indexes[j] = k++;
-			}
-			else
+			k = 0;
+			for (j = 0; j < partdesc->nparts; j++)
 			{
+				Oid			partrelid = partdesc->oids[j];
+
 				/*
-				 * offset denotes the number of partitioned tables of upper
-				 * levels including those of the current level.  Any partition
-				 * of this table must belong to the next level and hence will
-				 * be placed after the last partitioned table of this level.
+				 * Queue this partition so that it will be processed later
+				 * by the outer loop.
 				 */
-				pd[i]->indexes[j] = -(1 + offset + m);
-				m++;
+				all_parts = lappend_oid(all_parts, partrelid);
+				all_parents = lappend_oid(all_parents,
+										  RelationGetRelid(partrel));
+
+				if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
+				{
+					*leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
+					pd->indexes[j] = i++;
+				}
+				else
+				{
+					/*
+					 * offset denotes the number of partitioned tables that
+					 * we have already processed.  k counts the number of
+					 * partitions of this table that were found to be
+					 * partitioned tables.
+					 */
+					pd->indexes[j] = -(1 + offset + k);
+					k++;
+				}
 			}
-		}
-		i++;
 
-		/*
-		 * This counts the number of partitioned tables at upper levels
-		 * including those of the current level.
-		 */
-		offset += m;
+			offset += k;
+
+			/*
+			 * Release the relation descriptor.  The lock that we hold on
+			 * the table will keep the PartitionDesc that its RelationData
+			 * points to intact, a pointer to which we hope to keep valid
+			 * through this transaction's commit.
+			 * (XXX - how true is that?)
+			 */
+			if (partrel != rel)
+				heap_close(partrel, NoLock);
+
+			*ptinfos = lappend(*ptinfos, ptinfo);
+		}
 	}
 
-	return pd;
+	Assert(i == list_length(*leaf_part_oids));
+	Assert((offset + 1) == list_length(*ptinfos));
 }
 
 /* Module-local functions */
@@ -1855,7 +1862,7 @@ generate_partition_qual(Relation rel)
  * ----------------
  */
 void
-FormPartitionKeyDatum(PartitionDispatch pd,
+FormPartitionKeyDatum(PartitionKeyInfo *keyinfo,
 					  TupleTableSlot *slot,
 					  EState *estate,
 					  Datum *values,
@@ -1864,20 +1871,21 @@ FormPartitionKeyDatum(PartitionDispatch pd,
 	ListCell   *partexpr_item;
 	int			i;
 
-	if (pd->key->partexprs != NIL && pd->keystate == NIL)
+	if (keyinfo->key->partexprs != NIL && keyinfo->keystate == NIL)
 	{
 		/* Check caller has set up context correctly */
 		Assert(estate != NULL &&
 			   GetPerTupleExprContext(estate)->ecxt_scantuple == slot);
 
 		/* First time through, set up expression evaluation state */
-		pd->keystate = ExecPrepareExprList(pd->key->partexprs, estate);
+		keyinfo->keystate = ExecPrepareExprList(keyinfo->key->partexprs,
+												estate);
 	}
 
-	partexpr_item = list_head(pd->keystate);
-	for (i = 0; i < pd->key->partnatts; i++)
+	partexpr_item = list_head(keyinfo->keystate);
+	for (i = 0; i < keyinfo->key->partnatts; i++)
 	{
-		AttrNumber	keycol = pd->key->partattrs[i];
+		AttrNumber	keycol = keyinfo->key->partattrs[i];
 		Datum		datum;
 		bool		isNull;
 
@@ -1914,13 +1922,13 @@ FormPartitionKeyDatum(PartitionDispatch pd,
  * the latter case.
  */
 int
-get_partition_for_tuple(PartitionDispatch *pd,
+get_partition_for_tuple(PartitionTupleRoutingInfo **ptrinfos,
 						TupleTableSlot *slot,
 						EState *estate,
-						PartitionDispatchData **failed_at,
+						PartitionTupleRoutingInfo **failed_at,
 						TupleTableSlot **failed_slot)
 {
-	PartitionDispatch parent;
+	PartitionTupleRoutingInfo *parent;
 	Datum		values[PARTITION_MAX_KEYS];
 	bool		isnull[PARTITION_MAX_KEYS];
 	int			cur_offset,
@@ -1931,11 +1939,11 @@ get_partition_for_tuple(PartitionDispatch *pd,
 	TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
 
 	/* start with the root partitioned table */
-	parent = pd[0];
+	parent = ptrinfos[0];
 	while (true)
 	{
-		PartitionKey key = parent->key;
-		PartitionDesc partdesc = parent->partdesc;
+		PartitionKey  key = parent->pd->partkey;
+		PartitionDesc partdesc = parent->pd->partdesc;
 		TupleTableSlot *myslot = parent->tupslot;
 		TupleConversionMap *map = parent->tupmap;
 
@@ -1967,7 +1975,7 @@ get_partition_for_tuple(PartitionDispatch *pd,
 		 * So update ecxt_scantuple accordingly.
 		 */
 		ecxt->ecxt_scantuple = slot;
-		FormPartitionKeyDatum(parent, slot, estate, values, isnull);
+		FormPartitionKeyDatum(parent->keyinfo, slot, estate, values, isnull);
 
 		if (key->strategy == PARTITION_STRATEGY_RANGE)
 		{
@@ -2038,13 +2046,13 @@ get_partition_for_tuple(PartitionDispatch *pd,
 			*failed_slot = slot;
 			break;
 		}
-		else if (parent->indexes[cur_index] >= 0)
+		else if (parent->pd->indexes[cur_index] >= 0)
 		{
-			result = parent->indexes[cur_index];
+			result = parent->pd->indexes[cur_index];
 			break;
 		}
 		else
-			parent = pd[-parent->indexes[cur_index]];
+			parent = ptrinfos[-parent->pd->indexes[cur_index]];
 	}
 
 error_exit:
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 53e296559a..b3de3de454 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -165,8 +165,8 @@ typedef struct CopyStateData
 	bool		volatile_defexprs;	/* is any of defexprs volatile? */
 	List	   *range_table;
 
-	PartitionDispatch *partition_dispatch_info;
-	int			num_dispatch;	/* Number of entries in the above array */
+	PartitionTupleRoutingInfo **ptrinfos;
+	int			num_parted;		/* Number of entries in the above array */
 	int			num_partitions; /* Number of members in the following arrays */
 	ResultRelInfo *partitions;	/* Per partition result relation */
 	TupleConversionMap **partition_tupconv_maps;
@@ -1425,7 +1425,7 @@ BeginCopy(ParseState *pstate,
 		/* Initialize state for CopyFrom tuple routing. */
 		if (is_from && rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 		{
-			PartitionDispatch *partition_dispatch_info;
+			PartitionTupleRoutingInfo **ptrinfos;
 			ResultRelInfo *partitions;
 			TupleConversionMap **partition_tupconv_maps;
 			TupleTableSlot *partition_tuple_slot;
@@ -1434,13 +1434,13 @@ BeginCopy(ParseState *pstate,
 
 			ExecSetupPartitionTupleRouting(rel,
 										   1,
-										   &partition_dispatch_info,
+										   &ptrinfos,
 										   &partitions,
 										   &partition_tupconv_maps,
 										   &partition_tuple_slot,
 										   &num_parted, &num_partitions);
-			cstate->partition_dispatch_info = partition_dispatch_info;
-			cstate->num_dispatch = num_parted;
+			cstate->ptrinfos = ptrinfos;
+			cstate->num_parted = num_parted;
 			cstate->partitions = partitions;
 			cstate->num_partitions = num_partitions;
 			cstate->partition_tupconv_maps = partition_tupconv_maps;
@@ -2495,7 +2495,7 @@ CopyFrom(CopyState cstate)
 	if ((resultRelInfo->ri_TrigDesc != NULL &&
 		 (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
 		  resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
-		cstate->partition_dispatch_info != NULL ||
+		cstate->ptrinfos != NULL ||
 		cstate->volatile_defexprs)
 	{
 		useHeapMultiInsert = false;
@@ -2573,7 +2573,7 @@ CopyFrom(CopyState cstate)
 		ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 
 		/* Determine the partition to heap_insert the tuple into */
-		if (cstate->partition_dispatch_info)
+		if (cstate->ptrinfos)
 		{
 			int			leaf_part_index;
 			TupleConversionMap *map;
@@ -2587,7 +2587,7 @@ CopyFrom(CopyState cstate)
 			 * partition, respectively.
 			 */
 			leaf_part_index = ExecFindPartition(resultRelInfo,
-												cstate->partition_dispatch_info,
+												cstate->ptrinfos,
 												slot,
 												estate);
 			Assert(leaf_part_index >= 0 &&
@@ -2818,23 +2818,20 @@ CopyFrom(CopyState cstate)
 
 	ExecCloseIndices(resultRelInfo);
 
-	/* Close all the partitioned tables, leaf partitions, and their indices */
-	if (cstate->partition_dispatch_info)
+	/* Close all the leaf partitions and their indices */
+	if (cstate->ptrinfos)
 	{
 		int			i;
 
 		/*
-		 * Remember cstate->partition_dispatch_info[0] corresponds to the root
-		 * partitioned table, which we must not try to close, because it is
-		 * the main target table of COPY that will be closed eventually by
-		 * DoCopy().  Also, tupslot is NULL for the root partitioned table.
+		 * cstate->ptrinfos[0] corresponds to the root partitioned table, for
+		 * which we didn't create a tupslot.
 		 */
-		for (i = 1; i < cstate->num_dispatch; i++)
+		for (i = 1; i < cstate->num_parted; i++)
 		{
-			PartitionDispatch pd = cstate->partition_dispatch_info[i];
+			PartitionTupleRoutingInfo *ptrinfo = cstate->ptrinfos[i];
 
-			heap_close(pd->reldesc, NoLock);
-			ExecDropSingleTupleTableSlot(pd->tupslot);
+			ExecDropSingleTupleTableSlot(ptrinfo->tupslot);
 		}
 		for (i = 0; i < cstate->num_partitions; i++)
 		{
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 78cbcd1a32..428172ae8e 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -3214,8 +3214,8 @@ EvalPlanQualEnd(EPQState *epqstate)
  * tuple routing for partitioned tables
  *
  * Output arguments:
- * 'pd' receives an array of PartitionDispatch objects with one entry for
- *		every partitioned table in the partition tree
+ * 'ptrinfos' receives an array of PartitionTupleRoutingInfo objects with one
+ *		entry for each partitioned table in the partition tree
  * 'partitions' receives an array of ResultRelInfo objects with one entry for
  *		every leaf partition in the partition tree
  * 'tup_conv_maps' receives an array of TupleConversionMap objects with one
@@ -3237,7 +3237,7 @@ EvalPlanQualEnd(EPQState *epqstate)
 void
 ExecSetupPartitionTupleRouting(Relation rel,
 							   Index resultRTindex,
-							   PartitionDispatch **pd,
+							   PartitionTupleRoutingInfo ***ptrinfos,
 							   ResultRelInfo **partitions,
 							   TupleConversionMap ***tup_conv_maps,
 							   TupleTableSlot **partition_tuple_slot,
@@ -3245,13 +3245,135 @@ ExecSetupPartitionTupleRouting(Relation rel,
 {
 	TupleDesc	tupDesc = RelationGetDescr(rel);
 	List	   *leaf_parts;
+	List	   *ptinfos = NIL;
 	ListCell   *cell;
 	int			i;
 	ResultRelInfo *leaf_part_rri;
+	Relation	parent;
 
 	/* Get the tuple-routing information and lock partitions */
-	*pd = RelationGetPartitionDispatchInfo(rel, RowExclusiveLock, num_parted,
-										   &leaf_parts);
+	RelationGetPartitionDispatchInfo(rel, RowExclusiveLock,
+									 &ptinfos, &leaf_parts);
+
+	/*
+	 * The ptinfos list contains PartitionedTableInfo objects for all the
+	 * partitioned tables in the partition tree.  From these, we construct
+	 * an array of PartitionTupleRoutingInfo objects to be used during
+	 * tuple routing.
+	 */
+	*num_parted = list_length(ptinfos);
+	*ptrinfos = (PartitionTupleRoutingInfo **) palloc0(*num_parted *
+										sizeof(PartitionTupleRoutingInfo *));
+
+	/*
+	 * Build the array, freeing the List cells as we go through them
+	 * (an open-coded list_free).
+	 */
+	i = 0;
+	cell = list_head(ptinfos);
+	parent = NULL;
+	while (cell)
+	{
+		ListCell   *tmp = cell;
+		PartitionedTableInfo *ptinfo = lfirst(tmp),
+							 *next_ptinfo = NULL;
+		Relation		partrel;
+		PartitionTupleRoutingInfo *ptrinfo;
+
+		if (lnext(tmp))
+			next_ptinfo = lfirst(lnext(tmp));
+
+		/*
+		 * RelationGetPartitionDispatchInfo() already locked the partitioned
+		 * tables.
+		 */
+		if (ptinfo->relid != RelationGetRelid(rel))
+			partrel = heap_open(ptinfo->relid, NoLock);
+		else
+			partrel = rel;
+
+		ptrinfo = (PartitionTupleRoutingInfo *)
+							palloc0(sizeof(PartitionTupleRoutingInfo));
+		ptrinfo->relid = ptinfo->relid;
+
+		/* Stash a reference to this PartitionDispatch. */
+		ptrinfo->pd = ptinfo->pd;
+
+		/* State for extracting partition key from tuples will go here. */
+		ptrinfo->keyinfo = (PartitionKeyInfo *)
+								palloc0(sizeof(PartitionKeyInfo));
+		ptrinfo->keyinfo->key = RelationGetPartitionKey(partrel);
+		ptrinfo->keyinfo->keystate = NIL;
+
+		/*
+		 * For every partitioned table other than root, we must store a tuple
+		 * table slot initialized with its tuple descriptor and a tuple
+		 * conversion map to convert a tuple from its parent's rowtype to its
+		 * own.  That is to make sure that we are looking at the correct row
+		 * using the correct tuple descriptor when computing its partition key
+		 * for tuple routing.
+		 */
+		if (ptinfo->parentid != InvalidOid)
+		{
+			TupleDesc	tupdesc = RelationGetDescr(partrel);
+
+			/* Open the parent relation descriptor if not already done. */
+			if (ptinfo->parentid == RelationGetRelid(rel))
+			{
+				parent = rel;
+			}
+			else if (parent == NULL)
+			{
+				/* Locked by RelationGetPartitionDispatchInfo(). */
+				parent = heap_open(ptinfo->parentid, NoLock);
+			}
+
+			ptrinfo->tupslot = MakeSingleTupleTableSlot(tupdesc);
+			ptrinfo->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
+													 tupdesc,
+								  gettext_noop("could not convert row type"));
+
+			/*
+			 * Close the parent descriptor if the next partitioned table in
+			 * the list is not a sibling of this one, because it will have a
+			 * different parent.
+			 */
+			if (parent && parent != rel && (next_ptinfo == NULL ||
+				next_ptinfo->parentid != ptinfo->parentid))
+			{
+				heap_close(parent, NoLock);
+				parent = NULL;
+			}
+
+			/*
+			 * Release the relation descriptor.  The lock that we hold on
+			 * the table will keep the PartitionDesc that its RelationData
+			 * points to intact, a pointer to which we hope to keep valid
+			 * through this transaction's commit.
+			 * (XXX - how true is that?)
+			 */
+			if (partrel != rel)
+				heap_close(partrel, NoLock);
+		}
+		else
+		{
+			/* Not required for the root partitioned table */
+			ptrinfo->tupslot = NULL;
+			ptrinfo->tupmap = NULL;
+		}
+
+		(*ptrinfos)[i++] = ptrinfo;
+
+		/* Free the ListCell. */
+		cell = lnext(cell);
+		pfree(tmp);
+	}
+
+	/* Free the List itself. */
+	if (ptinfos)
+		pfree(ptinfos);
+
+	/* For leaf partitions, we build ResultRelInfos and TupleConversionMaps. */
 	*num_partitions = list_length(leaf_parts);
 	*partitions = (ResultRelInfo *) palloc(*num_partitions *
 										   sizeof(ResultRelInfo));
@@ -3274,11 +3396,11 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		TupleDesc	part_tupdesc;
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in *partitions are eventually
-		 * closed by the caller.
+		 * RelationGetPartitionDispatchInfo didn't lock the leaf partitions,
+		 * so lock them here.  Note that each of the relations in *partitions
+		 * is eventually closed (when the plan is shut down, for instance).
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		partrel = heap_open(lfirst_oid(cell), RowExclusiveLock);
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
@@ -3291,7 +3413,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		 * partition from the parent's type to the partition's.
 		 */
 		(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
-													 gettext_noop("could not convert row type"));
+								 gettext_noop("could not convert row type"));
 
 		InitResultRelInfo(leaf_part_rri,
 						  partrel,
@@ -3325,11 +3447,13 @@ ExecSetupPartitionTupleRouting(Relation rel,
  * by get_partition_for_tuple() unchanged.
  */
 int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
-				  TupleTableSlot *slot, EState *estate)
+ExecFindPartition(ResultRelInfo *resultRelInfo,
+				  PartitionTupleRoutingInfo **ptrinfos,
+				  TupleTableSlot *slot,
+				  EState *estate)
 {
 	int			result;
-	PartitionDispatchData *failed_at;
+	PartitionTupleRoutingInfo *failed_at;
 	TupleTableSlot *failed_slot;
 
 	/*
@@ -3339,7 +3463,7 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	if (resultRelInfo->ri_PartitionCheck)
 		ExecPartitionCheck(resultRelInfo, slot, estate);
 
-	result = get_partition_for_tuple(pd, slot, estate,
+	result = get_partition_for_tuple(ptrinfos, slot, estate,
 									 &failed_at, &failed_slot);
 	if (result < 0)
 	{
@@ -3349,9 +3473,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 		char	   *val_desc;
 		ExprContext *ecxt = GetPerTupleExprContext(estate);
 
-		failed_rel = failed_at->reldesc;
+		failed_rel = heap_open(failed_at->relid, NoLock);
 		ecxt->ecxt_scantuple = failed_slot;
-		FormPartitionKeyDatum(failed_at, failed_slot, estate,
+		FormPartitionKeyDatum(failed_at->keyinfo, failed_slot, estate,
 							  key_values, key_isnull);
 		val_desc = ExecBuildSlotPartitionKeyDescription(failed_rel,
 														key_values,
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 77ba15dd90..61e6dfa884 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -277,7 +277,7 @@ ExecInsert(ModifyTableState *mtstate,
 	resultRelInfo = estate->es_result_relation_info;
 
 	/* Determine the partition to heap_insert the tuple into */
-	if (mtstate->mt_partition_dispatch_info)
+	if (mtstate->mt_ptrinfos)
 	{
 		int			leaf_part_index;
 		TupleConversionMap *map;
@@ -291,7 +291,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * respectively.
 		 */
 		leaf_part_index = ExecFindPartition(resultRelInfo,
-											mtstate->mt_partition_dispatch_info,
+											mtstate->mt_ptrinfos,
 											slot,
 											estate);
 		Assert(leaf_part_index >= 0 &&
@@ -1486,7 +1486,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 		int		numResultRelInfos;
 
 		/* Find the set of partitions so that we can find their TupleDescs. */
-		if (mtstate->mt_partition_dispatch_info != NULL)
+		if (mtstate->mt_ptrinfos != NULL)
 		{
 			/*
 			 * For INSERT via partitioned table, so we need TupleDescs based
@@ -1906,7 +1906,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	if (operation == CMD_INSERT &&
 		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		PartitionDispatch *partition_dispatch_info;
+		PartitionTupleRoutingInfo **ptrinfos;
 		ResultRelInfo *partitions;
 		TupleConversionMap **partition_tupconv_maps;
 		TupleTableSlot *partition_tuple_slot;
@@ -1915,13 +1915,13 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 
 		ExecSetupPartitionTupleRouting(rel,
 									   node->nominalRelation,
-									   &partition_dispatch_info,
+									   &ptrinfos,
 									   &partitions,
 									   &partition_tupconv_maps,
 									   &partition_tuple_slot,
 									   &num_parted, &num_partitions);
-		mtstate->mt_partition_dispatch_info = partition_dispatch_info;
-		mtstate->mt_num_dispatch = num_parted;
+		mtstate->mt_ptrinfos = ptrinfos;
+		mtstate->mt_num_parted = num_parted;
 		mtstate->mt_partitions = partitions;
 		mtstate->mt_num_partitions = num_partitions;
 		mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
@@ -2328,19 +2328,16 @@ ExecEndModifyTable(ModifyTableState *node)
 	}
 
 	/*
-	 * Close all the partitioned tables, leaf partitions, and their indices
+	 * Close all the leaf partitions and their indices.
 	 *
-	 * Remember node->mt_partition_dispatch_info[0] corresponds to the root
-	 * partitioned table, which we must not try to close, because it is the
-	 * main target table of the query that will be closed by ExecEndPlan().
-	 * Also, tupslot is NULL for the root partitioned table.
+	 * node->mt_ptrinfos[0] corresponds to the root partitioned table, for
+	 * which we didn't create a tupslot.
 	 */
-	for (i = 1; i < node->mt_num_dispatch; i++)
+	for (i = 1; i < node->mt_num_parted; i++)
 	{
-		PartitionDispatch pd = node->mt_partition_dispatch_info[i];
+		PartitionTupleRoutingInfo *ptrinfo = node->mt_ptrinfos[i];
 
-		heap_close(pd->reldesc, NoLock);
-		ExecDropSingleTupleTableSlot(pd->tupslot);
+		ExecDropSingleTupleTableSlot(ptrinfo->tupslot);
 	}
 	for (i = 0; i < node->mt_num_partitions; i++)
 	{
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index f10879a162..50f5574831 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -39,36 +39,23 @@ typedef struct PartitionDescData
 
 typedef struct PartitionDescData *PartitionDesc;
 
-/*-----------------------
- * PartitionDispatch - information about one partitioned table in a partition
- * hierarchy required to route a tuple to one of its partitions
- *
- *	reldesc		Relation descriptor of the table
- *	key			Partition key information of the table
- *	keystate	Execution state required for expressions in the partition key
- *	partdesc	Partition descriptor of the table
- *	tupslot		A standalone TupleTableSlot initialized with this table's tuple
- *				descriptor
- *	tupmap		TupleConversionMap to convert from the parent's rowtype to
- *				this table's rowtype (when extracting the partition key of a
- *				tuple just before routing it through this table)
- *	indexes		Array with partdesc->nparts members (for details on what
- *				individual members represent, see how they are set in
- *				RelationGetPartitionDispatchInfo())
- *-----------------------
+typedef struct PartitionDispatchData *PartitionDispatch;
+
+/*
+ * Information about one partitioned table in a given partition tree
  */
-typedef struct PartitionDispatchData
+typedef struct PartitionedTableInfo
 {
-	Relation	reldesc;
-	PartitionKey key;
-	List	   *keystate;		/* list of ExprState */
-	PartitionDesc partdesc;
-	TupleTableSlot *tupslot;
-	TupleConversionMap *tupmap;
-	int		   *indexes;
-} PartitionDispatchData;
+	Oid				relid;
+	Oid				parentid;
 
-typedef struct PartitionDispatchData *PartitionDispatch;
+	/*
+	 * This contains information about the bounds of this table's
+	 * partitions and about where individual partitions are placed in the
+	 * global partition tree.
+	 */
+	PartitionDispatch pd;
+} PartitionedTableInfo;
 
 extern void RelationBuildPartitionDesc(Relation relation);
 extern bool partition_bounds_equal(PartitionKey key,
@@ -84,18 +71,18 @@ extern List *map_partition_varattnos(List *expr, int target_varno,
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
 
+extern void RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
+								 List **ptinfos, List **leaf_part_oids);
+
 /* For tuple routing */
-extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
-								 int lockmode, int *num_parted,
-								 List **leaf_part_oids);
-extern void FormPartitionKeyDatum(PartitionDispatch pd,
+extern void FormPartitionKeyDatum(PartitionKeyInfo *keyinfo,
 					  TupleTableSlot *slot,
 					  EState *estate,
 					  Datum *values,
 					  bool *isnull);
-extern int get_partition_for_tuple(PartitionDispatch *pd,
+extern int get_partition_for_tuple(PartitionTupleRoutingInfo **pd,
 						TupleTableSlot *slot,
 						EState *estate,
-						PartitionDispatchData **failed_at,
+						PartitionTupleRoutingInfo **failed_at,
 						TupleTableSlot **failed_slot);
 #endif							/* PARTITION_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 59c28b709e..6e5f55c06d 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -208,13 +208,13 @@ extern void EvalPlanQualSetTuple(EPQState *epqstate, Index rti,
 extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
 extern void ExecSetupPartitionTupleRouting(Relation rel,
 							   Index resultRTindex,
-							   PartitionDispatch **pd,
+							   PartitionTupleRoutingInfo ***ptrinfos,
 							   ResultRelInfo **partitions,
 							   TupleConversionMap ***tup_conv_maps,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions);
 extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
-				  PartitionDispatch *pd,
+				  PartitionTupleRoutingInfo **ptrinfos,
 				  TupleTableSlot *slot,
 				  EState *estate);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 85fac8ab91..e7bd8617bd 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -414,6 +414,55 @@ typedef struct ResultRelInfo
 	Relation	ri_PartitionRoot;
 } ResultRelInfo;
 
+/* Forward declarations, to avoid including other headers */
+typedef struct PartitionKeyData *PartitionKey;
+typedef struct PartitionDispatchData *PartitionDispatch;
+
+/*
+ * PartitionKeyInfoData - execution state for the partition key of a
+ *						  partitioned table
+ *
+ * keystate is the execution state required for expressions contained in the
+ * partition key.  It is NIL until initialized by FormPartitionKeyDatum() if
+ * and when it is called; for example, during tuple routing through a given
+ * partitioned table.
+ */
+typedef struct PartitionKeyInfo
+{
+	PartitionKey	key;		/* Points into the table's relcache entry */
+	List		   *keystate;
+} PartitionKeyInfo;
+
+/*
+ * PartitionTupleRoutingInfo - information required for tuple-routing
+ *							   through one partitioned table in a partition
+ *							   tree
+ */
+typedef struct PartitionTupleRoutingInfo
+{
+	/* OID of the table */
+	Oid				relid;
+
+	/* Information about the table's partitions */
+	PartitionDispatch	pd;
+
+	/* See comment above the definition of PartitionKeyInfo */
+	PartitionKeyInfo   *keyinfo;
+
+	/*
+	 * A standalone TupleTableSlot initialized with this table's tuple
+	 * descriptor
+	 */
+	TupleTableSlot *tupslot;
+
+	/*
+	 * TupleConversionMap to convert from the parent's rowtype to this table's
+	 * rowtype (when extracting the partition key of a tuple just before
+	 * routing it through this table)
+	 */
+	TupleConversionMap *tupmap;
+} PartitionTupleRoutingInfo;
+
 /* ----------------
  *	  EState information
  *
@@ -954,9 +1003,9 @@ typedef struct ModifyTableState
 	TupleTableSlot *mt_existing;	/* slot to store existing target tuple in */
 	List	   *mt_excludedtlist;	/* the excluded pseudo relation's tlist  */
 	TupleTableSlot *mt_conflproj;	/* CONFLICT ... SET ... projection target */
-	struct PartitionDispatchData **mt_partition_dispatch_info;
 	/* Tuple-routing support info */
-	int			mt_num_dispatch;	/* Number of entries in the above array */
+	struct PartitionTupleRoutingInfo **mt_ptrinfos;
+	int			mt_num_parted;		/* Number of entries in the above array */
 	int			mt_num_partitions;	/* Number of members in the following
 									 * arrays */
 	ResultRelInfo *mt_partitions;	/* Per partition result relation */
-- 
2.11.0

0002-Teach-expand_inherited_rtentry-to-add-partitions-in-.patchtext/plain; charset=UTF-8; name=0002-Teach-expand_inherited_rtentry-to-add-partitions-in-.patchDownload
From 6fc272c637ba1b49f7ac0cba242f997656a3a4ea Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Wed, 26 Jul 2017 13:26:58 +0900
Subject: [PATCH 2/3] Teach expand_inherited_rtentry() to add partitions in
 bound order

---
 src/backend/optimizer/prep/prepunion.c | 45 ++++++++++++++++++++++++++--------
 1 file changed, 35 insertions(+), 10 deletions(-)

diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index cf46b74782..b327dd9ebc 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -33,6 +33,7 @@
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
+#include "catalog/partition.h"
 #include "catalog/pg_inherits_fn.h"
 #include "catalog/pg_type.h"
 #include "miscadmin.h"
@@ -1370,7 +1371,8 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	Oid			parentOID;
 	PlanRowMark *oldrc;
 	Relation	oldrelation;
-	LOCKMODE	lockmode;
+	LOCKMODE	lockmode,
+				child_lockmode;
 	List	   *inhOIDs;
 	List	   *appinfos;
 	ListCell   *l;
@@ -1417,8 +1419,35 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	else
 		lockmode = AccessShareLock;
 
+	child_lockmode = lockmode;
+
+	/*
+	 * Must open the parent relation to examine its tupdesc.  We need not lock
+	 * it; we assume the rewriter already did.
+	 */
+	oldrelation = heap_open(parentOID, NoLock);
+
 	/* Scan for all members of inheritance set, acquire needed locks */
-	inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);
+	if (rte->relkind != RELKIND_PARTITIONED_TABLE)
+	{
+		inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);
+		child_lockmode = NoLock;	/* No need to lock the tables in inhOIDs */
+	}
+	else
+	{
+		List   *ptinfos,
+			   *tmp = NIL;
+
+		RelationGetPartitionDispatchInfo(oldrelation, lockmode,
+										 &ptinfos, &inhOIDs);
+		foreach(l, ptinfos)
+		{
+			PartitionedTableInfo   *ptinfo = lfirst(l);
+
+			tmp = lappend_oid(tmp, ptinfo->relid);
+		}
+		inhOIDs = list_concat(tmp, inhOIDs);
+	}
 
 	/*
 	 * Check that there's at least one descendant, else treat as no-child
@@ -1429,6 +1458,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	{
 		/* Clear flag before returning */
 		rte->inh = false;
+		heap_close(oldrelation, NoLock);
 		return;
 	}
 
@@ -1440,12 +1470,6 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	if (oldrc)
 		oldrc->isParent = true;
 
-	/*
-	 * Must open the parent relation to examine its tupdesc.  We need not lock
-	 * it; we assume the rewriter already did.
-	 */
-	oldrelation = heap_open(parentOID, NoLock);
-
 	/* Scan the inheritance set and expand it */
 	appinfos = NIL;
 	need_append = false;
@@ -1457,9 +1481,9 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 		Index		childRTindex;
 		AppendRelInfo *appinfo;
 
-		/* Open rel if needed; we already have required locks */
+		/* Open rel if needed, taking a lock if a partition (see above) */
 		if (childOID != parentOID)
-			newrelation = heap_open(childOID, NoLock);
+			newrelation = heap_open(childOID, child_lockmode);
 		else
 			newrelation = oldrelation;
 
@@ -1471,6 +1495,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 		 */
 		if (childOID != parentOID && RELATION_IS_OTHER_TEMP(newrelation))
 		{
+			/* Note that not using child_lockmode here. */
 			heap_close(newrelation, lockmode);
 			continue;
 		}
-- 
2.11.0

#131Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Robert Haas (#129)
Re: UPDATE of partition key

On 2017/07/26 6:07, Robert Haas wrote:

On Thu, Jul 13, 2017 at 1:09 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Attached is a WIP patch (make_resultrels_ordered.patch) that generates
the result rels in canonical order. This patch is kept separate from
the update-partition-key patch, and can be applied on master branch.

Thank you for working on this, Amit!

Hmm, I like the approach you've taken here in general,

+1 for the approach.

Is there any real benefit in this "walker" interface? It looks to me
like it might be simpler to just change things around so that it
returns a list of OIDs, like find_all_inheritors, but generated
differently. Then if you want bound-ordering rather than
OID-ordering, you just do this:

list_free(inhOids);
inhOids = get_partition_oids_in_bound_order(rel);

That'd remove the need for some if/then logic as you've currently got
in get_next_child().

Yeah, that would make the code much simpler, so +1 for Robert's idea.

I think we should always expand in bound order rather than only when
it's a result relation. I think for partition-wise join, we're going
to want to do it this way for all relations in the query, or at least
for all relations in the query that might possibly be able to
participate in a partition-wise join. If there are multiple cases
that are going to need this ordering, it's hard for me to accept the
idea that it's worth the complexity of trying to keep track of when we
expanded things in one order vs. another. There are other
applications of having things in bound order too, like MergeAppend ->
Append strength-reduction (which might not be legal anyway if there
are list partitions with multiple, non-contiguous list bounds or if
any NULL partition doesn't end up in the right place in the order, but
there will be lots of cases where it can work).

+1 for that as well. Another benefit from that would be EXPLAIN; we
could display partitions for a partitioned table in the same order for
Append and ModifyTable (ie, SELECT/UPDATE/DELETE), which I think would
make the EXPLAIN result much more readable.

Best regards,
Etsuro Fujita

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#132Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Rajkumar Raghuwanshi (#128)
Re: UPDATE of partition key

On 2017/07/25 21:55, Rajkumar Raghuwanshi wrote:

Got one more observation : update... returning is not working with whole
row reference. please take a look.

postgres=# create table part (a int, b int) partition by range(a);
CREATE TABLE
postgres=# create table part_p1 partition of part for values from
(minvalue) to (0);
CREATE TABLE
postgres=# create table part_p2 partition of part for values from (0) to
(maxvalue);
CREATE TABLE
postgres=# insert into part values (10,1);
INSERT 0 1
postgres=# insert into part values (20,2);
INSERT 0 1
postgres=# update part t1 set a = b returning t1;
ERROR: unexpected whole-row reference found in partition key

That looks like a bug which exists in HEAD too. I posted a patch in a
dedicated thread to address the same [1].

Thanks,
Amit

[1]: /messages/by-id/9a39df80-871e-6212-0684-f93c83be4097@lab.ntt.co.jp


#133Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#129)
Re: UPDATE of partition key

On 26 July 2017 at 02:37, Robert Haas <robertmhaas@gmail.com> wrote:

Is there any real benefit in this "walker" interface? It looks to me
like it might be simpler to just change things around so that it
returns a list of OIDs, like find_all_inheritors, but generated
differently. Then if you want bound-ordering rather than
OID-ordering, you just do this:

list_free(inhOids);
inhOids = get_partition_oids_in_bound_order(rel);

That'd remove the need for some if/then logic as you've currently got
in get_next_child().

Yes, I had considered that, i.e., first generating just a list of
bound-ordered oids. But that consequently requires all the child tables
to be opened and closed twice: once during the list generation, and
again while expanding the partitioned table. Agreed, the second time
heap_open() would not be that expensive because the tables would be
cached, but it would still require fetching the cached relation handle
from the hash table. Since we anyway want to open the tables, it is
better to have a *next() function that fetches the next partition in a
fixed order.

Actually, there isn't much that the walker next() function does. Any
code that wants to traverse bound-wise can do that on its own. The
walker function is just a convenient way to make sure everyone
traverses in the same order.
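
As a rough illustration of the interface shape being discussed, here is a
self-contained C sketch of such a next()-style walker over a toy partition
tree. Only the get_next_child() name comes from the patch under discussion;
the Child and ChildWalker structures, and the traversal details, are
invented for this sketch, and the real implementation reads from the
relcache instead.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PARTS 16

/* Toy stand-ins for table OIDs and the partition hierarchy. */
typedef struct Child
{
	unsigned int relid;				/* table OID */
	int			 nsubs;				/* number of direct sub-partitions */
	struct Child *subs[MAX_PARTS];	/* sub-partitions in bound order */
} Child;

typedef struct ChildWalker
{
	Child  *queue[MAX_PARTS];
	int		head;
	int		tail;
} ChildWalker;

static void
walker_init(ChildWalker *w, Child *root)
{
	w->head = w->tail = 0;
	w->queue[w->tail++] = root;
}

/*
 * Hand back the next table in a fixed (breadth-first, bound-ordered)
 * traversal, or NULL when the tree is exhausted.  The caller opens each
 * relation exactly once, as it is returned.
 */
static Child *
get_next_child(ChildWalker *w)
{
	Child  *c;
	int		i;

	if (w->head >= w->tail)
		return NULL;
	c = w->queue[w->head++];
	for (i = 0; i < c->nsubs; i++)
		w->queue[w->tail++] = c->subs[i];
	return c;
}
```

The point of the stateful walker, as opposed to returning a complete list,
is that each relation is handed out exactly once, so the caller never needs
a second open/close pass.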

Yet to go over other things including your review comments, and Amit
Langote's patch on refactoring RelationGetPartitionDispatchInfo().

On another note, did you do anything about the suggestion Thomas made
in /messages/by-id/CAEepm=3sc_j1zwqDYrbU4DTfX5rHcaMNNuaXRKWZFgt9m23OcA@mail.gmail.com
?

This is still pending on me; plus I think there are some more points.
I need to go over those and consolidate a list of todos.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


#134Robert Haas
robertmhaas@gmail.com
In reply to: Amit Langote (#130)
Re: UPDATE of partition key

On Wed, Jul 26, 2017 at 2:13 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

Sorry to be responding this late to the Amit's make_resultrel_ordered
patch itself, but I agree that we should teach the planner to *always*
expand partitioned tables in the partition bound order.

Sounds like we have unanimous agreement on that point. Yesterday, I
was discussing with Beena Emerson, who is working on run-time
partition pruning, that it would also be useful for that purpose, if
you're trying to prune based on a range query.

I checked that we get the same result relation order with both the
patches, but I would like to highlight a notable difference here between
the approaches taken by our patches. In my patch, I have now taught
RelationGetPartitionDispatchInfo() to lock *only* the partitioned tables
in the tree, because we need to look at its partition descriptor to
collect partition OIDs and bounds. We can defer locking (and opening the
relation descriptor of) leaf partitions to a point where planner has
determined that the partition will be accessed after all (not pruned),
which will be done in a separate patch of course.

That's very desirable, but I believe it introduces a deadlock risk
which Amit's patch avoids. A transaction using the code you've
written here is eventually going to lock all partitions, BUT it's
going to move the partitioned ones to the front of the locking order
vs. what find_all_inheritors would do. So, when multi-level
partitioning is in use, I think it could happen that some other
transaction is accessing the table using a different code path that
uses the find_all_inheritors order without modification. If those
locks conflict (e.g. query vs. DROP) then there's a deadlock risk.

Unfortunately I don't see any easy way around that problem, but maybe
somebody else has an idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#135Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#134)
Re: UPDATE of partition key

On 28 July 2017 at 20:10, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jul 26, 2017 at 2:13 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

Sorry to be responding this late to the Amit's make_resultrel_ordered
patch itself, but I agree that we should teach the planner to *always*
expand partitioned tables in the partition bound order.

Sounds like we have unanimous agreement on that point.

I too agree.

I checked that we get the same result relation order with both the
patches, but I would like to highlight a notable difference here between
the approaches taken by our patches. In my patch, I have now taught
RelationGetPartitionDispatchInfo() to lock *only* the partitioned tables
in the tree, because we need to look at its partition descriptor to
collect partition OIDs and bounds. We can defer locking (and opening the
relation descriptor of) leaf partitions to a point where planner has
determined that the partition will be accessed after all (not pruned),
which will be done in a separate patch of course.

With Amit Langote's patch, we can very well do the locking beforehand
by find_all_inheritors(), and then run
RelationGetPartitionDispatchInfo() with noLock, so as to remove the
deadlock problem. But I think we should keep these two tasks separate,
i.e. expanding the partition tree in bound order, and making
RelationGetPartitionDispatchInfo() work for the planner.

Regarding building the PartitionDispatchInfo in the planner, we should
do that only after it is known that partition columns are updated, so
it can't be done in expand_inherited_rtentry() because it would be too
soon. For planner setup, RelationGetPartitionDispatchInfo() should
just build the tupmap for each partitioned table, and then initialize
the rest of the fields like tupslot, reldesc, etc. later during
execution.

So for now, I feel we should just do the changes for making sure the
order is same, and then over that, separately modify
RelationGetPartitionDispatchInfo() for planner.

That's very desirable, but I believe it introduces a deadlock risk
which Amit's patch avoids. A transaction using the code you've
written here is eventually going to lock all partitions, BUT it's
going to move the partitioned ones to the front of the locking order
vs. what find_all_inheritors would do. So, when multi-level
partitioning is in use, I think it could happen that some other
transaction is accessing the table using a different code path that
uses the find_all_inheritors order without modification. If those
locks conflict (e.g. query vs. DROP) then there's a deadlock risk.

Yes, I agree. Even with single-level partitioning, the leaf partitions
returned by find_all_inheritors() are ordered by oid values, so that's
also going to be ordered differently.

Unfortunately I don't see any easy way around that problem, but maybe
somebody else has an idea.

One approach I had considered was to have find_inheritance_children()
itself lock the children in bound order, so that everyone will have
bound-ordered oids, but that would be too expensive since it requires
opening all partitioned tables to initialize partition descriptors. In
find_inheritance_children(), we get all oids without opening any
tables. But now that I think more of it, it's only the partitioned
tables that we have to open, not the leaf partitions; and furthermore,
I didn't see calls to find_inheritance_children() and
find_all_inheritors() in performance-critical code, except in
expand_inherited_rtentry(). All of them are in DDL commands; but yes,
that can change in the future.

Regarding dynamically locking specific partitions as and when needed,
I think this method inherently has the issue of deadlock because the
order would be random. So it feels like there is no way around it other
than to lock all partitions beforehand.

----------------

Regarding using the first resultrel for mapping RETURNING and WCO, I think
we can use (a renamed) getASTriggerResultRelInfo() to get the root
result relation, and use the WCO and RETURNING expressions of this
relation to do the mapping for child rels. This way, there won't be
insert/update-specific code, and we don't need to use the first result
relation.

While checking the whole-row bug on the other thread [1], I noticed
that the RETURNING/WCO expressions for the per-subplan result rels are
formed by considering not just simple vars, but also whole row vars
and other nodes. So for update-tuple-routing, there would be some
result-rels WCOs formed using adjust_appendrel_attrs(), while for
others, they would be built using map_partition_varattnos() which only
considers simple vars. So the bug in [1] would be there for
update-partition-key as well, when the tuple is routed into a newly
built resultrel. Maybe, while fixing the bug in [1], this might be
automatically solved.

----------------

Below are the TODOs at this point:

Fix for bug reported by Rajkumar about update with join.
Do something about two separate mapping tables for Transition tables
and update tuple-routing.
GetUpdatedColumns() to be moved to header file.
More test scenarios in regression tests.
Need to check/test whether we are correctly applying insert policies
(and not update) while inserting a routed tuple.
Use getASTriggerResultRelInfo() for attrno mapping, rather than first
resultrel, for generating child WCO/RETURNING expression.
Address Robert's review comments on make_resultrel_ordered.patch.
pgindent.

[1]: /messages/by-id/d86d27ea-cc9d-5dbe-b131-e7dec4017983@lab.ntt.co.jp

Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


#136Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Khandekar (#135)
Re: UPDATE of partition key

On 2017/07/29 2:45, Amit Khandekar wrote:

On 28 July 2017 at 20:10, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jul 26, 2017 at 2:13 AM, Amit Langote wrote:

I checked that we get the same result relation order with both the
patches, but I would like to highlight a notable difference here between
the approaches taken by our patches. In my patch, I have now taught
RelationGetPartitionDispatchInfo() to lock *only* the partitioned tables
in the tree, because we need to look at its partition descriptor to
collect partition OIDs and bounds. We can defer locking (and opening the
relation descriptor of) leaf partitions to a point where planner has
determined that the partition will be accessed after all (not pruned),
which will be done in a separate patch of course.

That's very desirable, but I believe it introduces a deadlock risk
which Amit's patch avoids. A transaction using the code you've
written here is eventually going to lock all partitions, BUT it's
going to move the partitioned ones to the front of the locking order
vs. what find_all_inheritors would do. So, when multi-level
partitioning is in use, I think it could happen that some other
transaction is accessing the table using a different code path that
uses the find_all_inheritors order without modification. If those
locks conflict (e.g. query vs. DROP) then there's a deadlock risk.

Yes, I agree. Even with single-level partitioning, the leaf partitions
ordered by find_all_inheritors() is by oid values, so that's also
going to be differently ordered.

We do require to lock the parent first in any case. Doesn't that prevent
deadlocks by imparting an implicit order on locking for operations whose
locks conflict?

Having said that, I think it would be desirable for all code paths to
manipulate partitions in the same order. For partitioned tables, I think
we can make it the partition bound order by replacing all calls to
find_all_inheritors and find_inheritance_children on partitioned table
parents with something else that reads partition OIDs from the relcache
(PartitionDesc) and traverses the partition tree in a breadth-first manner.
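
The breadth-first collection described above can be sketched as follows.
This is a self-contained toy model, not PostgreSQL code: PartNode and
collect_oids_bfs() are invented names, and the real implementation would
read each level's children from the relcache (PartitionDesc), already in
bound order.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 16

/* Toy model of a partition tree. */
typedef struct PartNode
{
	unsigned int relid;					/* table OID */
	int			 nparts;				/* number of direct partitions */
	struct PartNode *parts[MAX_NODES];	/* children in bound order */
} PartNode;

/*
 * Collect OIDs breadth-first: a parent always precedes its children, and
 * siblings appear in partition bound order.  This yields one canonical
 * order that all callers could share.  Returns the number of OIDs
 * written to 'out'.
 */
static int
collect_oids_bfs(PartNode *root, unsigned int *out)
{
	PartNode   *queue[MAX_NODES];
	int			head = 0,
				tail = 0,
				n = 0;

	queue[tail++] = root;
	while (head < tail)
	{
		PartNode   *node = queue[head++];
		int			i;

		out[n++] = node->relid;
		for (i = 0; i < node->nparts; i++)
			queue[tail++] = node->parts[i];
	}
	return n;
}
```

For the example tree from the start of the thread (tab -> t1 (t1_1, t1_2),
t2 (t2_1, t2_2)), this visits tab, t1, t2 first and then the leaves in
bound order.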

Unfortunately I don't see any easy way around that problem, but maybe
somebody else has an idea.

One approach I had considered was to have find_inheritance_children()
itself lock the children in bound order, so that everyone will have
bound-ordered oids, but that would be too expensive since it requires
opening all partitioned tables to initialize partition descriptors. In
find_inheritance_children(), we get all oids without opening any
tables. But now that I think more of it, it's only the partitioned
tables that we have to open, not the leaf partitions; and furthermore,
I didn't see calls to find_inheritance_children() and
find_all_inheritors() in performance-critical code, except in
expand_inherited_rtentry(). All of them are in DDL commands; but yes,
that can change in the future.

This approach more or less amounts to calling the new
RelationGetPartitionDispatchInfo() (per my proposed patch, a version of
which I posted upthread). Maybe we can add a wrapper on top, say,
get_all_partition_oids(), which throws away the other things that
RelationGetPartitionDispatchInfo() returns. In addition, it would lock all
the partitions that are returned, not only the partitioned ones, which is
what RelationGetPartitionDispatchInfo() has been taught to do.

Regarding dynamically locking specific partitions as and when needed,
I think this method inherently has the issue of deadlock because the
order would be random. So it feels like there is no way around other
than to lock all partitions beforehand.

I'm not sure why the order has to be random. If and when we decide to
open and lock a subset of partitions for a given query, it will be done in
some canonical order as far as I can imagine. Do you have some specific
example in mind?

Thanks,
Amit


#137Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Langote (#136)
Re: UPDATE of partition key

On 2 August 2017 at 14:38, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/07/29 2:45, Amit Khandekar wrote:

On 28 July 2017 at 20:10, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jul 26, 2017 at 2:13 AM, Amit Langote wrote:

I checked that we get the same result relation order with both the
patches, but I would like to highlight a notable difference here between
the approaches taken by our patches. In my patch, I have now taught
RelationGetPartitionDispatchInfo() to lock *only* the partitioned tables
in the tree, because we need to look at its partition descriptor to
collect partition OIDs and bounds. We can defer locking (and opening the
relation descriptor of) leaf partitions to a point where planner has
determined that the partition will be accessed after all (not pruned),
which will be done in a separate patch of course.

That's very desirable, but I believe it introduces a deadlock risk
which Amit's patch avoids. A transaction using the code you've
written here is eventually going to lock all partitions, BUT it's
going to move the partitioned ones to the front of the locking order
vs. what find_all_inheritors would do. So, when multi-level
partitioning is in use, I think it could happen that some other
transaction is accessing the table using a different code path that
uses the find_all_inheritors order without modification. If those
locks conflict (e.g. query vs. DROP) then there's a deadlock risk.

Yes, I agree. Even with single-level partitioning, the leaf partitions
ordered by find_all_inheritors() is by oid values, so that's also
going to be differently ordered.

We do require to lock the parent first in any case. Doesn't that prevent
deadlocks by imparting an implicit order on locking by operations whose
locks conflict.

Maybe, but I am not too sure at this point. find_all_inheritors()
locks only the children; the parent is already locked separately. And
find_all_inheritors() does not require locking the children with the
same lockmode as the parent.

Having said that, I think it would be desirable for all code paths to
manipulate partitions in the same order. For partitioned tables, I think
we can make it the partition bound order by replacing all calls to
find_all_inheritors and find_inheritance_children on partitioned table
parents with something else that reads partition OIDs from the relcache
(PartitionDesc) and traverses the partition tree breadth-first manner.

Unfortunately I don't see any easy way around that problem, but maybe
somebody else has an idea.

One approach I had considered was to have find_inheritance_children()
itself lock the children in bound order, so that everyone will have
bound-ordered oids, but that would be too expensive since it requires
opening all partitioned tables to initialize partition descriptors. In
find_inheritance_children(), we get all oids without opening any
tables. But now that I think more of it, it's only the partitioned
tables that we have to open, not the leaf partitions; and furthermore,
I didn't see calls to find_inheritance_children() and
find_all_inheritors() in performance-critical code, except in
expand_inherited_rtentry(). All of them are in DDL commands; but yes,
that can change in the future.

This approach more or less amounts to calling the new
RelationGetPartitionDispatchInfo() (per my proposed patch, a version of
which I posted upthread.) Maybe we can add a wrapper on top, say,
get_all_partition_oids() which throws away other things that
RelationGetPartitionDispatchInfo() returned. In addition it locks all the
partitions that are returned, unlike only the partitioned ones, which is
what RelationGetPartitionDispatchInfo() has been taught to do.

So there are three different task items here :
1. Arrange the oids in consistent order everywhere.
2. Prepare the Partition Dispatch Info data structure in the planner
as against during execution.
3. For update tuple routing, assume that the result rels are ordered
consistently to make the searching efficient.

#3 depends on #1. So for that, I have come up with a minimal set of
changes to have expand_inherited_rtentry() generate the rels in bound
order. When we do #2, we may need to re-do my changes in
expand_inherited_rtentry(), but those are minimal. We may even end up
having the walker function used in multiple places, but right now that
is not certain.

So, I think we can continue the discussion about #1 and #2 in a separate thread.

Regarding dynamically locking specific partitions as and when needed,
I think this method inherently has the issue of deadlock because the
order would be random. So it feels like there is no way around other
than to lock all partitions beforehand.

I'm not sure why the order has to be random. If and when we decide to
open and lock a subset of partitions for a given query, it will be done in
some canonical order as far as I can imagine. Do you have some specific
example in mind?

Partitioned table t1 has partitions t1p1 and t1p2
Partitioned table t2 at the same level has partitions t2p1 and t2p2
Tuple routing causes the first row to be inserted into t2p2, so t2p2 is
locked. The next insert locks t1p1 because it inserts into t1p1.
But at the same time, somebody does DDL on some parent common to t1
and t2, so it locks the leaf partitions in a fixed specific order,
which would be different than the insert lock order because that order
depended upon the order of tables that the insert rows were routed to.
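
The standard way out of the scenario above is for every code path to
acquire its locks in one shared canonical order. Here is a self-contained
C sketch of that principle, using a plain OID sort to stand in for the
canonical order; in PostgreSQL the order under discussion would be the
partition bound order, and cmp_oid/canonical_lock_order are invented names
for illustration only.

```c
#include <assert.h>
#include <stdlib.h>

/* Comparator for unsigned OIDs, avoiding overflow from subtraction. */
static int
cmp_oid(const void *a, const void *b)
{
	unsigned int x = *(const unsigned int *) a;
	unsigned int y = *(const unsigned int *) b;

	return (x > y) - (x < y);
}

/*
 * Sort the set of tables a session intends to lock into one global
 * canonical order before acquiring any locks.  If every session does
 * this, no two sessions can each hold a lock the other is waiting for,
 * so the lock-ordering deadlock described above cannot arise.
 */
static void
canonical_lock_order(unsigned int *oids, int n)
{
	qsort(oids, n, sizeof(unsigned int), cmp_oid);
}
```

With this discipline, the insert session that would have locked t2p2
before t1p1 instead locks them in the same order the concurrent DDL does,
regardless of which partition its rows happened to be routed to first.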

Thanks,
Amit

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


#138Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Khandekar (#137)
Re: UPDATE of partition key

On 2017/08/02 19:49, Amit Khandekar wrote:

On 2 August 2017 at 14:38, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

One approach I had considered was to have find_inheritance_children()
itself lock the children in bound order, so that everyone will have
bound-ordered oids, but that would be too expensive since it requires
opening all partitioned tables to initialize partition descriptors. In
find_inheritance_children(), we get all oids without opening any
tables. But now that I think more of it, it's only the partitioned
tables that we have to open, not the leaf partitions; and furthermore,
I didn't see calls to find_inheritance_children() and
find_all_inheritors() in performance-critical code, except in
expand_inherited_rtentry(). All of them are in DDL commands; but yes,
that can change in the future.

This approach more or less amounts to calling the new
RelationGetPartitionDispatchInfo() (per my proposed patch, a version of
which I posted upthread.) Maybe we can add a wrapper on top, say,
get_all_partition_oids() which throws away other things that
RelationGetPartitionDispatchInfo() returned. In addition it locks all the
partitions that are returned, unlike only the partitioned ones, which is
what RelationGetPartitionDispatchInfo() has been taught to do.

So there are three different task items here :
1. Arrange the oids in consistent order everywhere.
2. Prepare the Partition Dispatch Info data structure in the planner
as against during execution.
3. For update tuple routing, assume that the result rels are ordered
consistently to make the searching efficient.

That's a good breakdown.

#3 depends on #1. So for that, I have come up with a minimum set of
changes to have expand_inherited_rtentry() generate the rels in bound
order. When we do #2 , it may be possible that we may need to re-do my
changes in expand_inherited_rtentry(), but those are minimum. We may
even end up having the walker function being used at multiple places,
but right now it is not certain.

So AFAICS:

For performance reasons, we want the order in which leaf partition
sub-plans appear in the ModifyTable node (and subsequently leaf partition
ResultRelInfos in ModifyTableState) to be some known canonical order. That's
because we want to map partitions in the insert tuple-routing data
structure (which appear in a known canonical order as determined by
RelationGetPartititionDispatchInfo) to those appearing in the
ModifyTableState. That's so that we can reuse the planner-generated WCO
and RETURNING lists in the insert code path when update tuple-routing
invokes that path.

To implement that, planner should retrieve the list of leaf partition OIDs
in the same order as ExecSetupPartitionTupleRouting() retrieves them.
Because the latter calls RelationGetPartitionDispatchInfo on the root
partitioned table, maybe the planner should do that too, instead of its
current method getting OIDs using find_all_inheritors(). But it's
currently not possible due to the way RelationGetPartitionDispatchInfo()
and involved data structures are designed.

One way forward I see is to invent new interface functions:

List *get_all_partition_oids(Oid, LOCKMODE)
List *get_partition_oids(Oid, LOCKMODE)

that resemble find_all_inheritors() and find_inheritance_children(),
respectively, but expect that users make sure they are called only
for partitioned tables. Needless to mention, OIDs are returned with
canonical order determined by that of the partition bounds and partition
tree structure. We replace all the calls of the old interface functions
with the respective new ones. That means expand_inherited_rtentry (among
others) now calls get_all_partition_oids() if the RTE is for a partitioned
table and find_all_inheritors() otherwise.

So, I think we can continue the discussion about #1 and #2 in a separate thread.

I have started a new thread named "expanding inheritance in partition
bound order" and posted a couple of patches [1].

After applying those patches, you can write code for #3 without having to
worry about the concerns of partition order, which I guess you've already
done.

Regarding dynamically locking specific partitions as and when needed,
I think this method inherently has the issue of deadlock because the
order would be random. So it feels like there is no way around other
than to lock all partitions beforehand.

I'm not sure why the order has to be random. If and when we decide to
open and lock a subset of partitions for a given query, it will be done in
some canonical order as far as I can imagine. Do you have some specific
example in mind?

Partitioned table t1 has partitions t1p1 and t1p2
Partitioned table t2 at the same level has partitions t2p1 and t2p2
Tuple routing causes the first row to insert into t2p2, so t2p2 is locked.
Next insert locks t1p1 because it inserts into t1p1.
But at the same time, somebody does DDL on some parent common to t1
and t2, so it locks the leaf partitions in a fixed specific order,
which would be different than the insert lock order because that order
depended upon the order of tables that the insert rows were routed to.

Note that we don't currently do this. That is, lock partitions in an
order determined by incoming rows. ExecSetupPartitionTupleRouting() locks
(RowExclusiveLock) all the partitions beforehand in the partition bound
order. Any future patch that wants to delay locking and opening the
relation descriptor of a leaf partition until a tuple is actually routed
to it will have to think hard about the deadlock problem you illustrate above.

Aside from the insert case, let's consider locking order when planning a
select on a partitioned table. We currently lock all the partitions in
advance in expand_inherited_rtentry(). When replacing the current method
by some new way, we will first determine all the partitions that satisfy a
given query, collect them in an ordered list (some fixed canonical order),
and lock them in that order.

But maybe, I misunderstood what you said?

Thanks,
Amit

[1]: /messages/by-id/0118a1f2-84bb-19a7-b906-dec040a206f2@lab.ntt.co.jp

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#139Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#135)
1 attachment(s)
Re: UPDATE of partition key

Below are the TODOs at this point:

Fix for bug reported by Rajkumar about update with join.

I had explained the root issue of this bug here: [1].

Attached patch includes the fix, which is explained below.
Currently in the patch, there is a check for whether the tuple has been
concurrently deleted by another session, i.e. when heap_update()
returns HeapTupleUpdated. In such a case we set the
concurrently_deleted output param to true. We should also do the same
for the HeapTupleSelfUpdated return value.

In fact, there are other places in ExecDelete() where it can return
without doing anything. For example, if a BR DELETE trigger prevents the
delete from happening, ExecBRDeleteTriggers() returns false, in which
case ExecDelete() returns.

So what the fix does is : rename concurrently_deleted parameter to
delete_skipped so as to indicate a more general status : whether
delete has actually happened or was it skipped. And set this param to
true only after the delete happens. This allows us to avoid adding a
new row for the trigger case also.

Added test scenario for UPDATE with JOIN case, and also TRIGGER case.

Do something about two separate mapping tables for Transition tables
and update tuple-routing.

On 1 July 2017 at 03:15, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

Would it make sense to have a set of functions with names like
GetConvertor{From,To}{Subplan,Leaf}(mtstate, index) which build arrays
m_convertors_{from,to}_by_{subplan,leaf} the first time they need
them?

This was discussed here: [2]. I think even if we have them built when
needed, still in presence of both tuple routing and transition tables,
we do need separate arrays. So I think rather than dynamic arrays, we
can have static arrays but their elements will point to a shared
TupleConversionMap structure whenever possible.
As already in the patch, in case of insert/update tuple routing, there
is a per-leaf partition mt_transition_tupconv_maps array for
transition tables, and a separate per-subplan array mt_resultrel_maps
for update tuple routing. *But*, what I am proposing is: for the
mt_transition_tupconv_maps[] element for which the leaf partition also
exists as a per-subplan result, that array element and the
mt_resultrel_maps[] element will point to the same TupleConversionMap
structure.

This is quite similar to how we are re-using the per-subplan
resultrels for the per-leaf result rels. We will re-use the
per-subplan TupleConversionMap for the per-leaf
mt_transition_tupconv_maps[] elements.

I have not yet implemented this.

GetUpdatedColumns() to be moved to header file.

Done. I have moved it to execnodes.h.

More test scenarios in regression tests.
Need to check/test whether we are correctly applying insert policies
(and not update) while inserting a routed tuple.

Yet to do above two.

Use getASTriggerResultRelInfo() for attrno mapping, rather than first
resultrel, for generating child WCO/RETURNING expression.

Regarding generating child WithCheckOption and Returning expressions
using those of the root result relation, ModifyTablePath and
ModifyTable should have new fields rootReturningList (and
rootWithCheckOptions) which would be derived from
root->parse->returningList in inheritance_planner(). But then, similar
to per-subplan returningList, rootReturningList would have to pass
through set_plan_refs()=>set_returning_clause_references() which
requires the subplan targetlist to be passed. Because of this, for
rootReturningList, we require a subplan for root partition, which is
not there currently; we have subplans only for child rels. That means
we would have to create such plan only for the sake of generating
rootReturningList.

The other option is to do the way the patch is currently doing in the
executor by using the returningList of the first per-subplan result
rel to generate the other child returningList (and WithCheckOption).
This is working by applying map_partition_varattnos() to the first
returningList. But now that we realized that we have to specially
handle whole-row vars, map_partition_varattnos() would need some
changes to convert whole row vars differently for
child-rel-to-child-rel mapping. For childrel-to-childrel conversion,
the whole-row var is already wrapped by ConvertRowtypeExpr, but we
need to change its Var->vartype to the new child vartype.

I think the second option looks easier, but I am open to suggestions,
and I am myself still checking the first one.

Address Robert's review comments on make_resultrel_ordered.patch.

+typedef struct ParentChild

This is a pretty generic name. Pick something more specific and informative.

I have used ChildPartitionInfo. But suggestions welcome.

+static List *append_rel_partition_oids(List *rel_list, Relation rel);

One could be forgiven for thinking that this function was just going
to append OIDs, but it actually appends ParentChild structures, so I
think the name needs work.

Renamed it to append_child_partitions().

+List *append_rel_partition_oids(List *rel_list, Relation rel)

Style. Please pgindent your patches.

I have pgindent'ed changes in nodeModifyTable.c and partition.c, yet
to do that for others.

+#ifdef DEBUG_PRINT_OIDS
+    print_oids(*leaf_part_oids);
+#endif

I'd just rip out this debug stuff once you've got this working, but if
we keep it, it certainly can't have a name as generic as print_oids()
when it's actually doing something with a list of ParentChild
structures. Also, it prints names, not OIDs. And DEBUG_PRINT_OIDS is
no good for the same reasons.

Now that I have tested it, I have removed this. Also, the ordered
subplans printed in explain output serve the same purpose.

+    if (RelationGetPartitionDesc(rel))
+        walker->rels_list = append_rel_partition_oids(walker->rels_list, rel);

Every place that calls append_rel_partition_oids guards that call with
if (RelationGetPartitionDesc(...)). It seems to me that it would be
simpler to remove those tests and instead just replace the
Assert(partdesc) inside that function with if (!partdesc) return;

Done.

Is there any real benefit in this "walker" interface? It looks to me
like it might be simpler to just change things around so that it
returns a list of OIDs, like find_all_inheritors, but generated
differently. Then if you want bound-ordering rather than
OID-ordering, you just do this:

list_free(inhOids);
inhOids = get_partition_oids_in_bound_order(rel);

That'd remove the need for some if/then logic as you've currently got
in get_next_child().

I have explained this here:
/messages/by-id/CAJ3gD9dQ2FKes8pP6aM-4Tx3ngqWvD8oyOJiDRxLVoQiY76t0A@mail.gmail.com
I am aware that this might get changed once we check in the separate
patch just floated to expand inheritance in bound order.

+    is_partitioned_resultrel =
+        (oldrelation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE
+         && rti == parse->resultRelation);

I suspect this isn't correct for a table that contains wCTEs, because
there would in that case be multiple result relations.

I think we should always expand in bound order rather than only when
it's a result relation.

Have changed it to always expand in bound order for partitioned tables.

[1]: /messages/by-id/CAKcux6=z38gH4K6YAFi+Yvo5tHTwBL4tam4VM33CAPZ5dDMk1Q@mail.gmail.com

[2]: /messages/by-id/CAEepm=3sc_j1zwqDYrbU4DTfX5rHcaMNNuaXRKWZFgt9m23OcA@mail.gmail.com

Attachments:

update-partition-key_v14.patch (application/octet-stream)
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index b05a9c2..5a436a1 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -2993,6 +2993,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition where this row satisfies its partition constraint.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3285,9 +3290,20 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</> causes a row to move from one partition to
+       another, there is a chance that another concurrent <command>UPDATE</> or
+       <command>DELETE</> misses this row. Suppose, during the row movement,
+       the row is still visible for the concurrent session, and it is about to
+       do an <command>UPDATE</> or <command>DELETE</> operation on the same
+       row. This DML operation can silently miss this row if the row now gets
+       deleted from the partition by the first session as part of its
+       <command>UPDATE</> row movement. In such case, the concurrent
+       <command>UPDATE</>/<command>DELETE</>, being unaware of the row
+       movement, interprets that the row has just been deleted so there is
+       nothing to be done for this row. Whereas, in the usual case where the
+       table is not partitioned, or where there is no row movement, the second
+       session would have identified the newly updated row and carried
+       <command>UPDATE</>/<command>DELETE</> on this new row version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 8a1619f..28cfc1a 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</> and
+   <command>INSERT</> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</> or <command>DELETE</> on the same row may miss
+   this row. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index 950245d..72300a0 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -160,6 +160,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it will be performed as a
+    <command>DELETE</command> from the original partition followed by
+    <command>INSERT</command> into the new partition. In this case, all
+    row-level <literal>BEFORE</> <command>UPDATE</command> triggers and all
+    row-level <literal>BEFORE</> <command>DELETE</command> triggers are fired
+    on the original partition. Then all row-level <literal>BEFORE</>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</>
+    triggers are concerned, <literal>AFTER</> <command>DELETE</command> and
+    <literal>AFTER</> <command>INSERT</command> triggers are applied; but
+    <literal>AFTER</> <command>UPDATE</command> triggers are not applied
+    because the <command>UPDATE</command> has been converted to a
+    <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index dcc7f8a..9feaa8c 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -105,6 +105,16 @@ typedef struct PartitionRangeBound
 	bool		lower;			/* this is the lower (vs upper) bound */
 } PartitionRangeBound;
 
+/*
+ * List of these elements is prepared while traversing a partition tree,
+ * so as to get a consistent order of partitions.
+ */
+typedef struct ChildPartitionInfo
+{
+	Oid			reloid;
+	Relation	parent;			/* Parent relation of reloid */
+}			ChildPartitionInfo;
+
 static int32 qsort_partition_list_value_cmp(const void *a, const void *b,
 							   void *arg);
 static int32 qsort_partition_rbound_cmp(const void *a, const void *b,
@@ -140,6 +150,8 @@ static int partition_bound_bsearch(PartitionKey key,
 						PartitionBoundInfo boundinfo,
 						void *probe, bool probe_is_bound, bool *is_equal);
 
+static List *append_child_partitions(List *rel_list, Relation rel);
+
 /*
  * RelationBuildPartitionDesc
  *		Form rel's partition descriptor
@@ -893,7 +905,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
 
 /*
  * map_partition_varattnos - maps varattno of any Vars in expr from the
- * parent attno to partition attno.
+ * attno's of 'from_rel' partition to the attno's of 'to_rel' partition.
+ * The rels can be both leaf partition or a partitioned table.
  *
  * We must allow for cases where physical attnos of a partition can be
  * different from the parent's.
@@ -906,8 +919,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
  * are working on Lists, so it's less messy to do the casts internally.
  */
 List *
-map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row)
 {
 	AttrNumber *part_attnos;
@@ -916,14 +929,14 @@ map_partition_varattnos(List *expr, int target_varno,
 	if (expr == NIL)
 		return NIL;
 
-	part_attnos = convert_tuples_by_name_map(RelationGetDescr(partrel),
-											 RelationGetDescr(parent),
+	part_attnos = convert_tuples_by_name_map(RelationGetDescr(to_rel),
+											 RelationGetDescr(from_rel),
 											 gettext_noop("could not convert row type"));
 	expr = (List *) map_variable_attnos((Node *) expr,
-										target_varno, 0,
+										fromrel_varno, 0,
 										part_attnos,
-										RelationGetDescr(parent)->natts,
-										RelationGetForm(partrel)->reltype,
+										RelationGetDescr(from_rel)->natts,
+										RelationGetForm(to_rel)->reltype,
 										&my_found_whole_row);
 	if (found_whole_row)
 		*found_whole_row = my_found_whole_row;
@@ -976,21 +989,6 @@ get_partition_qual_relid(Oid relid)
 }
 
 /*
- * Append OIDs of rel's partitions to the list 'partoids' and for each OID,
- * append pointer rel to the list 'parents'.
- */
-#define APPEND_REL_PARTITION_OIDS(rel, partoids, parents) \
-	do\
-	{\
-		int		i;\
-		for (i = 0; i < (rel)->rd_partdesc->nparts; i++)\
-		{\
-			(partoids) = lappend_oid((partoids), (rel)->rd_partdesc->oids[i]);\
-			(parents) = lappend((parents), (rel));\
-		}\
-	} while(0)
-
-/*
  * RelationGetPartitionDispatchInfo
  *		Returns information necessary to route tuples down a partition tree
  *
@@ -1002,11 +1000,13 @@ PartitionDispatch *
 RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
 								 int *num_parted, List **leaf_part_oids)
 {
+	PartitionWalker walker;
 	PartitionDispatchData **pd;
-	List	   *all_parts = NIL,
-			   *all_parents = NIL,
-			   *parted_rels,
+	Relation	partrel;
+	Relation	parent;
+	List	   *parted_rels,
 			   *parted_rel_parents;
+	List	   *inhOIDs;
 	ListCell   *lc1,
 			   *lc2;
 	int			i,
@@ -1017,21 +1017,28 @@ RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
 	 * Lock partitions and make a list of the partitioned ones to prepare
 	 * their PartitionDispatch objects below.
 	 *
-	 * Cannot use find_all_inheritors() here, because then the order of OIDs
-	 * in parted_rels list would be unknown, which does not help, because we
-	 * assign indexes within individual PartitionDispatch in an order that is
-	 * predetermined (determined by the order of OIDs in individual partition
-	 * descriptors).
+	 * Must call find_all_inheritors() here so as to lock the partitions in a
+	 * consistent order (by oid values) to prevent deadlocks. But we assign
+	 * indexes within individual PartitionDispatch in a different order
+	 * (determined by the order of OIDs in individual partition descriptors).
+	 * So, rather than using the oids returned by find_all_inheritors(), we
+	 * generate canonically ordered oids using partition walker.
 	 */
+	inhOIDs = find_all_inheritors(RelationGetRelid(rel), lockmode, NULL);
+	list_free(inhOIDs);
+
+	partition_walker_init(&walker, rel);
+	parent = NULL;
 	*num_parted = 1;
 	parted_rels = list_make1(rel);
 	/* Root partitioned table has no parent, so NULL for parent */
 	parted_rel_parents = list_make1(NULL);
-	APPEND_REL_PARTITION_OIDS(rel, all_parts, all_parents);
-	forboth(lc1, all_parts, lc2, all_parents)
+
+	/* Go to the next partition */
+	partrel = partition_walker_next(&walker, &parent);
+
+	for (; partrel != NULL; partrel = partition_walker_next(&walker, &parent))
 	{
-		Relation	partrel = heap_open(lfirst_oid(lc1), lockmode);
-		Relation	parent = lfirst(lc2);
 		PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
 
 		/*
@@ -1043,7 +1050,6 @@ RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
 			(*num_parted)++;
 			parted_rels = lappend(parted_rels, partrel);
 			parted_rel_parents = lappend(parted_rel_parents, parent);
-			APPEND_REL_PARTITION_OIDS(partrel, all_parts, all_parents);
 		}
 		else
 			heap_close(partrel, NoLock);
@@ -2062,6 +2068,77 @@ error_exit:
 }
 
 /*
+ * For each column of rel which is in the partition key or which appears
+ * in an expression which is in the partition key, translate the attribute
+ * number of that column according to the given parent, and add the resulting
+ * column number to the bitmapset, offset as we frequently do by
+ * FirstLowInvalidHeapAttributeNumber.
+ */
+void
+pull_child_partition_columns(Bitmapset **bitmapset,
+							 Relation rel,
+							 Relation parent)
+{
+	PartitionKey key = RelationGetPartitionKey(rel);
+	int16		partnatts = get_partition_natts(key);
+	List	   *partexprs = get_partition_exprs(key);
+	ListCell   *lc;
+	Bitmapset  *child_keycols = NULL;
+	int			i;
+	AttrNumber *map;
+	int			child_keycol = -1;
+
+	/*
+	 * First, compute the complete set of partition columns for this rel. For
+	 * compatibility with the API exposed by pull_varattnos, we offset the
+	 * column numbers by FirstLowInvalidHeapAttributeNumber.
+	 */
+	for (i = 0; i < partnatts; i++)
+	{
+		AttrNumber	partattno = get_partition_col_attnum(key, i);
+
+		if (partattno != 0)
+			child_keycols =
+				bms_add_member(child_keycols,
+							   partattno - FirstLowInvalidHeapAttributeNumber);
+	}
+	foreach(lc, partexprs)
+	{
+		Node	   *expr = (Node *) lfirst(lc);
+
+		pull_varattnos(expr, 1, &child_keycols);
+	}
+
+	/*
+	 * Next, work out how to convert from the attribute numbers for the child
+	 * to the attribute numbers for the parent.
+	 */
+	map =
+		convert_tuples_by_name_map(RelationGetDescr(parent),
+								   RelationGetDescr(rel),
+								   gettext_noop("could not convert row type"));
+
+	/*
+	 * For each child key column we have identified, translate to the
+	 * corresponding parent key column.  Entry 0 in the map array corresponds
+	 * to attribute number 1, which corresponds to a bitmapset entry for 1 -
+	 * FirstLowInvalidHeapAttributeNumber.
+	 */
+	while ((child_keycol = bms_next_member(child_keycols, child_keycol)) >= 0)
+	{
+		int			kc = child_keycol + FirstLowInvalidHeapAttributeNumber;
+
+		Assert(kc > 0 && kc <= RelationGetNumberOfAttributes(rel));
+		*bitmapset =
+			bms_add_member(*bitmapset,
+						   map[kc - 1] - FirstLowInvalidHeapAttributeNumber);
+	}
+
+	/* Release memory. */
+	pfree(map);
+}
+
+/*
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums
@@ -2328,3 +2405,84 @@ partition_bound_bsearch(PartitionKey key, PartitionBoundInfo boundinfo,
 
 	return lo;
 }
+
+/*
+ * partition_walker_init
+ *
+ * Using the passed partitioned relation, expand it into its partitions using
+ * its partition descriptor, and make a partition rel list out of those. The
+ * rel passed in itself is not kept part of the partition list. The caller
+ * should handle the first rel separately before calling this function.
+ */
+void
+partition_walker_init(PartitionWalker * walker, Relation rel)
+{
+	memset(walker, 0, sizeof(PartitionWalker));
+
+	walker->rels_list = append_child_partitions(walker->rels_list, rel);
+
+	/* Assign the first one as the current partition cell */
+	walker->cur_cell = list_head(walker->rels_list);
+}
+
+/*
+ * partition_walker_next
+ *
+ * Get the next partition in the partition tree.
+ * At the same time, if the partition is a partitioned table, append its
+ * children at the end, so that the next time we can traverse through these.
+ */
+Relation
+partition_walker_next(PartitionWalker * walker, Relation *parent)
+{
+	ChildPartitionInfo *pc;
+	Relation	partrel;
+
+	if (walker->cur_cell == NULL)
+		return NULL;
+
+	pc = (ChildPartitionInfo *) lfirst(walker->cur_cell);
+	if (parent)
+		*parent = pc->parent;
+
+	/* Open partrel without locking; find_all_inheritors() has locked it */
+	partrel = heap_open(pc->reloid, NoLock);
+
+	/*
+	 * Append the children of partrel to the same list that we are iterating
+	 * on.
+	 */
+	walker->rels_list = append_child_partitions(walker->rels_list, partrel);
+
+	/* Bump the cur_cell here at the end, because above, we modify the list */
+	walker->cur_cell = lnext(walker->cur_cell);
+
+	return partrel;
+}
+
+/*
+ * append_child_partitions
+ *
+ * Append OIDs of rel's partitions to the list 'rel_list' and for each OID,
+ * also store parent rel.
+ */
+static List *
+append_child_partitions(List *rel_list, Relation rel)
+{
+	int			i;
+	PartitionDescData *partdesc = RelationGetPartitionDesc(rel);
+
+	/* If it's not a partitioned table, we have nothing to append */
+	if (!partdesc)
+		return rel_list;
+
+	for (i = 0; i < partdesc->nparts; i++)
+	{
+		ChildPartitionInfo *pc = palloc(sizeof(ChildPartitionInfo));
+
+		pc->parent = rel;
+		pc->reloid = rel->rd_partdesc->oids[i];
+		rel_list = lappend(rel_list, pc);
+	}
+	return rel_list;
+}
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 53e2965..6fb3ed6 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -168,7 +168,7 @@ typedef struct CopyStateData
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;	/* Number of entries in the above array */
 	int			num_partitions; /* Number of members in the following arrays */
-	ResultRelInfo *partitions;	/* Per partition result relation */
+	ResultRelInfo **partitions;	/* Per partition result relation pointers */
 	TupleConversionMap **partition_tupconv_maps;
 	TupleTableSlot *partition_tuple_slot;
 	TransitionCaptureState *transition_capture;
@@ -1426,13 +1426,15 @@ BeginCopy(ParseState *pstate,
 		if (is_from && rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 		{
 			PartitionDispatch *partition_dispatch_info;
-			ResultRelInfo *partitions;
+			ResultRelInfo **partitions;
 			TupleConversionMap **partition_tupconv_maps;
 			TupleTableSlot *partition_tuple_slot;
 			int			num_parted,
 						num_partitions;
 
 			ExecSetupPartitionTupleRouting(rel,
+										   NULL,
+										   0,
 										   1,
 										   &partition_dispatch_info,
 										   &partitions,
@@ -1462,7 +1464,7 @@ BeginCopy(ParseState *pstate,
 				for (i = 0; i < cstate->num_partitions; ++i)
 				{
 					cstate->transition_tupconv_maps[i] =
-						convert_tuples_by_name(RelationGetDescr(cstate->partitions[i].ri_RelationDesc),
+						convert_tuples_by_name(RelationGetDescr(cstate->partitions[i]->ri_RelationDesc),
 											   RelationGetDescr(rel),
 											   gettext_noop("could not convert row type"));
 				}
@@ -2609,7 +2611,7 @@ CopyFrom(CopyState cstate)
 			 * to the selected partition.
 			 */
 			saved_resultRelInfo = resultRelInfo;
-			resultRelInfo = cstate->partitions + leaf_part_index;
+			resultRelInfo = cstate->partitions[leaf_part_index];
 
 			/* We do not yet have a way to insert into a foreign partition */
 			if (resultRelInfo->ri_FdwRoutine)
@@ -2718,7 +2720,7 @@ CopyFrom(CopyState cstate)
 
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr || check_partition_constr)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
@@ -2838,7 +2840,7 @@ CopyFrom(CopyState cstate)
 		}
 		for (i = 0; i < cstate->num_partitions; i++)
 		{
-			ResultRelInfo *resultRelInfo = cstate->partitions + i;
+			ResultRelInfo *resultRelInfo = cstate->partitions[i];
 
 			ExecCloseIndices(resultRelInfo);
 			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index b502941..2e2bec8 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -67,15 +67,6 @@ int			SessionReplicationRole = SESSION_REPLICATION_ROLE_ORIGIN;
 /* How many levels deep into trigger execution are we? */
 static int	MyTriggerDepth = 0;
 
-/*
- * Note that similar macros also exist in executor/execMain.c.  There does not
- * appear to be any good header to put them into, given the structures that
- * they use, so we let them be duplicated.  Be sure to update all if one needs
- * to be changed, however.
- */
-#define GetUpdatedColumns(relinfo, estate) \
-	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)
-
 /* Local function prototypes */
 static void ConvertTriggerToFK(CreateTrigStmt *stmt, Oid funcoid);
 static void SetTriggerFlags(TriggerDesc *trigdesc, Trigger *trigger);
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index c11aa4f..fc7d3ed 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -64,6 +64,18 @@
 #include "utils/snapmgr.h"
 #include "utils/tqual.h"
 
+/*
+ * Entry of a temporary hash table. During UPDATE tuple routing, we want to
+ * know which of the leaf partitions are present in the UPDATE per-subplan
+ * resultRelInfo array (ModifyTableState->resultRelInfo[]). This hash table
+ * is searchable by the oids of the subplan result rels.
+ */
+typedef struct ResultRelOidsEntry
+{
+	Oid			rel_oid;
+	ResultRelInfo *resultRelInfo;
+} ResultRelOidsEntry;
+
 
 /* Hooks for plugins to get control in ExecutorStart/Run/Finish/End */
 ExecutorStart_hook_type ExecutorStart_hook = NULL;
@@ -103,19 +115,6 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
 									 int maxfieldlen);
 static void EvalPlanQualStart(EPQState *epqstate, EState *parentestate,
 				  Plan *planTree);
-static void ExecPartitionCheck(ResultRelInfo *resultRelInfo,
-				   TupleTableSlot *slot, EState *estate);
-
-/*
- * Note that GetUpdatedColumns() also exists in commands/trigger.c.  There does
- * not appear to be any good header to put it into, given the structures that
- * it uses, so we let them be duplicated.  Be sure to update both if one needs
- * to be changed, however.
- */
-#define GetInsertedColumns(relinfo, estate) \
-	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->insertedCols)
-#define GetUpdatedColumns(relinfo, estate) \
-	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)
 
 /* end of local decls */
 
@@ -1823,15 +1822,10 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
 /*
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  */
-static void
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
-	Relation	rel = resultRelInfo->ri_RelationDesc;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	Bitmapset  *modifiedCols;
-	Bitmapset  *insertedCols;
-	Bitmapset  *updatedCols;
 	ExprContext *econtext;
 
 	/*
@@ -1859,52 +1853,66 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
 	 */
-	if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+	return ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext);
+}
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
+
+	/* See the comments in ExecConstraints. */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
+									 gettext_noop("could not convert row type"));
+		if (map != NULL)
 		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
-										 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecSetSlotDescriptor(slot, tupdesc);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+			tuple = do_convert_tuple(tuple, map);
+			ExecSetSlotDescriptor(slot, tupdesc);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("new row for relation \"%s\" violates partition constraint",
-						RelationGetRelationName(orig_rel)),
-				 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+			 errmsg("new row for relation \"%s\" violates partition constraint",
+					RelationGetRelationName(orig_rel)),
+			 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if requested,
+ * checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1912,7 +1920,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2027,8 +2036,9 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 }
 
 
@@ -3213,10 +3223,14 @@ EvalPlanQualEnd(EPQState *epqstate)
  * ExecSetupPartitionTupleRouting - set up information needed during
  * tuple routing for partitioned tables
  *
+ * 'update_rri' contains the UPDATE per-subplan result rels.
+ * 'num_update_rri' is the number of UPDATE per-subplan result rels; for
+ *      INSERT, this is 0.
+ *
  * Output arguments:
  * 'pd' receives an array of PartitionDispatch objects with one entry for
  *		every partitioned table in the partition tree
- * 'partitions' receives an array of ResultRelInfo objects with one entry for
+ * 'partitions' receives an array of ResultRelInfo pointers with one entry for
  *		every leaf partition in the partition tree
  * 'tup_conv_maps' receives an array of TupleConversionMap objects with one
  *		entry for every leaf partition (required to convert input tuple based
@@ -3236,9 +3250,11 @@ EvalPlanQualEnd(EPQState *epqstate)
  */
 void
 ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   PartitionDispatch **pd,
-							   ResultRelInfo **partitions,
+							   ResultRelInfo ***partitions,
 							   TupleConversionMap ***tup_conv_maps,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions)
@@ -3247,17 +3263,45 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
+	ResultRelInfo *leaf_part_arr;
+	ResultRelInfo *cur_update_rri;
+	Oid			cur_reloid = InvalidOid;
 
 	/* Get the tuple-routing information and lock partitions */
 	*pd = RelationGetPartitionDispatchInfo(rel, RowExclusiveLock, num_parted,
 										   &leaf_parts);
 	*num_partitions = list_length(leaf_parts);
-	*partitions = (ResultRelInfo *) palloc(*num_partitions *
-										   sizeof(ResultRelInfo));
+	*partitions = (ResultRelInfo **) palloc(*num_partitions *
+										   sizeof(ResultRelInfo*));
 	*tup_conv_maps = (TupleConversionMap **) palloc0(*num_partitions *
 													 sizeof(TupleConversionMap *));
 
+	if (num_update_rri != 0)
+	{
+		/*
+		 * For UPDATEs, if a leaf partition is already present in the
+		 * per-subplan result rels, we re-use it rather than initializing a
+		 * new result rel. The per-subplan result rels and the leaf partition
+		 * result rels are both in the same canonical order, so while walking
+		 * the leaf partition OIDs we keep track of the next per-subplan
+		 * result rel to look for. Position cur_update_rri at the first
+		 * per-subplan result rel, and advance it each time a match is found
+		 * while scanning the leaf partition OIDs.
+		 */
+		cur_update_rri = update_rri;
+		cur_reloid = RelationGetRelid(cur_update_rri->ri_RelationDesc);
+	}
+	else
+	{
+		/*
+		 * For INSERTs, we need to create all-new result rels, so avoid
+		 * repeated pallocs by allocating memory for all the result rels in bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(*num_partitions *
+												  sizeof(ResultRelInfo));
+	}
+
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
 	 * given partition's rowtype.  It is attached to the caller-specified node
@@ -3266,23 +3310,75 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	 */
 	*partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = *partitions;
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (num_update_rri != 0)
+		{
+			/* Is this leaf partition present in the UPDATE result rels? */
+			if (cur_reloid == leaf_oid)
+			{
+				Assert(cur_update_rri <= update_rri + num_update_rri - 1);
+
+				leaf_part_rri = cur_update_rri;
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required when converting tuple as per root partition
+				 * tuple descriptor. When generating the update plans, this was
+				 * not set.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+
+				cur_update_rri++;
+				/*
+				 * If this was the last UPDATE resultrel, indicate that by
+				 * invalidating the cur_reloid.
+				 */
+				if (cur_update_rri == update_rri + num_update_rri)
+					cur_reloid = InvalidOid;
+				else
+					cur_reloid = RelationGetRelid(cur_update_rri->ri_RelationDesc);
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = leaf_part_arr + i;
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in *partitions are eventually
-		 * closed by the caller.
+		 * If we didn't open the partition rel, it means we haven't
+		 * initialized the result rel either.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above including the leaf partitions.
+			 * Note that each of the newly opened relations in *partitions are
+			 * eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri,
+							  partrel,
+							  resultRTindex,
+							  rel,
+							  0);
+		}
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
-		 * Verify result relation is a valid target for the current operation.
+		 * Verify that the result relation is a valid target for an INSERT.
+		 * Even for UPDATEs, we set it up here for tuple routing, so it must
+		 * be valid as an INSERT target.
 		 */
 		CheckValidResultRel(partrel, CMD_INSERT);
 
@@ -3293,12 +3389,6 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
 													 gettext_noop("could not convert row type"));
 
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  resultRTindex,
-						  rel,
-						  0);
-
 		/*
 		 * Open partition indices (remember we do not support ON CONFLICT in
 		 * case of partitioned tables, so we do not need support information
@@ -3308,9 +3398,18 @@ ExecSetupPartitionTupleRouting(Relation rel,
 			leaf_part_rri->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(leaf_part_rri, false);
 
-		leaf_part_rri++;
+		(*partitions)[i] = leaf_part_rri;
 		i++;
 	}
+
+	/*
+	 * For UPDATE, we should have found all the per-subplan result rels among
+	 * the leaf partitions, so cur_update_rri should now be positioned just
+	 * past the last per-subplan result rel.
+	 */
+	Assert(num_update_rri == 0 ||
+		   (cur_reloid == InvalidOid &&
+			cur_update_rri == update_rri + num_update_rri));
 }
 
 /*
@@ -3336,8 +3435,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 	result = get_partition_for_tuple(pd, slot, estate,
 									 &failed_at, &failed_slot);
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 3819de2..7cb1c2c 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -401,7 +401,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -466,7 +466,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 30add8e..48708cf 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -45,6 +45,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -53,7 +54,6 @@
 #include "utils/rel.h"
 #include "utils/tqual.h"
 
-
 static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 ResultRelInfo *resultRelInfo,
 					 ItemPointer conflictTid,
@@ -239,6 +239,34 @@ ExecCheckTIDVisible(EState *estate,
 	ReleaseBuffer(buffer);
 }
 
+/*
+ * ConvertPartitionTupleSlot -- convenience function for converting a tuple
+ * and storing it in the dedicated partition tuple slot. The partition tuple
+ * slot is passed back through the output parameter p_slot. If no conversion
+ * map is present, the tuple and p_slot are left unchanged.
+ *
+ * Returns the converted tuple.
+ */
+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate, TupleConversionMap *map,
+						  HeapTuple tuple, TupleTableSlot **p_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor, as per converted tuple.
+	 */
+	*p_slot = mtstate->mt_partition_tuple_slot;
+	Assert(*p_slot != NULL);
+	ExecSetSlotDescriptor(*p_slot, map->outdesc);
+	ExecStoreTuple(tuple, *p_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
 /* ----------------------------------------------------------------
  *		ExecInsert
  *
@@ -280,7 +308,38 @@ ExecInsert(ModifyTableState *mtstate,
 	if (mtstate->mt_partition_dispatch_info)
 	{
 		int			leaf_part_index;
-		TupleConversionMap *map;
+		ResultRelInfo *rootResultRelInfo;
+
+		/*
+		 * If the original operation is UPDATE, the root partition rel needs
+		 * to be fetched from mtstate->rootResultRelInfo.
+		 */
+		rootResultRelInfo = (mtstate->rootResultRelInfo ?
+							 mtstate->rootResultRelInfo : resultRelInfo);
+
+		/*
+		 * If the resultRelInfo is not the root partition (which happens for
+		 * UPDATE), we should convert the tuple into the root partition's
+		 * tuple descriptor, since ExecFindPartition() starts the search from
+		 * the root. The tuple conversion map list is in the order of
+		 * mtstate->resultRelInfo[], so to retrieve the one for this
+		 * resultRel we need to know its position in mtstate->resultRelInfo[].
+		 * Note: we assume that if the resultRelInfo does not belong to the
+		 * subplans, then it already matches the root tuple descriptor,
+		 * though no such scenario is currently known.
+		 */
+		if (rootResultRelInfo != resultRelInfo &&
+			mtstate->mt_resultrel_maps != NULL &&
+			resultRelInfo >= mtstate->resultRelInfo &&
+			resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1)
+		{
+			int			map_index = resultRelInfo - mtstate->resultRelInfo;
+
+			tuple = ConvertPartitionTupleSlot(mtstate,
+											  mtstate->mt_resultrel_maps[map_index],
+											  tuple, &slot);
+		}
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
@@ -290,7 +349,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
-		leaf_part_index = ExecFindPartition(resultRelInfo,
+		leaf_part_index = ExecFindPartition(rootResultRelInfo,
 											mtstate->mt_partition_dispatch_info,
 											slot,
 											estate);
@@ -302,7 +361,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * the selected partition.
 		 */
 		saved_resultRelInfo = resultRelInfo;
-		resultRelInfo = mtstate->mt_partitions + leaf_part_index;
+		resultRelInfo = mtstate->mt_partitions[leaf_part_index];
 
 		/* We do not yet have a way to insert into a foreign partition */
 		if (resultRelInfo->ri_FdwRoutine)
@@ -347,23 +406,9 @@ ExecInsert(ModifyTableState *mtstate,
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = mtstate->mt_partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = mtstate->mt_partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(mtstate,
+										  mtstate->mt_partition_tupconv_maps[leaf_part_index],
+										  tuple, &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -481,7 +526,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 		/* Check the constraints of the tuple */
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -673,6 +718,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool *delete_skipped,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -681,6 +728,9 @@ ExecDelete(ModifyTableState *mtstate,
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
 
+	if (delete_skipped)
+		*delete_skipped = true;
+
 	/*
 	 * get information on the (current) result relation
 	 */
@@ -844,12 +894,16 @@ ldelete:;
 	if (canSetTag)
 		(estate->es_processed)++;
 
+	/* The delete actually happened, so inform the caller */
+	if (delete_skipped)
+		*delete_skipped = false;
+
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
 						 mtstate->mt_transition_capture);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -1038,12 +1092,66 @@ lreplace:;
 								 resultRelInfo, slot, estate);
 
 		/*
+		 * If a partition check fails, try to move the row into the right
+		 * partition.
+		 */
+		if (resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate))
+		{
+			bool		delete_skipped;
+
+			/*
+			 * When an UPDATE is run directly on a leaf partition, there is no
+			 * partition tuple routing set up. In that case, fail with a
+			 * partition constraint violation error.
+			 */
+			if (mtstate->mt_partition_dispatch_info == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for the DELETE; the rows to return
+			 * come from the INSERT instead.
+			 */
+			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
+					   &delete_skipped, false, false);
+
+			/*
+			 * If for some reason the DELETE didn't happen (e.g. a trigger
+			 * prevented it, the row was already deleted by this command, or
+			 * it was concurrently deleted by another transaction), then we
+			 * should skip the INSERT as well; otherwise we would effectively
+			 * insert one extra new row.
+			 *
+			 * For a normal UPDATE, the case where the tuple has been the
+			 * subject of a concurrent UPDATE or DELETE would be handled by
+			 * the EvalPlanQual machinery, but for an UPDATE that we've
+			 * translated into a DELETE from this partition and an INSERT into
+			 * some other partition, that's not available, because CTID chains
+			 * can't span relation boundaries.  We mimic the semantics to a
+			 * limited extent by skipping the INSERT if the DELETE fails to
+			 * find a tuple. This ensures that two concurrent attempts to
+			 * UPDATE the same tuple at the same time can't turn one tuple
+			 * into two, and that an UPDATE of a just-deleted tuple can't
+			 * resurrect it.
+			 */
+			if (delete_skipped)
+				return NULL;
+
+			return ExecInsert(mtstate, slot, planSlot, NULL,
+							  ONCONFLICT_NONE, estate, canSetTag);
+		}
+
+		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We have already checked partition constraints above, so skip them
+		 * below.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1482,23 +1590,22 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	 */
 	if (mtstate->mt_transition_capture != NULL)
 	{
-		ResultRelInfo *resultRelInfos;
-		int		numResultRelInfos;
+		ResultRelInfo *resultRelInfo;
+		int			numResultRelInfos;
+		bool		tuple_routing = (mtstate->mt_partition_dispatch_info != NULL);
 
 		/* Find the set of partitions so that we can find their TupleDescs. */
-		if (mtstate->mt_partition_dispatch_info != NULL)
+		if (tuple_routing)
 		{
 			/*
 			 * For INSERT via partitioned table, so we need TupleDescs based
 			 * on the partition routing table.
 			 */
-			resultRelInfos = mtstate->mt_partitions;
 			numResultRelInfos = mtstate->mt_num_partitions;
 		}
 		else
 		{
 			/* Otherwise we need the ResultRelInfo for each subplan. */
-			resultRelInfos = mtstate->resultRelInfo;
 			numResultRelInfos = mtstate->mt_nplans;
 		}
 
@@ -1512,8 +1619,15 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
 		for (i = 0; i < numResultRelInfos; ++i)
 		{
+			/*
+			 * As noted above, the source of the mapping differs depending on
+			 * whether tuple routing is in use.
+			 */
+			resultRelInfo = (tuple_routing ?
+							 mtstate->mt_partitions[i] : &mtstate->resultRelInfo[i]);
+
 			mtstate->mt_transition_tupconv_maps[i] =
-				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
+				convert_tuples_by_name(RelationGetDescr(resultRelInfo->ri_RelationDesc),
 									   RelationGetDescr(targetRelInfo->ri_RelationDesc),
 									   gettext_noop("could not convert row type"));
 		}
@@ -1749,7 +1863,8 @@ ExecModifyTable(PlanState *pstate)
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1794,9 +1909,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ResultRelInfo *resultRelInfo;
 	TupleDesc	tupDesc;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	bool		update_tuple_routing_needed = node->part_cols_updated;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -1869,6 +1987,15 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(resultRelInfo, mtstate->mt_onconflict != ONCONFLICT_NONE);
 
+		/*
+		 * If this is an UPDATE and a BEFORE UPDATE trigger is present, we may
+		 * need to do update tuple routing, because the trigger might change
+		 * the row's partition key.
+		 */
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+			operation == CMD_UPDATE)
+			update_tuple_routing_needed = true;
+
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
 		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
@@ -1906,18 +2033,28 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/* Decide whether we need to perform update tuple routing. */
+	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		update_tuple_routing_needed = false;
+
+	/*
+	 * Build state for tuple routing if it's an INSERT or if it's an UPDATE of
+	 * partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || update_tuple_routing_needed))
 	{
 		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo *partitions;
+		ResultRelInfo **partitions;
 		TupleConversionMap **partition_tupconv_maps;
 		TupleTableSlot *partition_tuple_slot;
 		int			num_parted,
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(rel,
+									   (operation == CMD_UPDATE ?
+										mtstate->resultRelInfo : NULL),
+									   (operation == CMD_UPDATE ? nplans : 0),
 									   node->nominalRelation,
 									   &partition_dispatch_info,
 									   &partitions,
@@ -1930,6 +2067,44 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		mtstate->mt_num_partitions = num_partitions;
 		mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
 		mtstate->mt_partition_tuple_slot = partition_tuple_slot;
+
+		/*
+		 * The following are needed as reference objects for mapping partition
+		 * attnos in expressions such as WCO and RETURNING lists.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
+	}
+
+	/*
+	 * Construct mapping from each of the resultRelInfo attnos to the root
+	 * attno. This is required when during update row movement the tuple
+	 * descriptor of a source partition does not match the root partition
+ * descriptor. In such a case we need to convert tuples to the root
+	 * partition tuple descriptor, because the search for destination
+	 * partition starts from the root. Skip this setup if it's not a partition
+	 * key update or if there are no partitions below this partitioned table.
+	 */
+	if (update_tuple_routing_needed && mtstate->mt_num_partitions > 0)
+	{
+		TupleConversionMap **tup_conv_maps;
+		TupleDesc	outdesc;
+
+		mtstate->mt_resultrel_maps =
+			(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) * nplans);
+
+		/* Get tuple descriptor of the root partition. */
+		outdesc = RelationGetDescr(mtstate->mt_partition_dispatch_info[0]->reldesc);
+
+		resultRelInfo = mtstate->resultRelInfo;
+		tup_conv_maps = mtstate->mt_resultrel_maps;
+		for (i = 0; i < nplans; i++)
+		{
+			TupleDesc	indesc = RelationGetDescr(resultRelInfo[i].ri_RelationDesc);
+
+			tup_conv_maps[i] = convert_tuples_by_name(indesc, outdesc,
+													  gettext_noop("could not convert row type"));
+		}
 	}
 
 	/* Build state for collecting transition tuples */
@@ -1965,50 +2140,54 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE row
+	 * movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *firstWco;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition. For UPDATE, by contrast, there are as many WCO lists as
+		 * there are plans. In either case, use the WCO expression of the
+		 * first resultRelInfo as a reference to calculate attnos for the WCO
+		 * expression of each of the partitions. We make a copy of the WCO
+		 * qual for each partition. Note that, if there are SubPlans in there,
+		 * they all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
-			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
-		resultRelInfo = mtstate->mt_partitions;
+		Assert(update_tuple_routing_needed ||
+			   (operation == CMD_INSERT &&
+				list_length(node->withCheckOptionLists) == 1 &&
+				mtstate->mt_nplans == 1));
+
+		firstWco = linitial(node->withCheckOptionLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-			List	   *mapped_wcoList;
+			Relation	partrel;
+			List	   *mappedWco;
 			List	   *wcoExprs = NIL;
 			ListCell   *ll;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel, NULL);
-			foreach(ll, mapped_wcoList)
+			resultRelInfo = mtstate->mt_partitions[i];
+
+			partrel = resultRelInfo->ri_RelationDesc;
+			mappedWco = map_partition_varattnos(firstWco,
+												firstVarno,
+												partrel, firstResultRel,
+												NULL);
+			foreach(ll, mappedWco)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
 
-			resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
+			resultRelInfo->ri_WithCheckOptions = mappedWco;
 			resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
-			resultRelInfo++;
 		}
 	}
 
@@ -2019,7 +2198,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -2056,20 +2235,26 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		resultRelInfo = mtstate->mt_partitions;
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
+			Relation	partrel;
 			List	   *rlist;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel, NULL);
+			resultRelInfo = mtstate->mt_partitions[i];
+			partrel = resultRelInfo->ri_RelationDesc;
+
+			/*
+			 * Use the returning expression of the first resultRelInfo as a
+			 * reference to calculate attno's for the returning expression of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList,
+											firstVarno,
+											partrel, firstResultRel, NULL);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
@@ -2315,6 +2500,7 @@ void
 ExecEndModifyTable(ModifyTableState *node)
 {
 	int			i;
+	CmdType		operation = node->operation;
 
 	/* Free transition tables */
 	if (node->mt_transition_capture != NULL)
@@ -2351,7 +2537,17 @@ ExecEndModifyTable(ModifyTableState *node)
 	}
 	for (i = 0; i < node->mt_num_partitions; i++)
 	{
-		ResultRelInfo *resultRelInfo = node->mt_partitions + i;
+		ResultRelInfo *resultRelInfo = node->mt_partitions[i];
+
+		/*
+		 * If this result rel is one of the subplan result rels, let
+		 * ExecEndPlan() close it. For INSERTs, this does not apply, because
+		 * all leaf partition result rels are newly allocated anyway.
+		 */
+		if (operation == CMD_UPDATE &&
+			resultRelInfo >= node->resultRelInfo &&
+			resultRelInfo < node->resultRelInfo + node->mt_nplans)
+			continue;
 
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 45a04b0..4156e02 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -204,6 +204,7 @@ _copyModifyTable(const ModifyTable *from)
 	COPY_SCALAR_FIELD(canSetTag);
 	COPY_SCALAR_FIELD(nominalRelation);
 	COPY_NODE_FIELD(partitioned_rels);
+	COPY_SCALAR_FIELD(part_cols_updated);
 	COPY_NODE_FIELD(resultRelations);
 	COPY_SCALAR_FIELD(resultRelIndex);
 	COPY_SCALAR_FIELD(rootResultRelIndex);
@@ -2256,6 +2257,7 @@ _copyPartitionedChildRelInfo(const PartitionedChildRelInfo *from)
 
 	COPY_SCALAR_FIELD(parent_relid);
 	COPY_NODE_FIELD(child_rels);
+	COPY_BITMAPSET_FIELD(all_part_cols);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 8d92c03..f2df72b 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -909,6 +909,7 @@ _equalPartitionedChildRelInfo(const PartitionedChildRelInfo *a, const Partitione
 {
 	COMPARE_SCALAR_FIELD(parent_relid);
 	COMPARE_NODE_FIELD(child_rels);
+	COMPARE_BITMAPSET_FIELD(all_part_cols);
 
 	return true;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 379d92a..2ca8a71 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -367,6 +367,7 @@ _outModifyTable(StringInfo str, const ModifyTable *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_INT_FIELD(resultRelIndex);
 	WRITE_INT_FIELD(rootResultRelIndex);
@@ -2094,6 +2095,7 @@ _outModifyTablePath(StringInfo str, const ModifyTablePath *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_NODE_FIELD(subroots);
@@ -2516,6 +2518,7 @@ _outPartitionedChildRelInfo(StringInfo str, const PartitionedChildRelInfo *node)
 
 	WRITE_UINT_FIELD(parent_relid);
 	WRITE_NODE_FIELD(child_rels);
+	WRITE_BITMAPSET_FIELD(all_part_cols);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 86c811d..949053c 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1562,6 +1562,7 @@ _readModifyTable(void)
 	READ_BOOL_FIELD(canSetTag);
 	READ_UINT_FIELD(nominalRelation);
 	READ_NODE_FIELD(partitioned_rels);
+	READ_BOOL_FIELD(part_cols_updated);
 	READ_NODE_FIELD(resultRelations);
 	READ_INT_FIELD(resultRelIndex);
 	READ_INT_FIELD(rootResultRelIndex);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index f087ddb..064af0f 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1291,7 +1291,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	rte = planner_rt_fetch(rel->relid, root);
 	if (rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		partitioned_rels = get_partitioned_child_rels(root, rel->relid);
+		partitioned_rels = get_partitioned_child_rels(root, rel->relid, NULL);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
 	}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 5c934f2..fa270f8 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -277,6 +277,7 @@ static ProjectSet *make_project_set(List *tlist, Plan *subplan);
 static ModifyTable *make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
@@ -2357,6 +2358,7 @@ create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 							best_path->canSetTag,
 							best_path->nominalRelation,
 							best_path->partitioned_rels,
+							best_path->part_cols_updated,
 							best_path->resultRelations,
 							subplans,
 							best_path->withCheckOptionLists,
@@ -6399,6 +6401,7 @@ static ModifyTable *
 make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam)
@@ -6425,6 +6428,7 @@ make_modifytable(PlannerInfo *root,
 	node->canSetTag = canSetTag;
 	node->nominalRelation = nominalRelation;
 	node->partitioned_rels = partitioned_rels;
+	node->part_cols_updated = part_cols_updated;
 	node->resultRelations = resultRelations;
 	node->resultRelIndex = -1;	/* will be set correctly in setrefs.c */
 	node->rootResultRelIndex = -1;	/* will be set correctly in setrefs.c */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 2988c11..cf91907 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1042,6 +1042,7 @@ inheritance_planner(PlannerInfo *root)
 	Index		rti;
 	RangeTblEntry *parent_rte;
 	List	   *partitioned_rels = NIL;
+	bool		part_cols_updated = false;
 
 	Assert(parse->commandType != CMD_INSERT);
 
@@ -1356,9 +1357,15 @@ inheritance_planner(PlannerInfo *root)
 
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		partitioned_rels = get_partitioned_child_rels(root, parentRTindex);
+		Bitmapset  *all_part_cols = NULL;
+
+		partitioned_rels = get_partitioned_child_rels(root, parentRTindex,
+													  &all_part_cols);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
+
+		if (bms_overlap(all_part_cols, parent_rte->updatedCols))
+			part_cols_updated = true;
 	}
 
 	/* Result path must go into outer query's FINAL upperrel */
@@ -1415,6 +1422,7 @@ inheritance_planner(PlannerInfo *root)
 									 parse->canSetTag,
 									 nominalRelation,
 									 partitioned_rels,
+									 part_cols_updated,
 									 resultRelations,
 									 subpaths,
 									 subroots,
@@ -2032,6 +2040,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 										parse->canSetTag,
 										parse->resultRelation,
 										NIL,
+										false,
 										list_make1_int(parse->resultRelation),
 										list_make1(path),
 										list_make1(root),
@@ -6062,10 +6071,15 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
  *		Returns a list of the RT indexes of the partitioned child relations
  *		with rti as the root parent RT index.
  *
+ * If all_part_cols_p is non-NULL, *all_part_cols_p is set to a bitmapset
+ * of all partitioning columns used by the partitioned table or any
+ * descendant.
+ *
  * Note: Only call this function on RTEs known to be partitioned tables.
  */
 List *
-get_partitioned_child_rels(PlannerInfo *root, Index rti)
+get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   Bitmapset **all_part_cols_p)
 {
 	List	   *result = NIL;
 	ListCell   *l;
@@ -6077,6 +6091,8 @@ get_partitioned_child_rels(PlannerInfo *root, Index rti)
 		if (pc->parent_relid == rti)
 		{
 			result = pc->child_rels;
+			if (all_part_cols_p)
+				*all_part_cols_p = pc->all_part_cols;
 			break;
 		}
 	}
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index cf46b74..2e6fde7 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -33,6 +33,7 @@
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
+#include "catalog/partition.h"
 #include "catalog/pg_inherits_fn.h"
 #include "catalog/pg_type.h"
 #include "miscadmin.h"
@@ -99,6 +100,8 @@ static List *generate_append_tlist(List *colTypes, List *colCollations,
 static List *generate_setop_grouplist(SetOperationStmt *op, List *targetlist);
 static void expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte,
 						 Index rti);
+static Relation get_next_child(Relation oldrelation, ListCell **cell,
+						PartitionWalker *walker);
 static void make_inh_translation_list(Relation oldrelation,
 						  Relation newrelation,
 						  Index newvarno,
@@ -1370,13 +1373,16 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	Oid			parentOID;
 	PlanRowMark *oldrc;
 	Relation	oldrelation;
+	Relation	newrelation;
 	LOCKMODE	lockmode;
 	List	   *inhOIDs;
 	List	   *appinfos;
-	ListCell   *l;
+	ListCell   *oids_cell;
 	bool		need_append;
 	PartitionedChildRelInfo *pcinfo;
+	PartitionWalker walker;
 	List	   *partitioned_child_rels = NIL;
+	Bitmapset  *all_part_cols = NULL;
 
 	/* Does RT entry allow inheritance? */
 	if (!rte->inh)
@@ -1449,20 +1455,41 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	/* Scan the inheritance set and expand it */
 	appinfos = NIL;
 	need_append = false;
-	foreach(l, inhOIDs)
+	newrelation = oldrelation;
+
+	/* For non-partitioned result-rels, open the first child from inhOIDs */
+	if (oldrelation->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+	{
+		oids_cell = list_head(inhOIDs);
+		newrelation = get_next_child(oldrelation, &oids_cell, &walker);
+	}
+	else
 	{
-		Oid			childOID = lfirst_oid(l);
-		Relation	newrelation;
+		/*
+		 * For partitioned result rels, we don't need the inhOIDs list itself,
+		 * because we traverse the partition tree in canonical order anyway;
+		 * but we do want to lock all the children in a consistent order (see
+		 * find_inheritance_children) so as to avoid unnecessary deadlocks;
+		 * hence the call to find_all_inheritors() above. The aim is to
+		 * generate the appinfos in canonical order, so that the result rels,
+		 * if generated later, are in the same order as the leaf partitions
+		 * maintained during insert/update tuple routing. Keeping the same
+		 * order speeds up searching for a given leaf partition in these
+		 * result rels.
+		 */
+		list_free(inhOIDs);
+		inhOIDs = NIL;
+		partition_walker_init(&walker, oldrelation);
+	}
+
+	for (; newrelation != NULL;
+		   newrelation = get_next_child(oldrelation, &oids_cell, &walker))
+	{
+		Oid			childOID = RelationGetRelid(newrelation);
 		RangeTblEntry *childrte;
 		Index		childRTindex;
 		AppendRelInfo *appinfo;
 
-		/* Open rel if needed; we already have required locks */
-		if (childOID != parentOID)
-			newrelation = heap_open(childOID, NoLock);
-		else
-			newrelation = oldrelation;
-
 		/*
 		 * It is possible that the parent table has children that are temp
 		 * tables of other backends.  We cannot safely access such tables
@@ -1535,8 +1562,12 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 			}
 		}
 		else
+		{
 			partitioned_child_rels = lappend_int(partitioned_child_rels,
 												 childRTindex);
+			pull_child_partition_columns(&all_part_cols, newrelation,
+										 oldrelation);
+		}
 
 		/*
 		 * Build a PlanRowMark if parent is marked FOR UPDATE/SHARE.
@@ -1604,6 +1635,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 		Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
 		pcinfo->parent_relid = rti;
 		pcinfo->child_rels = partitioned_child_rels;
+		pcinfo->all_part_cols = all_part_cols;
 		root->pcinfo_list = lappend(root->pcinfo_list, pcinfo);
 	}
 
@@ -1612,6 +1644,44 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 }
 
 /*
+ * Get the next child in an inheritance tree.
+ *
+ * This function is used to traverse two different types of inheritance sets.
+ * If oldrelation is a partitioned table, 'walker' is valid and the traversal
+ * follows the canonical partition ordering; 'cell' is unused. Otherwise,
+ * 'cell' points to the current position in the list of inheritance children,
+ * which is already prepared and ordered according to the pg_inherits scan,
+ * and 'walker' is unused.
+ *
+ * oldrelation is the root relation of the inheritance tree; since it is
+ * already open, it is returned as-is instead of being re-opened.
+ */
+static Relation
+get_next_child(Relation oldrelation, ListCell **cell, PartitionWalker *walker)
+{
+	if (oldrelation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		return partition_walker_next(walker, NULL);
+	else
+	{
+		Oid		childOID;
+
+		if (!*cell)
+			return NULL; /* We are done with the list */
+
+		childOID = lfirst_oid(*cell);
+
+		/* Prepare to get the next child. */
+		*cell = lnext(*cell);
+
+		/* If it's the root relation, it is already open */
+		if (childOID != RelationGetRelid(oldrelation))
+			return heap_open(childOID, NoLock);
+		else
+			return oldrelation;
+	}
+}
+
+/*
  * make_inh_translation_list
  *	  Build the list of translations from parent Vars to child Vars for
  *	  an inheritance child.
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index f2d6385..f63edf4 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3161,6 +3161,8 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partitioned_rels' is an integer list of RT indexes of non-leaf tables in
  *		the partition tree, if this is an UPDATE/DELETE to a partitioned table.
  *		Otherwise NIL.
+ * 'part_cols_updated' is true if any partitioning columns are being updated,
+ *		either in the named relation or in a descendant partitioned table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
  * 'subpaths' is a list of Path(s) producing source data (one per rel)
  * 'subroots' is a list of PlannerInfo structs (one per rel)
@@ -3174,6 +3176,7 @@ ModifyTablePath *
 create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
@@ -3241,6 +3244,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->canSetTag = canSetTag;
 	pathnode->nominalRelation = nominalRelation;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
+	pathnode->part_cols_updated = part_cols_updated;
 	pathnode->resultRelations = resultRelations;
 	pathnode->subpaths = subpaths;
 	pathnode->subroots = subroots;
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 434ded3..e86c681 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -68,6 +68,12 @@ typedef struct PartitionDispatchData
 	int		   *indexes;
 } PartitionDispatchData;
 
+typedef struct PartitionWalker
+{
+	List	   *rels_list;
+	ListCell   *cur_cell;
+} PartitionWalker;
+
 typedef struct PartitionDispatchData *PartitionDispatch;
 
 extern void RelationBuildPartitionDesc(Relation relation);
@@ -79,12 +85,16 @@ extern void check_new_partition_bound(char *relname, Relation parent,
 extern Oid	get_partition_parent(Oid relid);
 extern List *get_qual_from_partbound(Relation rel, Relation parent,
 						PartitionBoundSpec *spec);
-extern List *map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+extern List *map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
 
+extern void partition_walker_init(PartitionWalker *walker, Relation rel);
+extern Relation partition_walker_next(PartitionWalker *walker,
+									  Relation *parent);
+
 /* For tuple routing */
 extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
 								 int lockmode, int *num_parted,
@@ -99,4 +109,8 @@ extern int get_partition_for_tuple(PartitionDispatch *pd,
 						EState *estate,
 						PartitionDispatchData **failed_at,
 						TupleTableSlot **failed_slot);
+extern void pull_child_partition_columns(Bitmapset **bitmapset,
+							 Relation rel,
+							 Relation parent);
+
 #endif							/* PARTITION_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 60326f9..6c58694 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,7 +187,10 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
@@ -207,9 +210,11 @@ extern void EvalPlanQualSetTuple(EPQState *epqstate, Index rti,
 					 HeapTuple tuple);
 extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
 extern void ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   PartitionDispatch **pd,
-							   ResultRelInfo **partitions,
+							   ResultRelInfo ***partitions,
 							   TupleConversionMap ***tup_conv_maps,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions);
@@ -217,6 +222,8 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot, EState *estate);
 
 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
 extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 35c28a6..6e41c86 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -508,6 +508,11 @@ typedef struct EState
 	struct dsa_area *es_query_dsa;
 } EState;
 
+/* For a given result relation, get its columns being inserted/updated. */
+#define GetInsertedColumns(relinfo, estate) \
+	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->insertedCols)
+#define GetUpdatedColumns(relinfo, estate) \
+	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)
 
 /*
  * ExecRowMark -
@@ -975,9 +980,13 @@ typedef struct ModifyTableState
 	int			mt_num_dispatch;	/* Number of entries in the above array */
 	int			mt_num_partitions;	/* Number of members in the following
 									 * arrays */
-	ResultRelInfo *mt_partitions;	/* Per partition result relation */
-	TupleConversionMap **mt_partition_tupconv_maps;
+	ResultRelInfo **mt_partitions;	/* Per partition result relation pointers */
+
 	/* Per partition tuple conversion map */
+	TupleConversionMap **mt_partition_tupconv_maps;
+	/* Per resultRelInfo conversion map to convert tuples to root partition */
+	TupleConversionMap **mt_resultrel_maps;
+
 	TupleTableSlot *mt_partition_tuple_slot;
 	struct TransitionCaptureState *mt_transition_capture;
 									/* controls transition table population */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f1a1b24..cd670b9 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -219,6 +219,7 @@ typedef struct ModifyTable
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	int			resultRelIndex; /* index of first resultRel in plan's list */
 	int			rootResultRelIndex; /* index of the partitioned table root */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 9bae3c6..3013964 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1579,6 +1579,7 @@ typedef struct ModifyTablePath
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	List	   *subpaths;		/* Path(s) producing source data */
 	List	   *subroots;		/* per-target-table PlannerInfos */
@@ -2019,6 +2020,10 @@ typedef struct AppendRelInfo
  * The child_rels list must contain at least one element, because the parent
  * partitioned table is itself counted as a child.
  *
+ * all_part_cols contains all attribute numbers from the parent that are
+ * used as partitioning columns by the parent or some descendant that is
+ * itself partitioned.
+ *
  * These structs are kept in the PlannerInfo node's pcinfo_list.
  */
 typedef struct PartitionedChildRelInfo
@@ -2027,6 +2032,7 @@ typedef struct PartitionedChildRelInfo
 
 	Index		parent_relid;
 	List	   *child_rels;
+	Bitmapset  *all_part_cols;
 } PartitionedChildRelInfo;
 
 /*
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 0c0549d..d35f448 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -235,6 +235,7 @@ extern ModifyTablePath *create_modifytable_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index 2a4cf71..c6c15c5 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -57,6 +57,7 @@ extern Expr *preprocess_phv_expression(PlannerInfo *root, Expr *expr);
 
 extern bool plan_cluster_use_sort(Oid tableOid, Oid indexOid);
 
-extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti);
+extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti,
+										Bitmapset **all_part_cols_p);
 
 #endif							/* PLANNER_H */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 9366f04..be9c571 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,25 +198,217 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If the partition key is updated, the row should be moved to the appropriate
+-- partition. Updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c int
 ) partition by range (a, b);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
 create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
 create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+create table part_c_1_100 (b int, c int, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+create table part_c_100_200 (c int, a text, b int);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
 insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+insert into part_a_10_a_20 values ('a', 10, 200);
+insert into part_c_1_100 (a, b, c) values ('b', 12, 96);
+insert into part_c_1_100 (a, b, c) values ('b', 13, 97);
+insert into part_c_100_200 (a, b, c) values ('b', 15, 105);
+insert into part_c_100_200 (a, b, c) values ('b', 17, 105);
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96;
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (12, 116, b).
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+select a, b, c from part_c_1_100 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select a, b, c from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 12 | 116
+ b | 13 | 117
+ b | 15 | 125
+ b | 17 | 125
+(4 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (117, b, 7).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+select tableoid::regclass partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_1_a_10  | a |  1 |    
+ part_a_1_a_10  | a |  4 | 200
+ part_b_1_b_10  | b |  7 | 117
+ part_b_1_b_10  | b |  9 | 125
+ part_c_100_200 | b | 11 | 125
+ part_c_100_200 | b | 12 | 116
+(6 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120).
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+select tableoid::regclass partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_1_a_10  | a |  1 |    
+ part_b_1_b_10  | b |  7 | 117
+ part_b_1_b_10  | b |  9 | 125
+ part_c_100_200 | b | 11 | 125
+ part_c_100_200 | b | 12 | 116
+ part_c_100_200 | b | 15 | 199
+(6 rows)
+
 -- cleanup
-drop table range_parted;
+drop view upview;
+drop table mintab, range_parted;
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+-- Setup
+--------
+create table list_parted (a int, b int, c int) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+create table sub_part1(b int, c int, a int);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int, a int);
+alter table sub_parted attach partition sub_part2 for values in (2);
+create table list_part1(a int, b int, c int);
+alter table list_parted attach partition list_part1 for values in (2,3);
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+ERROR:  new row for relation "sub_part2" violates partition constraint
+DETAIL:  Failing row contains (2, 10, 2).
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Triggers can cause UPDATE row movement if it modified partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+select tableoid::regclass , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop trigger parted_mod_b ON sub_part1 ;
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop trigger trig_skip_delete ON sub_part1 ;
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 |  1 | 70
+ list_part1 | 2 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop function func_parted_mod_b ( ) ;
+drop table list_parted, non_parted;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 6637119..a150884 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,23 +107,146 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c int
 ) partition by range (a, b);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+
 create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
 create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
+
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
+create table part_c_1_100 (b int, c int, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+create table part_c_100_200 (c int, a text, b int);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+
+insert into part_a_1_a_10 values ('a', 1);
+insert into part_a_10_a_20 values ('a', 10, 200);
+insert into part_c_1_100 (a, b, c) values ('b', 12, 96);
+insert into part_c_1_100 (a, b, c) values ('b', 13, 97);
+insert into part_c_100_200 (a, b, c) values ('b', 15, 105);
+insert into part_c_100_200 (a, b, c) values ('b', 17, 105);
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96;
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+select a, b, c from part_c_1_100 order by 1, 2, 3;
+select a, b, c from part_c_100_200 order by 1, 2, 3;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+select tableoid::regclass partname, * from range_parted order by 1, 2, 3, 4;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+
+select tableoid::regclass partname, * from range_parted order by 1, 2, 3, 4;
 
 -- cleanup
-drop table range_parted;
+drop view upview;
+drop table mintab, range_parted;
+
+
+
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+
+-- Setup
+--------
+create table list_parted (a int, b int, c int) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+
+create table sub_part1(b int, c int, a int);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int, a int);
+alter table sub_parted attach partition sub_part2 for values in (2);
+
+create table list_part1(a int, b int, c int);
+alter table list_parted attach partition list_part1 for values in (2,3);
+
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass , * from list_parted where a = 2 order by 1;
+
+
+-----------
+-- Triggers can cause UPDATE row movement if it modified partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+select tableoid::regclass , * from list_parted order by 1, 2, 3, 4;
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger parted_mod_b ON sub_part1 ;
+
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger trig_skip_delete ON sub_part1 ;
+
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass , * from list_parted order by 1, 2, 3, 4;
+
+drop function func_parted_mod_b ( ) ;
+drop table list_parted, non_parted;
#140 Rajkumar Raghuwanshi
rajkumar.raghuwanshi@enterprisedb.com
In reply to: Amit Khandekar (#139)
2 attachment(s)
Re: UPDATE of partition key

On Fri, Aug 4, 2017 at 10:28 PM, Amit Khandekar <amitdkhan.pg@gmail.com>
wrote:

Below are the TODOs at this point:

Fix for bug reported by Rajkumar about update with join.

I had explained the root issue of this bug here : [1]

Attached patch includes the fix, which is explained below.

Hi Amit,

I have applied the v14 patch and tested from my side; everything looks good
to me. Attaching some test cases and an output file for reference.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

Attachments:

update_partition_test.out (application/octet-stream)
update_partition_test.sql (text/x-sql; charset=US-ASCII)
#141 Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#139)
1 attachment(s)
Re: UPDATE of partition key

On 4 August 2017 at 22:28, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Below are the TODOs at this point:

Do something about two separate mapping tables for Transition tables
and update tuple-routing.

On 1 July 2017 at 03:15, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

Would it make sense to have a set of functions with names like
GetConvertor{From,To}{Subplan,Leaf}(mtstate, index) which build arrays
m_convertors_{from,to}_by_{subplan,leaf} the first time they need
them?

This was discussed here: [2]. I think even if we have them built when
needed, still in presence of both tuple routing and transition tables,
we do need separate arrays. So I think rather than dynamic arrays, we
can have static arrays but their elements will point to a shared
TupleConversionMap structure whenever possible.
As already in the patch, in case of insert/update tuple routing, there
is a per-leaf partition mt_transition_tupconv_maps array for
transition tables, and a separate per-subplan array mt_resultrel_maps
for update tuple routing. *But*, what I am proposing is: for the
mt_transition_tupconv_maps[] element for which the leaf partition also
exists as a per-subplan result, that array element and the
mt_resultrel_maps[] element will point to the same TupleConversionMap
structure.

This is quite similar to how we are re-using the per-subplan
resultrels for the per-leaf result rels. We will re-use the
per-subplan TupleConversionMap for the per-leaf
mt_transition_tupconv_maps[] elements.

Not yet implemented this.

The attached patch has the above needed changes. Now we have following
map arrays in ModifyTableState. The earlier naming was confusing so I
renamed them.
mt_perleaf_parentchild_maps: used for converting insert/update-routed
tuples from the root to the destination leaf partition.
mt_perleaf_childparent_maps: used by transition tables for converting
tuples back from a leaf partition to the root.
mt_persubplan_childparent_maps: used by both transition tables and
update row movement, each for its own purpose during UPDATEs.

I also had to add another partition slot mt_rootpartition_tuple_slot
alongside mt_partition_tuple_slot. For update-row-movement, in
ExecInsert(), we used to have a common slot for root partition's tuple
as well as leaf partition tuple. So the former tuple was a transient
tuple. But mtstate->mt_transition_capture->tcs_original_insert_tuple
requires the tuple to be valid, so we could not pass a transient
tuple. Hence another partition slot.

-------

But in the first place, while testing transition table behaviour with
update row movement, I found out that the OLD TABLE and NEW TABLE
transition tables don't get populated with the rows that are moved to another
partition. This is because the operation is ExecDelete() and
ExecInsert(), which don't run the transition-related triggers for
updates. Even though transition-table-triggers are statement-level,
the AR ROW trigger-related functions like ExecARUpdateTriggers() do
get run for each row, so that the tables get populated; and they skip
the usual row-level trigger stuff. For update-row-movement, we need to
teach ExecARUpdateTriggers() to run the transition-related processing
for the DELETE+INSERT operation as well. But since delete and insert
happen on different tables, we cannot call ExecARUpdateTriggers() at a
single place. We need to call it once after ExecDelete() for loading
the OLD row, and then after ExecInsert() for loading the NEW row.
Also, currently ExecARUpdateTriggers() does not allow NULL old tuple
or new tuple, but we need to allow it for the above transition table
processing.

The attached patch has the above needed changes.

Use getASTriggerResultRelInfo() for attrno mapping, rather than first
resultrel, for generating child WCO/RETURNING expression.

Regarding generating child WithCheckOption and Returning expressions
using those of the root result relation, ModifyTablePath and
ModifyTable should have new fields rootReturningList (and
rootWithCheckOptions) which would be derived from
root->parse->returningList in inheritance_planner(). But then, similar
to per-subplan returningList, rootReturningList would have to pass
through set_plan_refs()=>set_returning_clause_references() which
requires the subplan targetlist to be passed. Because of this, for
rootReturningList, we require a subplan for root partition, which is
not there currently; we have subplans only for child rels. That means
we would have to create such a plan only for the sake of generating
rootReturningList.

The other option is to do the way the patch is currently doing in the
executor by using the returningList of the first per-subplan result
rel to generate the other child returningList (and WithCheckOption).
This is working by applying map_partition_varattnos() to the first
returningList. But now that we realized that we have to specially
handle whole-row vars, map_partition_varattnos() would need some
changes to convert whole row vars differently for
child-rel-to-child-rel mapping. For childrel-to-childrel conversion,
the whole-row var is already wrapped by ConvertRowtypeExpr, but we
need to change its Var->vartype to the new child vartype.

I think the second option looks easier, but I am open to suggestions,
and I am myself still checking the first one.

I have done the changes using the second option above. In the attached
patch, the same map_partition_varattnos() is called for child-to-child
mapping. But in such case, the source child partition already has
ConvertRowtypeExpr node, so another ConvertRowtypeExpr node is not
added; just the containing var node is updated with the new composite
type. In the regression test, I have included different types like
numeric, int, text for the partition key columns, so as to test the
same.

More test scenarios in regression tests.
Need to check/test whether we are correctly applying insert policies
(and not update) while inserting a routed tuple.

Yet to do the above two.

This is still to do.

Attachments:

update-partition-key_v15.patch (application/octet-stream)
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index b05a9c2..5a436a1 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -2993,6 +2993,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition where this row satisfies its partition constraint.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3285,9 +3290,20 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</> causes a row to move from one partition to
+       another, there is a chance that another concurrent <command>UPDATE</> or
+       <command>DELETE</> misses this row. Suppose, during the row movement,
+       the row is still visible for the concurrent session, and it is about to
+       do an <command>UPDATE</> or <command>DELETE</> operation on the same
+       row. This DML operation can silently miss this row if the row now gets
+       deleted from the partition by the first session as part of its
+       <command>UPDATE</> row movement. In such case, the concurrent
+       <command>UPDATE</>/<command>DELETE</>, being unaware of the row
+       movement, interprets that the row has just been deleted so there is
+       nothing to be done for this row. Whereas, in the usual case where the
+       table is not partitioned, or where there is no row movement, the second
+       session would have identified the newly updated row and carried
+       <command>UPDATE</>/<command>DELETE</> on this new row version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 8a1619f..28cfc1a 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</> and
+   <command>INSERT</> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</> or <command>DELETE</> on the same row may miss
+   this row. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index 950245d..72300a0 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -160,6 +160,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it will be performed as a
+    <command>DELETE</command> from the original partition followed by
+    <command>INSERT</command> into the new partition. In this case, all
+    row-level <literal>BEFORE</> <command>UPDATE</command> triggers and all
+    row-level <literal>BEFORE</> <command>DELETE</command> triggers are fired
+    on the original partition. Then all row-level <literal>BEFORE</>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</>
+    triggers are concerned, <literal>AFTER</> <command>DELETE</command> and
+    <literal>AFTER</> <command>INSERT</command> triggers are applied; but
+    <literal>AFTER</> <command>UPDATE</command> triggers are not applied
+    because the <command>UPDATE</command> has been converted to a
+    <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 0e4b343..642dff4 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -105,6 +105,16 @@ typedef struct PartitionRangeBound
 	bool		lower;			/* this is the lower (vs upper) bound */
 } PartitionRangeBound;
 
+/*
+ * List of these elements is prepared while traversing a partition tree,
+ * so as to get a consistent order of partitions.
+ */
+typedef struct ChildPartitionInfo
+{
+	Oid			reloid;
+	Relation	parent;			/* Parent relation of reloid */
+}			ChildPartitionInfo;
+
 static int32 qsort_partition_list_value_cmp(const void *a, const void *b,
 							   void *arg);
 static int32 qsort_partition_rbound_cmp(const void *a, const void *b,
@@ -140,6 +150,8 @@ static int partition_bound_bsearch(PartitionKey key,
 						PartitionBoundInfo boundinfo,
 						void *probe, bool probe_is_bound, bool *is_equal);
 
+static List *append_child_partitions(List *rel_list, Relation rel);
+
 /*
  * RelationBuildPartitionDesc
  *		Form rel's partition descriptor
@@ -899,7 +911,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
 
 /*
  * map_partition_varattnos - maps varattno of any Vars in expr from the
- * parent attno to partition attno.
+ * attno's of 'from_rel' partition to the attno's of 'to_rel' partition.
+ * The rels can be both leaf partition or a partitioned table.
  *
  * We must allow for cases where physical attnos of a partition can be
  * different from the parent's.
@@ -912,8 +925,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
  * are working on Lists, so it's less messy to do the casts internally.
  */
 List *
-map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row)
 {
 	AttrNumber *part_attnos;
@@ -922,14 +935,14 @@ map_partition_varattnos(List *expr, int target_varno,
 	if (expr == NIL)
 		return NIL;
 
-	part_attnos = convert_tuples_by_name_map(RelationGetDescr(partrel),
-											 RelationGetDescr(parent),
+	part_attnos = convert_tuples_by_name_map(RelationGetDescr(to_rel),
+											 RelationGetDescr(from_rel),
 											 gettext_noop("could not convert row type"));
 	expr = (List *) map_variable_attnos((Node *) expr,
-										target_varno, 0,
+										fromrel_varno, 0,
 										part_attnos,
-										RelationGetDescr(parent)->natts,
-										RelationGetForm(partrel)->reltype,
+										RelationGetDescr(from_rel)->natts,
+										RelationGetForm(to_rel)->reltype,
 										&my_found_whole_row);
 	if (found_whole_row)
 		*found_whole_row = my_found_whole_row;
@@ -982,21 +995,6 @@ get_partition_qual_relid(Oid relid)
 }
 
 /*
- * Append OIDs of rel's partitions to the list 'partoids' and for each OID,
- * append pointer rel to the list 'parents'.
- */
-#define APPEND_REL_PARTITION_OIDS(rel, partoids, parents) \
-	do\
-	{\
-		int		i;\
-		for (i = 0; i < (rel)->rd_partdesc->nparts; i++)\
-		{\
-			(partoids) = lappend_oid((partoids), (rel)->rd_partdesc->oids[i]);\
-			(parents) = lappend((parents), (rel));\
-		}\
-	} while(0)
-
-/*
  * RelationGetPartitionDispatchInfo
  *		Returns information necessary to route tuples down a partition tree
  *
@@ -1008,11 +1006,13 @@ PartitionDispatch *
 RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
 								 int *num_parted, List **leaf_part_oids)
 {
+	PartitionWalker walker;
 	PartitionDispatchData **pd;
-	List	   *all_parts = NIL,
-			   *all_parents = NIL,
-			   *parted_rels,
+	Relation	partrel;
+	Relation	parent;
+	List	   *parted_rels,
 			   *parted_rel_parents;
+	List	   *inhOIDs;
 	ListCell   *lc1,
 			   *lc2;
 	int			i,
@@ -1023,21 +1023,28 @@ RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
 	 * Lock partitions and make a list of the partitioned ones to prepare
 	 * their PartitionDispatch objects below.
 	 *
-	 * Cannot use find_all_inheritors() here, because then the order of OIDs
-	 * in parted_rels list would be unknown, which does not help, because we
-	 * assign indexes within individual PartitionDispatch in an order that is
-	 * predetermined (determined by the order of OIDs in individual partition
-	 * descriptors).
+	 * Must call find_all_inheritors() here so as to lock the partitions in a
+	 * consistent order (by oid values) to prevent deadlocks. But we assign
+	 * indexes within individual PartitionDispatch in a different order
+	 * (determined by the order of OIDs in individual partition descriptors).
+	 * So, rather than using the oids returned by find_all_inheritors(), we
+	 * generate canonically ordered oids using partition walker.
 	 */
+	inhOIDs = find_all_inheritors(RelationGetRelid(rel), lockmode, NULL);
+	list_free(inhOIDs);
+
+	partition_walker_init(&walker, rel);
+	parent = NULL;
 	*num_parted = 1;
 	parted_rels = list_make1(rel);
 	/* Root partitioned table has no parent, so NULL for parent */
 	parted_rel_parents = list_make1(NULL);
-	APPEND_REL_PARTITION_OIDS(rel, all_parts, all_parents);
-	forboth(lc1, all_parts, lc2, all_parents)
+
+	/* Go to the next partition */
+	partrel = partition_walker_next(&walker, &parent);
+
+	for (; partrel != NULL; partrel = partition_walker_next(&walker, &parent))
 	{
-		Relation	partrel = heap_open(lfirst_oid(lc1), lockmode);
-		Relation	parent = lfirst(lc2);
 		PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
 
 		/*
@@ -1049,7 +1056,6 @@ RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
 			(*num_parted)++;
 			parted_rels = lappend(parted_rels, partrel);
 			parted_rel_parents = lappend(parted_rel_parents, parent);
-			APPEND_REL_PARTITION_OIDS(partrel, all_parts, all_parents);
 		}
 		else
 			heap_close(partrel, NoLock);
@@ -2068,6 +2074,77 @@ error_exit:
 }
 
 /*
+ * For each column of rel which is in the partition key or which appears
+ * in an expression which is in the partition key, translate the attribute
+ * number of that column according to the given parent, and add the resulting
+ * column number to the bitmapset, offset as we frequently do by
+ * FirstLowInvalidHeapAttributeNumber.
+ */
+void
+pull_child_partition_columns(Bitmapset **bitmapset,
+							 Relation rel,
+							 Relation parent)
+{
+	PartitionKey key = RelationGetPartitionKey(rel);
+	int16		partnatts = get_partition_natts(key);
+	List	   *partexprs = get_partition_exprs(key);
+	ListCell   *lc;
+	Bitmapset  *child_keycols = NULL;
+	int			i;
+	AttrNumber *map;
+	int			child_keycol = -1;
+
+	/*
+	 * First, compute the complete set of partition columns for this rel. For
+	 * compatibility with the API exposed by pull_varattnos, we offset the
+	 * column numbers by FirstLowInvalidHeapAttributeNumber.
+	 */
+	for (i = 0; i < partnatts; i++)
+	{
+		AttrNumber	partattno = get_partition_col_attnum(key, i);
+
+		if (partattno != 0)
+			child_keycols =
+				bms_add_member(child_keycols,
+							   partattno - FirstLowInvalidHeapAttributeNumber);
+	}
+	foreach(lc, partexprs)
+	{
+		Node	   *expr = (Node *) lfirst(lc);
+
+		pull_varattnos(expr, 1, &child_keycols);
+	}
+
+	/*
+	 * Next, work out how to convert from the attribute numbers for the child
+	 * to the attribute numbers for the parent.
+	 */
+	map =
+		convert_tuples_by_name_map(RelationGetDescr(parent),
+								   RelationGetDescr(rel),
+								   gettext_noop("could not convert row type"));
+
+	/*
+	 * For each child key column we have identified, translate to the
+	 * corresponding parent key column.  Entry 0 in the map array corresponds
+	 * to attribute number 1, which corresponds to a bitmapset entry for 1 -
+	 * FirstLowInvalidHeapAttributeNumber.
+	 */
+	while ((child_keycol = bms_next_member(child_keycols, child_keycol)) >= 0)
+	{
+		int			kc = child_keycol + FirstLowInvalidHeapAttributeNumber;
+
+		Assert(kc > 0 && kc <= RelationGetNumberOfAttributes(rel));
+		*bitmapset =
+			bms_add_member(*bitmapset,
+						   map[kc - 1] - FirstLowInvalidHeapAttributeNumber);
+	}
+
+	/* Release memory. */
+	pfree(map);
+}
+
+/*
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums
@@ -2334,3 +2411,84 @@ partition_bound_bsearch(PartitionKey key, PartitionBoundInfo boundinfo,
 
 	return lo;
 }
+
+/*
+ * partition_walker_init
+ *
+ * Using the passed partitioned relation, expand it into its partitions using
+ * its partition descriptor, and make a partition rel list out of those. The
+ * rel passed in itself is not kept part of the partition list. The caller
+ * should handle the first rel separately before calling this function.
+ */
+void
+partition_walker_init(PartitionWalker * walker, Relation rel)
+{
+	memset(walker, 0, sizeof(PartitionWalker));
+
+	walker->rels_list = append_child_partitions(walker->rels_list, rel);
+
+	/* Assign the first one as the current partition cell */
+	walker->cur_cell = list_head(walker->rels_list);
+}
+
+/*
+ * partition_walker_next
+ *
+ * Get the next partition in the partition tree.
+ * At the same time, if the partition is a partitioned table, append its
+ * children at the end, so that the next time we can traverse through these.
+ */
+Relation
+partition_walker_next(PartitionWalker * walker, Relation *parent)
+{
+	ChildPartitionInfo *pc;
+	Relation	partrel;
+
+	if (walker->cur_cell == NULL)
+		return NULL;
+
+	pc = (ChildPartitionInfo *) lfirst(walker->cur_cell);
+	if (parent)
+		*parent = pc->parent;
+
+	/* Open partrel without locking; find_all_inheritors() has locked it */
+	partrel = heap_open(pc->reloid, NoLock);
+
+	/*
+	 * Append the children of partrel to the same list that we are iterating
+	 * on.
+	 */
+	walker->rels_list = append_child_partitions(walker->rels_list, partrel);
+
+	/* Bump the cur_cell here at the end, because above, we modify the list */
+	walker->cur_cell = lnext(walker->cur_cell);
+
+	return partrel;
+}
+
+/*
+ * append_child_partitions
+ *
+ * Append OIDs of rel's partitions to the list 'rel_list' and for each OID,
+ * also store parent rel.
+ */
+static List *
+append_child_partitions(List *rel_list, Relation rel)
+{
+	int			i;
+	PartitionDescData *partdesc = RelationGetPartitionDesc(rel);
+
+	/* If it's not a partitioned table, we have nothing to append */
+	if (!partdesc)
+		return rel_list;
+
+	for (i = 0; i < partdesc->nparts; i++)
+	{
+		ChildPartitionInfo *pc = palloc(sizeof(ChildPartitionInfo));
+
+		pc->parent = rel;
+		pc->reloid = rel->rd_partdesc->oids[i];
+		rel_list = lappend(rel_list, pc);
+	}
+	return rel_list;
+}
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 53e2965..6fb3ed6 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -168,7 +168,7 @@ typedef struct CopyStateData
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;	/* Number of entries in the above array */
 	int			num_partitions; /* Number of members in the following arrays */
-	ResultRelInfo *partitions;	/* Per partition result relation */
+	ResultRelInfo **partitions;	/* Per partition result relation pointers */
 	TupleConversionMap **partition_tupconv_maps;
 	TupleTableSlot *partition_tuple_slot;
 	TransitionCaptureState *transition_capture;
@@ -1426,13 +1426,15 @@ BeginCopy(ParseState *pstate,
 		if (is_from && rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 		{
 			PartitionDispatch *partition_dispatch_info;
-			ResultRelInfo *partitions;
+			ResultRelInfo **partitions;
 			TupleConversionMap **partition_tupconv_maps;
 			TupleTableSlot *partition_tuple_slot;
 			int			num_parted,
 						num_partitions;
 
 			ExecSetupPartitionTupleRouting(rel,
+										   NULL,
+										   0,
 										   1,
 										   &partition_dispatch_info,
 										   &partitions,
@@ -1462,7 +1464,7 @@ BeginCopy(ParseState *pstate,
 				for (i = 0; i < cstate->num_partitions; ++i)
 				{
 					cstate->transition_tupconv_maps[i] =
-						convert_tuples_by_name(RelationGetDescr(cstate->partitions[i].ri_RelationDesc),
+						convert_tuples_by_name(RelationGetDescr(cstate->partitions[i]->ri_RelationDesc),
 											   RelationGetDescr(rel),
 											   gettext_noop("could not convert row type"));
 				}
@@ -2609,7 +2611,7 @@ CopyFrom(CopyState cstate)
 			 * to the selected partition.
 			 */
 			saved_resultRelInfo = resultRelInfo;
-			resultRelInfo = cstate->partitions + leaf_part_index;
+			resultRelInfo = cstate->partitions[leaf_part_index];
 
 			/* We do not yet have a way to insert into a foreign partition */
 			if (resultRelInfo->ri_FdwRoutine)
@@ -2718,7 +2720,7 @@ CopyFrom(CopyState cstate)
 
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr || check_partition_constr)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
@@ -2838,7 +2840,7 @@ CopyFrom(CopyState cstate)
 		}
 		for (i = 0; i < cstate->num_partitions; i++)
 		{
-			ResultRelInfo *resultRelInfo = cstate->partitions + i;
+			ResultRelInfo *resultRelInfo = cstate->partitions[i];
 
 			ExecCloseIndices(resultRelInfo);
 			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
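
The copy.c hunks above track the change of cstate->partitions from a flat ResultRelInfo array to an array of pointers, so `cstate->partitions + i` becomes `cstate->partitions[i]`. A minimal sketch of why the pointer array matters (hypothetical RelInfo type, not the real executor struct): a pointer slot can alias a result rel that was initialized elsewhere, which a flat array cannot.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for ResultRelInfo; the real struct is much larger. */
typedef struct RelInfo
{
	unsigned	oid;
} RelInfo;

/*
 * With a flat array, every slot is a distinct struct owned by the array;
 * with an array of pointers, a slot can alias a struct that already exists
 * elsewhere (e.g. an UPDATE per-subplan result rel).
 */
static RelInfo *
slot_flat(RelInfo *arr, int i)
{
	return arr + i;			/* old style: cstate->partitions + i */
}

static RelInfo *
slot_ptr(RelInfo **arr, int i)
{
	return arr[i];			/* new style: cstate->partitions[i] */
}
```

This is why later hunks in this patch can hand an UPDATE's pre-built per-subplan result rels to the tuple-routing array without copying them.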
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index b502941..407fcd2 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -67,15 +67,6 @@ int			SessionReplicationRole = SESSION_REPLICATION_ROLE_ORIGIN;
 /* How many levels deep into trigger execution are we? */
 static int	MyTriggerDepth = 0;
 
-/*
- * Note that similar macros also exist in executor/execMain.c.  There does not
- * appear to be any good header to put them into, given the structures that
- * they use, so we let them be duplicated.  Be sure to update all if one needs
- * to be changed, however.
- */
-#define GetUpdatedColumns(relinfo, estate) \
-	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)
-
 /* Local function prototypes */
 static void ConvertTriggerToFK(CreateTrigStmt *stmt, Oid funcoid);
 static void SetTriggerFlags(TriggerDesc *trigdesc, Trigger *trigger);
@@ -2903,8 +2894,13 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 	{
 		HeapTuple	trigtuple;
 
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
+		/*
+		 * Note: if the UPDATE is converted into a DELETE+INSERT as part of
+		 * an update-partition-key operation, then this function is also called
+		 * separately for DELETE and INSERT to capture transition table rows.
+		 * In such a case, either the old tuple or the new tuple can be NULL.
+		 */
+		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
 			trigtuple = GetTupleForTrigger(estate,
 										   NULL,
 										   relinfo,
@@ -5211,7 +5207,12 @@ AfterTriggerPendingOnRel(Oid relid)
  *	triggers actually need to be queued.  It is also called after each row,
  *	even if there are no triggers for that event, if there are any AFTER
  *	STATEMENT triggers for the statement which use transition tables, so that
- *	the transition tuplestores can be built.
+ *	the transition tuplestores can be built. Furthermore, if the transition
+ *	capture is happening for UPDATEd rows being moved to another partition
+ *	due to a partition-key change, then this function is called once when
+ *	the row is deleted (to capture the OLD row), and once when the row is
+ *	inserted into the other partition (to capture the NEW row). This is done
+ *	separately because the DELETE and the INSERT happen on different tables.
  *
  *	Transition tuplestores are built now, rather than when events are pulled
  *	off of the queue because AFTER ROW triggers are allowed to select from the
@@ -5260,12 +5261,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		bool update_new_table = transition_capture->tcs_update_new_table;
 		bool insert_new_table = transition_capture->tcs_insert_new_table;;
 
-		if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_old_table))
+		if (oldtup != NULL &&
+			((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_old_table)))
 		{
 			Tuplestorestate *old_tuplestore;
 
-			Assert(oldtup != NULL);
 			old_tuplestore = transition_capture->tcs_old_tuplestore;
 
 			if (map != NULL)
@@ -5278,12 +5279,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			else
 				tuplestore_puttuple(old_tuplestore, oldtup);
 		}
-		if ((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_new_table))
+		if (newtup != NULL &&
+			((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
+			(event == TRIGGER_EVENT_UPDATE && update_new_table)))
 		{
 			Tuplestorestate *new_tuplestore;
 
-			Assert(newtup != NULL);
 			if (event == TRIGGER_EVENT_INSERT)
 				new_tuplestore = transition_capture->tcs_insert_tuplestore;
 			else
@@ -5306,7 +5307,8 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		if (trigdesc == NULL ||
 			(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
 			(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+			(event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup == NULL)))
 			return;
 	}
 
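
With the Asserts relaxed as above, a row-movement UPDATE populates the transition tables through two calls: one carrying only the old tuple (the DELETE half) and one carrying only the new tuple (the INSERT half). A toy model of that control flow, assuming simple counters in place of the real tuplestores:

```c
#include <assert.h>
#include <stddef.h>

/* Toy transition tables: counters instead of tuplestores. */
typedef struct Capture
{
	int			n_old;		/* rows captured into the OLD TABLE */
	int			n_new;		/* rows captured into the NEW TABLE */
} Capture;

/*
 * Mirrors the relaxed AfterTriggerSaveEvent() logic for UPDATE events:
 * capture whichever side is present, instead of asserting both are.
 */
static void
save_update_event(Capture *cap, const char *oldtup, const char *newtup)
{
	if (oldtup != NULL)
		cap->n_old++;
	if (newtup != NULL)
		cap->n_new++;
}
```

An ordinary UPDATE makes one call with both tuples; a cross-partition UPDATE makes an old-only call on the source partition and a new-only call on the destination, ending with the same totals.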
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 6671a25..6712e72 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -64,7 +64,6 @@
 #include "utils/snapmgr.h"
 #include "utils/tqual.h"
 
-
 /* Hooks for plugins to get control in ExecutorStart/Run/Finish/End */
 ExecutorStart_hook_type ExecutorStart_hook = NULL;
 ExecutorRun_hook_type ExecutorRun_hook = NULL;
@@ -103,19 +102,6 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
 									 int maxfieldlen);
 static void EvalPlanQualStart(EPQState *epqstate, EState *parentestate,
 				  Plan *planTree);
-static void ExecPartitionCheck(ResultRelInfo *resultRelInfo,
-				   TupleTableSlot *slot, EState *estate);
-
-/*
- * Note that GetUpdatedColumns() also exists in commands/trigger.c.  There does
- * not appear to be any good header to put it into, given the structures that
- * it uses, so we let them be duplicated.  Be sure to update both if one needs
- * to be changed, however.
- */
-#define GetInsertedColumns(relinfo, estate) \
-	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->insertedCols)
-#define GetUpdatedColumns(relinfo, estate) \
-	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)
 
 /* end of local decls */
 
@@ -1823,15 +1809,10 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
 /*
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  */
-static void
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
-	Relation	rel = resultRelInfo->ri_RelationDesc;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	Bitmapset  *modifiedCols;
-	Bitmapset  *insertedCols;
-	Bitmapset  *updatedCols;
 	ExprContext *econtext;
 
 	/*
@@ -1859,52 +1840,66 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
 	 */
-	if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+	return ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext);
+}
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
+
+	/* See the comments in ExecConstraints. */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
+									 gettext_noop("could not convert row type"));
+		if (map != NULL)
 		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
-										 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecSetSlotDescriptor(slot, tupdesc);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+			tuple = do_convert_tuple(tuple, map);
+			ExecSetSlotDescriptor(slot, tupdesc);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("new row for relation \"%s\" violates partition constraint",
-						RelationGetRelationName(orig_rel)),
-				 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+			 errmsg("new row for relation \"%s\" violates partition constraint",
+					RelationGetRelationName(orig_rel)),
+			 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if requested,
+ * checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1912,7 +1907,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2027,8 +2023,9 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 }
 
 
@@ -3213,33 +3210,39 @@ EvalPlanQualEnd(EPQState *epqstate)
  * ExecSetupPartitionTupleRouting - set up information needed during
  * tuple routing for partitioned tables
  *
+ * 'update_rri' has the UPDATE per-subplan result rels.
+ * 'num_update_rri' is the number of UPDATE per-subplan result rels; for
+ *      INSERT, this is 0.
+ *
  * Output arguments:
  * 'pd' receives an array of PartitionDispatch objects with one entry for
  *		every partitioned table in the partition tree
- * 'partitions' receives an array of ResultRelInfo objects with one entry for
+ * 'partitions' receives an array of ResultRelInfo* objects with one entry for
  *		every leaf partition in the partition tree
- * 'tup_conv_maps' receives an array of TupleConversionMap objects with one
- *		entry for every leaf partition (required to convert input tuple based
- *		on the root table's rowtype to a leaf partition's rowtype after tuple
- *		routing is done)
+ * 'perleaf_parentchild_maps' receives an array of TupleConversionMap objects
+ *		with one entry for every leaf partition (required to convert input tuple
+ *		based on the root table's rowtype to a leaf partition's rowtype after
+ *		tuple routing is done)
  * 'partition_tuple_slot' receives a standalone TupleTableSlot to be used
  *		to manipulate any given leaf partition's rowtype after that partition
  *		is chosen by tuple-routing.
  * 'num_parted' receives the number of partitioned tables in the partition
  *		tree (= the number of entries in the 'pd' output array)
  * 'num_partitions' receives the number of leaf partitions in the partition
- *		tree (= the number of entries in the 'partitions' and 'tup_conv_maps'
- *		output arrays
+ *		tree (= the number of entries in the 'partitions' and
+ *		'perleaf_parentchild_maps' output arrays)
  *
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  */
 void
 ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   PartitionDispatch **pd,
-							   ResultRelInfo **partitions,
-							   TupleConversionMap ***tup_conv_maps,
+							   ResultRelInfo ***partitions,
+							   TupleConversionMap ***perleaf_parentchild_maps,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions)
 {
@@ -3247,16 +3250,45 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
+	ResultRelInfo *leaf_part_arr;
+	ResultRelInfo *cur_update_rri;
+	Oid			cur_reloid = InvalidOid;
 
 	/* Get the tuple-routing information and lock partitions */
 	*pd = RelationGetPartitionDispatchInfo(rel, RowExclusiveLock, num_parted,
 										   &leaf_parts);
 	*num_partitions = list_length(leaf_parts);
-	*partitions = (ResultRelInfo *) palloc(*num_partitions *
-										   sizeof(ResultRelInfo));
-	*tup_conv_maps = (TupleConversionMap **) palloc0(*num_partitions *
-													 sizeof(TupleConversionMap *));
+	*partitions = (ResultRelInfo **) palloc(*num_partitions *
+											sizeof(ResultRelInfo *));
+	*perleaf_parentchild_maps = (TupleConversionMap **) palloc0(*num_partitions *
+																sizeof(TupleConversionMap *));
+
+	if (num_update_rri != 0)
+	{
+		/*
+		 * For UPDATEs, if a leaf partition is already present among the
+		 * per-subplan result rels, we reuse that entry rather than
+		 * initializing a new result rel. The per-subplan result rels and
+		 * the leaf partition result rels are both in the same canonical
+		 * order, so while walking through the leaf partition OIDs we only
+		 * need to keep track of which per-subplan result rel to look for
+		 * next. Therefore, position cur_update_rri at the first
+		 * per-subplan result rel to begin with, and then advance it each
+		 * time we find a match while scanning the leaf partition OIDs.
+		 */
+		cur_update_rri = update_rri;
+		cur_reloid = RelationGetRelid(cur_update_rri->ri_RelationDesc);
+	}
+	else
+	{
+		/*
+		 * For inserts, we need to create all new result rels, so avoid
+		 * repeated pallocs by allocating memory for all the result rels in
+		 * bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(*num_partitions *
+												  sizeof(ResultRelInfo));
+	}
 
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
@@ -3266,23 +3298,76 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	 */
 	*partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = *partitions;
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (num_update_rri != 0)
+		{
+			/* Is this leaf partition present among the UPDATE result rels? */
+			if (cur_reloid == leaf_oid)
+			{
+				Assert(cur_update_rri <= update_rri + num_update_rri - 1);
+
+				leaf_part_rri = cur_update_rri;
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required when converting a tuple to the root
+				 * partition's tuple descriptor; it was not set when the
+				 * update plans were generated.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+
+				cur_update_rri++;
+
+				/*
+				 * If this was the last UPDATE resultrel, indicate that by
+				 * invalidating the cur_reloid.
+				 */
+				if (cur_update_rri == update_rri + num_update_rri)
+					cur_reloid = InvalidOid;
+				else
+					cur_reloid = RelationGetRelid(cur_update_rri->ri_RelationDesc);
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = leaf_part_arr + i;
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in *partitions are eventually
-		 * closed by the caller.
+		 * If we didn't open the partition rel above, then we haven't
+		 * initialized the corresponding result rel either.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above including the leaf
+			 * partitions. Note that each of the newly opened relations in
+			 * *partitions are eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri,
+							  partrel,
+							  resultRTindex,
+							  rel,
+							  0);
+		}
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
-		 * Verify result relation is a valid target for the current operation.
+		 * Verify that the result relation is a valid target for an INSERT.
+		 * Even for UPDATEs this check applies, because tuple routing inserts
+		 * the moved row into the chosen partition.
 		 */
 		CheckValidResultRel(partrel, CMD_INSERT);
 
@@ -3290,14 +3375,8 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		 * Save a tuple conversion map to convert a tuple routed to this
 		 * partition from the parent's type to the partition's.
 		 */
-		(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
-													 gettext_noop("could not convert row type"));
-
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  resultRTindex,
-						  rel,
-						  0);
+		(*perleaf_parentchild_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
+																gettext_noop("could not convert row type"));
 
 		/*
 		 * Open partition indices (remember we do not support ON CONFLICT in
@@ -3308,9 +3387,18 @@ ExecSetupPartitionTupleRouting(Relation rel,
 			leaf_part_rri->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(leaf_part_rri, false);
 
-		leaf_part_rri++;
+		(*partitions)[i] = leaf_part_rri;
 		i++;
 	}
+
+	/*
+	 * For UPDATE, we should have found all the per-subplan resultrels in the
+	 * leaf partitions, so cur_update_rri should now point just past the
+	 * last per-subplan resultrel.
+	 */
+	Assert(num_update_rri == 0 ||
+		   (cur_reloid == InvalidOid &&
+			cur_update_rri == update_rri + num_update_rri));
 }
 
 /*
@@ -3336,8 +3424,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 	result = get_partition_for_tuple(pd, slot, estate,
 									 &failed_at, &failed_slot);
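
The cur_update_rri/cur_reloid bookkeeping added to ExecSetupPartitionTupleRouting() above is a single merge pass over two lists that share the same canonical OID order. In isolation, the scan can be sketched like this (plain arrays of OIDs standing in for the ResultRelInfo structures):

```c
#include <assert.h>

/*
 * For each leaf OID, record whether it matches a per-subplan entry (1) or
 * needs a freshly allocated result rel (0). Both arrays must be sorted in
 * the same ascending order. Returns 1 iff every subplan entry was matched,
 * mirroring the Assert at the end of the real function.
 */
static int
mark_reused(const unsigned *leaf_oids, int nleaf,
			const unsigned *subplan_oids, int nsub,
			int *reused)
{
	int			cur = 0;	/* next subplan entry to look for */

	for (int i = 0; i < nleaf; i++)
	{
		if (cur < nsub && subplan_oids[cur] == leaf_oids[i])
		{
			reused[i] = 1;	/* reuse the UPDATE per-subplan result rel */
			cur++;
		}
		else
			reused[i] = 0;	/* palloc0 a new result rel */
	}
	return cur == nsub;
}
```

Because both inputs are sorted identically, the whole matching costs O(nleaf) with no lookups; that is what makes the single-cursor cur_update_rri scheme in the patch safe.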
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 3819de2..7cb1c2c 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -401,7 +401,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -466,7 +466,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
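
The nodeModifyTable.c changes that follow implement row movement as ExecDelete() plus ExecInsert(), with the INSERT skipped whenever the DELETE found nothing (a concurrent delete, a BR trigger suppressing it, etc.). The invariant, stripped down to a toy single-row model rather than real executor code:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy partition holding at most one row; -1 means empty. */
typedef struct Part
{
	int			row;
} Part;

static bool
toy_delete(Part *p, int key)
{
	if (p->row != key)
		return false;		/* row already gone: the delete is "skipped" */
	p->row = -1;
	return true;
}

/*
 * Mirrors the row-movement rule in the patched ExecUpdate(): if the DELETE
 * was skipped, skip the INSERT too, so a cross-partition UPDATE can never
 * duplicate a row or resurrect a just-deleted one.
 */
static bool
move_row(Part *src, Part *dst, int key)
{
	if (!toy_delete(src, key))
		return false;		/* the whole UPDATE reports zero rows */
	dst->row = key;			/* the INSERT half */
	return true;
}
```

This is the limited stand-in for EvalPlanQual that the patch comment describes: CTID chains cannot span relations, so "delete found nothing" is the only concurrency signal available to the INSERT half.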
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 30add8e..0d334e8 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -45,6 +45,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -53,7 +54,6 @@
 #include "utils/rel.h"
 #include "utils/tqual.h"
 
-
 static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 ResultRelInfo *resultRelInfo,
 					 ItemPointer conflictTid,
@@ -239,6 +239,36 @@ ExecCheckTIDVisible(EState *estate,
 	ReleaseBuffer(buffer);
 }
 
+/*
+ * ConvertPartitionTupleSlot -- convenience function for converting a tuple
+ * and storing it into the tuple slot provided through 'new_slot', which
+ * should typically be one of the dedicated partition tuple slots. Passes
+ * 'new_slot' back through the output param 'p_old_slot'. If no mapping is
+ * present, leaves *p_old_slot unchanged.
+ *
+ * Returns the converted tuple.
+ */
+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate, TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot, TupleTableSlot **p_old_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor, as per converted tuple.
+	 */
+	*p_old_slot = new_slot;
+	Assert(new_slot != NULL);
+	ExecSetSlotDescriptor(new_slot, map->outdesc);
+	ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
 /* ----------------------------------------------------------------
  *		ExecInsert
  *
@@ -280,17 +310,50 @@ ExecInsert(ModifyTableState *mtstate,
 	if (mtstate->mt_partition_dispatch_info)
 	{
 		int			leaf_part_index;
-		TupleConversionMap *map;
+		ResultRelInfo *rootResultRelInfo;
+
+		/*
+		 * If the original operation is UPDATE, the root partition rel needs
+		 * to be fetched from mtstate->rootResultRelInfo.
+		 */
+		rootResultRelInfo = (mtstate->rootResultRelInfo ?
+							 mtstate->rootResultRelInfo : resultRelInfo);
+
+		/*
+		 * If the resultRelInfo is not the root partition (which happens for
+		 * UPDATE), we should convert the tuple into the root partition's
+		 * tuple descriptor, since ExecFindPartition() starts its search from
+		 * the root. The tuple conversion map list is in the order of
+		 * mtstate->resultRelInfo[], so to retrieve the one for this
+		 * resultRel, we need to know the position of the resultRel in
+		 * mtstate->resultRelInfo[]. Note: We assume that if the resultRelInfo
+		 * does not belong to subplans, then it already matches the root tuple
+		 * descriptor; although there is no such known scenario where this
+		 * could happen.
+		 */
+		if (rootResultRelInfo != resultRelInfo &&
+			mtstate->mt_persubplan_childparent_maps != NULL &&
+			resultRelInfo >= mtstate->resultRelInfo &&
+			resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1)
+		{
+			int			map_index = resultRelInfo - mtstate->resultRelInfo;
+
+			tuple = ConvertPartitionTupleSlot(mtstate,
+											  mtstate->mt_persubplan_childparent_maps[map_index],
+											  tuple,
+											  mtstate->mt_rootpartition_tuple_slot,
+											  &slot);
+		}
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
 		 * ExecFindPartition() does not return and errors out instead.
 		 * Otherwise, the returned value is to be used as an index into arrays
-		 * mt_partitions[] and mt_partition_tupconv_maps[] that will get us
+		 * mt_partitions[] and mt_perleaf_parentchild_maps[] that will get us
 		 * the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
-		leaf_part_index = ExecFindPartition(resultRelInfo,
+		leaf_part_index = ExecFindPartition(rootResultRelInfo,
 											mtstate->mt_partition_dispatch_info,
 											slot,
 											estate);
@@ -302,7 +365,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * the selected partition.
 		 */
 		saved_resultRelInfo = resultRelInfo;
-		resultRelInfo = mtstate->mt_partitions + leaf_part_index;
+		resultRelInfo = mtstate->mt_partitions[leaf_part_index];
 
 		/* We do not yet have a way to insert into a foreign partition */
 		if (resultRelInfo->ri_FdwRoutine)
@@ -330,7 +393,7 @@ ExecInsert(ModifyTableState *mtstate,
 				 */
 				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
 				mtstate->mt_transition_capture->tcs_map =
-					mtstate->mt_transition_tupconv_maps[leaf_part_index];
+					mtstate->mt_perleaf_childparent_maps[leaf_part_index];
 			}
 			else
 			{
@@ -347,23 +410,11 @@ ExecInsert(ModifyTableState *mtstate,
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = mtstate->mt_partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = mtstate->mt_partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(mtstate,
+										  mtstate->mt_perleaf_parentchild_maps[leaf_part_index],
+										  tuple,
+										  mtstate->mt_partition_tuple_slot,
+										  &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -481,7 +532,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 		/* Check the constraints of the tuple */
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -621,6 +672,19 @@ ExecInsert(ModifyTableState *mtstate,
 	ExecARInsertTriggers(estate, resultRelInfo, tuple, recheckIndexes,
 						 mtstate->mt_transition_capture);
 
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition NEW TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
+		ExecARUpdateTriggers(estate, resultRelInfo, InvalidOid,
+							 NULL,
+							 tuple,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
 	list_free(recheckIndexes);
 
 	/*
@@ -673,6 +737,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool *delete_skipped,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -681,6 +747,9 @@ ExecDelete(ModifyTableState *mtstate,
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
 
+	if (delete_skipped)
+		*delete_skipped = true;
+
 	/*
 	 * get information on the (current) result relation
 	 */
@@ -844,12 +913,29 @@ ldelete:;
 	if (canSetTag)
 		(estate->es_processed)++;
 
+	/* The delete has actually happened, so inform that to the caller */
+	if (delete_skipped)
+		*delete_skipped = false;
+
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
 						 mtstate->mt_transition_capture);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition OLD TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
+		ExecARUpdateTriggers(estate, resultRelInfo, tupleid,
+							 oldtuple,
+							 NULL,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -942,6 +1028,8 @@ ExecUpdate(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	TupleConversionMap *saved_tcs_map = NULL;
+
 
 	/*
 	 * abort the operation if not running transactions
@@ -1038,12 +1126,82 @@ lreplace:;
 								 resultRelInfo, slot, estate);
 
 		/*
+		 * If a partition check fails, try to move the row into the right
+		 * partition.
+		 */
+		if (resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate))
+		{
+			bool		delete_skipped;
+			TupleTableSlot *ret_slot;
+
+			/*
+			 * When an UPDATE is run directly on a leaf partition, we will not
+			 * have set up partition tuple routing. In that case, fail with a
+			 * partition constraint violation error.
+			 */
+			if (mtstate->mt_partition_dispatch_info == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for DELETE. We want to return rows
+			 * from INSERT.
+			 */
+			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
+					   &delete_skipped, false, false);
+
+			/*
+			 * If for some reason the DELETE didn't happen (e.g. a trigger
+			 * prevented it, or the row was already deleted by this command,
+			 * or it was concurrently deleted by another transaction), then
+			 * we should skip the INSERT as well; otherwise we would
+			 * effectively insert one brand-new row.
+			 *
+			 * For a normal UPDATE, the case where the tuple has been the
+			 * subject of a concurrent UPDATE or DELETE would be handled by
+			 * the EvalPlanQual machinery, but for an UPDATE that we've
+			 * translated into a DELETE from this partition and an INSERT into
+			 * some other partition, that's not available, because CTID chains
+			 * can't span relation boundaries.  We mimic the semantics to a
+			 * limited extent by skipping the INSERT if the DELETE fails to
+			 * find a tuple. This ensures that two concurrent attempts to
+			 * UPDATE the same tuple at the same time can't turn one tuple
+			 * into two, and that an UPDATE of a just-deleted tuple can't
+			 * resurrect it.
+			 */
+			if (delete_skipped)
+				return NULL;
+
+			if (mtstate->mt_transition_capture)
+				saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
+
+			ret_slot = ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+
+			if (mtstate->mt_transition_capture)
+			{
+				/*
+				 * Revert to the transition capture map created for UPDATE;
+				 * otherwise the next UPDATE will incorrectly use the one
+				 * created for INSERT.
+				 */
+				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+				mtstate->mt_transition_capture->tcs_map = saved_tcs_map;
+			}
+			return ret_slot;
+		}
+
+		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We have already checked partition constraints above, so skip them
+		 * below.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1462,6 +1620,45 @@ fireASTriggers(ModifyTableState *node)
 }
 
 /*
+ * Set up the per-subplan tuple conversion maps from each child result rel to
+ * the root partitioned table. The maps are needed for collecting transition
+ * tuples for AFTER triggers, and for UPDATE row movement.
+ */
+static void
+ExecSetupPerSubplanChildParentMap(ModifyTableState *mtstate)
+{
+	TupleConversionMap **tup_conv_maps;
+	TupleDesc	outdesc;
+	ResultRelInfo *resultRelInfo;
+	ResultRelInfo *rootRelInfo;
+	int			nplans = mtstate->mt_nplans;
+	int			i;
+
+	Assert(mtstate->operation != CMD_INSERT);
+
+	if (mtstate->mt_persubplan_childparent_maps != NULL)
+		return;
+
+	rootRelInfo = getASTriggerResultRelInfo(mtstate);
+
+	mtstate->mt_persubplan_childparent_maps =
+		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) * nplans);
+
+	/* Get tuple descriptor of the root partition. */
+	outdesc = RelationGetDescr(rootRelInfo->ri_RelationDesc);
+
+	resultRelInfo = mtstate->resultRelInfo;
+	tup_conv_maps = mtstate->mt_persubplan_childparent_maps;
+	for (i = 0; i < nplans; i++)
+	{
+		TupleDesc	indesc = RelationGetDescr(resultRelInfo[i].ri_RelationDesc);
+
+		tup_conv_maps[i] = convert_tuples_by_name(indesc, outdesc,
+												  gettext_noop("could not convert row type"));
+	}
+}
+
+/*
  * Set up the state needed for collecting transition tuples for AFTER
  * triggers.
  */
@@ -1469,63 +1666,115 @@ static void
 ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 {
 	ResultRelInfo *targetRelInfo = getASTriggerResultRelInfo(mtstate);
-	int		i;
+	ResultRelInfo **resultRelInfos;
+	int			numResultRelInfos;
+	int			update_rri_index = -1;
+	ResultRelInfo *update_rri = mtstate->resultRelInfo;
+	Oid			cur_reloid = InvalidOid;
+	int			i;
 
 	/* Check for transition tables on the directly targeted relation. */
 	mtstate->mt_transition_capture =
 		MakeTransitionCaptureState(targetRelInfo->ri_TrigDesc);
 
+	if (mtstate->mt_transition_capture == NULL)
+		return;
+
 	/*
-	 * If we found that we need to collect transition tuples then we may also
+	 * Now that we know that we need to collect transition tuples, we may also
 	 * need tuple conversion maps for any children that have TupleDescs that
 	 * aren't compatible with the tuplestores.
 	 */
-	if (mtstate->mt_transition_capture != NULL)
+
+	/* Make sure per-subplan mapping is there. */
+	if (mtstate->operation != CMD_INSERT)
+		ExecSetupPerSubplanChildParentMap(mtstate);
+
+	/*
+	 * Install the conversion map for the first plan for UPDATE and DELETE
+	 * operations.  It will be advanced each time we switch to the next plan.
+	 * (INSERT operations set it every time.)
+	 */
+	if (mtstate->mt_persubplan_childparent_maps)
+	{
+		mtstate->mt_transition_capture->tcs_map =
+			mtstate->mt_persubplan_childparent_maps[0];
+	}
+
+	/* If no tuple routing, return without setting up per-leaf-partition map */
+	if (mtstate->mt_partition_dispatch_info == NULL)
+		return;
+
+	numResultRelInfos = mtstate->mt_num_partitions;
+	resultRelInfos = mtstate->mt_partitions;
+
+	/*
+	 * Build array of conversion maps from each child's TupleDesc to the one
+	 * used in the tuplestore.  The map pointers may be NULL when no
+	 * conversion is necessary, which is hopefully a common case for
+	 * partitions.
+	 */
+	mtstate->mt_perleaf_childparent_maps = (TupleConversionMap **)
+		palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+
+	/* For INSERTs, just create all the map elements afresh. */
+	if (mtstate->operation == CMD_INSERT)
 	{
-		ResultRelInfo *resultRelInfos;
-		int		numResultRelInfos;
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_perleaf_childparent_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
+									   RelationGetDescr(targetRelInfo->ri_RelationDesc),
+									   gettext_noop("could not convert row type"));
+		}
+		return;
+	}
 
-		/* Find the set of partitions so that we can find their TupleDescs. */
-		if (mtstate->mt_partition_dispatch_info != NULL)
+	/*
+	 * For UPDATEs, however, we can share the per-subplan maps with the
+	 * per-leaf maps wherever a leaf partition is also a subplan result rel.
+	 */
+	update_rri_index = 0;
+	update_rri = mtstate->resultRelInfo;
+	if (mtstate->mt_nplans > 0)
+		cur_reloid = RelationGetRelid(update_rri[0].ri_RelationDesc);
+
+	for (i = 0; i < numResultRelInfos; ++i)
+	{
+		ResultRelInfo *resultRelInfo = mtstate->mt_partitions[i];
+
+		/* Is this leaf partition present in the UPDATE result rels? */
+		if (cur_reloid == RelationGetRelid(resultRelInfo->ri_RelationDesc))
 		{
+			Assert(update_rri_index < mtstate->mt_nplans);
+
+			mtstate->mt_perleaf_childparent_maps[i] =
+				mtstate->mt_persubplan_childparent_maps[update_rri_index];
+			update_rri_index++;
+
 			/*
-			 * For INSERT via partitioned table, so we need TupleDescs based
-			 * on the partition routing table.
+			 * If this was the last UPDATE result rel, indicate that by
+			 * setting cur_reloid to InvalidOid.
 			 */
-			resultRelInfos = mtstate->mt_partitions;
-			numResultRelInfos = mtstate->mt_num_partitions;
+			if (update_rri_index == mtstate->mt_nplans)
+				cur_reloid = InvalidOid;
+			else
+				cur_reloid = RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc);
 		}
 		else
 		{
-			/* Otherwise we need the ResultRelInfo for each subplan. */
-			resultRelInfos = mtstate->resultRelInfo;
-			numResultRelInfos = mtstate->mt_nplans;
-		}
-
-		/*
-		 * Build array of conversion maps from each child's TupleDesc to the
-		 * one used in the tuplestore.  The map pointers may be NULL when no
-		 * conversion is necessary, which is hopefully a common case for
-		 * partitions.
-		 */
-		mtstate->mt_transition_tupconv_maps = (TupleConversionMap **)
-			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
-		for (i = 0; i < numResultRelInfos; ++i)
-		{
-			mtstate->mt_transition_tupconv_maps[i] =
-				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
+			mtstate->mt_perleaf_childparent_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfo->ri_RelationDesc),
 									   RelationGetDescr(targetRelInfo->ri_RelationDesc),
 									   gettext_noop("could not convert row type"));
 		}
-
-		/*
-		 * Install the conversion map for the first plan for UPDATE and DELETE
-		 * operations.  It will be advanced each time we switch to the next
-		 * plan.  (INSERT operations set it every time.)
-		 */
-		mtstate->mt_transition_capture->tcs_map =
-			mtstate->mt_transition_tupconv_maps[0];
 	}
+
+	/*
+	 * We should have found all the per-subplan reloids in the leaf
+	 * partitions.
+	 */
+	Assert(update_rri_index == mtstate->mt_nplans);
 }
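
The matching loop above relies on both arrays being in the same canonical order, with the subplan result rels forming a subsequence of the leaf partitions. As a review aid, here is a minimal standalone sketch of that merge (hypothetical names; plain ints stand in for relation OIDs and TupleConversionMap pointers):

```c
#include <assert.h>

/*
 * Walk the leaf partitions once; whenever the current leaf matches the
 * current subplan result rel, share its already-built map and advance the
 * subplan cursor.  Otherwise a fresh map (modeled as -1) would be built.
 */
static void
share_subplan_maps(const int *leaf_oids, int nleaves,
				   const int *subplan_oids, const int *subplan_maps,
				   int nplans, int *leaf_maps)
{
	int			cur = 0;		/* cursor over the subplan result rels */
	int			i;

	for (i = 0; i < nleaves; i++)
	{
		if (cur < nplans && leaf_oids[i] == subplan_oids[cur])
		{
			leaf_maps[i] = subplan_maps[cur];	/* share existing map */
			cur++;
		}
		else
			leaf_maps[i] = -1;	/* a fresh map would be built here */
	}

	/* Every subplan result rel must appear among the leaf partitions. */
	assert(cur == nplans);
}
```

This is why the patch cares about generating appinfos in canonical order: the match can then be done in a single linear pass.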
 
 /* ----------------------------------------------------------------
@@ -1631,9 +1880,9 @@ ExecModifyTable(PlanState *pstate)
 				if (node->mt_transition_capture != NULL)
 				{
 					/* Prepare to convert transition tuples from this child. */
-					Assert(node->mt_transition_tupconv_maps != NULL);
+					Assert(node->mt_persubplan_childparent_maps != NULL);
 					node->mt_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						node->mt_persubplan_childparent_maps[node->mt_whichplan];
 				}
 				continue;
 			}
@@ -1749,7 +1998,8 @@ ExecModifyTable(PlanState *pstate)
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1794,9 +2044,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ResultRelInfo *resultRelInfo;
 	TupleDesc	tupDesc;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	bool		update_tuple_routing_needed = node->part_cols_updated;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -1869,6 +2122,15 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(resultRelInfo, mtstate->mt_onconflict != ONCONFLICT_NONE);
 
+		/*
+		 * If this is an UPDATE and a BEFORE UPDATE trigger is present, we may
+		 * need to do update tuple routing even if the plan itself doesn't
+		 * update the partition key, because the trigger might change the row
+		 * so that it no longer fits the current partition.
+		 */
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+			operation == CMD_UPDATE)
+			update_tuple_routing_needed = true;
+
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
 		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
@@ -1906,32 +2168,62 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/* Decide whether we need to perform update tuple routing. */
+	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		update_tuple_routing_needed = false;
+
+	/*
+	 * Build state for tuple routing if it's an INSERT or if it's an UPDATE of
+	 * partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || update_tuple_routing_needed))
 	{
 		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo *partitions;
-		TupleConversionMap **partition_tupconv_maps;
+		ResultRelInfo **partitions;
+		TupleConversionMap **perleaf_parentchild_maps;
 		TupleTableSlot *partition_tuple_slot;
 		int			num_parted,
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(rel,
+									   (operation == CMD_UPDATE ?
+										mtstate->resultRelInfo : NULL),
+									   (operation == CMD_UPDATE ? nplans : 0),
 									   node->nominalRelation,
 									   &partition_dispatch_info,
 									   &partitions,
-									   &partition_tupconv_maps,
+									   &perleaf_parentchild_maps,
 									   &partition_tuple_slot,
 									   &num_parted, &num_partitions);
 		mtstate->mt_partition_dispatch_info = partition_dispatch_info;
 		mtstate->mt_num_dispatch = num_parted;
 		mtstate->mt_partitions = partitions;
 		mtstate->mt_num_partitions = num_partitions;
-		mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
+		mtstate->mt_perleaf_parentchild_maps = perleaf_parentchild_maps;
 		mtstate->mt_partition_tuple_slot = partition_tuple_slot;
+		mtstate->mt_rootpartition_tuple_slot = MakeTupleTableSlot();
+
+		/*
+		 * The following are needed as reference objects for mapping
+		 * partition attnos in expressions such as WCO and RETURNING.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	}
 
+	/*
+	 * Construct a mapping from each of the resultRelInfo attnos to the root
+	 * attnos. This is required during UPDATE row movement, when the tuple
+	 * descriptor of a source partition does not match that of the root
+	 * partitioned table. In such a case we need to convert tuples to the
+	 * root's tuple descriptor, because the search for the destination
+	 * partition starts from the root. Skip this setup if update tuple
+	 * routing is not needed.
+	 */
+	if (update_tuple_routing_needed)
+		ExecSetupPerSubplanChildParentMap(mtstate);
+
 	/* Build state for collecting transition tuples */
 	ExecSetupTransitionCaptureState(mtstate, estate);
 
@@ -1965,50 +2257,54 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE row
+	 * movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *firstWco;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition. For UPDATE, by contrast, there are as many WCO lists as
+		 * there are plans. In either case, use the WCO expression of the
+		 * first resultRelInfo as a reference to compute attnos for the WCO
+		 * expression of each of the partitions. We make a copy of the WCO
+		 * qual for each partition; note that, if there are SubPlans in
+		 * there, they all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
-			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
-		resultRelInfo = mtstate->mt_partitions;
+		Assert(update_tuple_routing_needed ||
+			   (operation == CMD_INSERT &&
+				list_length(node->withCheckOptionLists) == 1 &&
+				mtstate->mt_nplans == 1));
+
+		firstWco = linitial(node->withCheckOptionLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-			List	   *mapped_wcoList;
+			Relation	partrel;
+			List	   *mappedWco;
 			List	   *wcoExprs = NIL;
 			ListCell   *ll;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel, NULL);
-			foreach(ll, mapped_wcoList)
+			resultRelInfo = mtstate->mt_partitions[i];
+
+			partrel = resultRelInfo->ri_RelationDesc;
+			mappedWco = map_partition_varattnos(firstWco,
+												firstVarno,
+												partrel, firstResultRel,
+												NULL);
+			foreach(ll, mappedWco)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
 
-			resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
+			resultRelInfo->ri_WithCheckOptions = mappedWco;
 			resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
-			resultRelInfo++;
 		}
 	}
 
@@ -2019,7 +2315,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -2056,20 +2352,26 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		resultRelInfo = mtstate->mt_partitions;
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
+			Relation	partrel;
 			List	   *rlist;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel, NULL);
+			resultRelInfo = mtstate->mt_partitions[i];
+			partrel = resultRelInfo->ri_RelationDesc;
+
+			/*
+			 * Use the RETURNING expression of the first resultRelInfo as a
+			 * reference to compute attnos for the RETURNING expression of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList,
+											firstVarno,
+											partrel, firstResultRel, NULL);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
@@ -2315,6 +2617,7 @@ void
 ExecEndModifyTable(ModifyTableState *node)
 {
 	int			i;
+	CmdType		operation = node->operation;
 
 	/* Free transition tables */
 	if (node->mt_transition_capture != NULL)
@@ -2351,13 +2654,25 @@ ExecEndModifyTable(ModifyTableState *node)
 	}
 	for (i = 0; i < node->mt_num_partitions; i++)
 	{
-		ResultRelInfo *resultRelInfo = node->mt_partitions + i;
+		ResultRelInfo *resultRelInfo = node->mt_partitions[i];
+
+		/*
+		 * If this result rel is one of the subplan result rels, let
+		 * ExecEndPlan() close it. For INSERTs, this does not apply, because
+		 * in that case all leaf partition result rels are newly allocated.
+		 */
+		if (operation == CMD_UPDATE &&
+			resultRelInfo >= node->resultRelInfo &&
+			resultRelInfo < node->resultRelInfo + node->mt_nplans)
+			continue;
 
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
 
-	/* Release the standalone partition tuple descriptor, if any */
+	/* Release the standalone partition tuple descriptors, if any */
+	if (node->mt_rootpartition_tuple_slot)
+		ExecDropSingleTupleTableSlot(node->mt_rootpartition_tuple_slot);
 	if (node->mt_partition_tuple_slot)
 		ExecDropSingleTupleTableSlot(node->mt_partition_tuple_slot);
 
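
To make the executor-side invariant concrete — a row-moving UPDATE must not turn one tuple into two, nor resurrect a just-deleted tuple — here is a small hedged sketch (hypothetical names; two counters stand in for the source and destination partitions) of the DELETE-then-INSERT rule:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct
{
	int			src_count;		/* rows left in the source partition */
	int			dst_count;		/* rows in the destination partition */
} moved_update;

/*
 * Mirror the patch's rule: if the DELETE didn't find a live tuple (trigger
 * suppressed it, or it was concurrently deleted), skip the INSERT too.
 */
static void
move_row(moved_update *m, bool delete_found_tuple)
{
	if (!delete_found_tuple)
		return;					/* no DELETE => no INSERT */
	m->src_count--;
	m->dst_count++;
}
```

This only approximates EvalPlanQual semantics, as the comment in the patch concedes, since CTID chains can't span relation boundaries.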
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 45a04b0..4156e02 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -204,6 +204,7 @@ _copyModifyTable(const ModifyTable *from)
 	COPY_SCALAR_FIELD(canSetTag);
 	COPY_SCALAR_FIELD(nominalRelation);
 	COPY_NODE_FIELD(partitioned_rels);
+	COPY_SCALAR_FIELD(part_cols_updated);
 	COPY_NODE_FIELD(resultRelations);
 	COPY_SCALAR_FIELD(resultRelIndex);
 	COPY_SCALAR_FIELD(rootResultRelIndex);
@@ -2256,6 +2257,7 @@ _copyPartitionedChildRelInfo(const PartitionedChildRelInfo *from)
 
 	COPY_SCALAR_FIELD(parent_relid);
 	COPY_NODE_FIELD(child_rels);
+	COPY_BITMAPSET_FIELD(all_part_cols);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 8d92c03..f2df72b 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -909,6 +909,7 @@ _equalPartitionedChildRelInfo(const PartitionedChildRelInfo *a, const Partitione
 {
 	COMPARE_SCALAR_FIELD(parent_relid);
 	COMPARE_NODE_FIELD(child_rels);
+	COMPARE_BITMAPSET_FIELD(all_part_cols);
 
 	return true;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 379d92a..2ca8a71 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -367,6 +367,7 @@ _outModifyTable(StringInfo str, const ModifyTable *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_INT_FIELD(resultRelIndex);
 	WRITE_INT_FIELD(rootResultRelIndex);
@@ -2094,6 +2095,7 @@ _outModifyTablePath(StringInfo str, const ModifyTablePath *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_NODE_FIELD(subroots);
@@ -2516,6 +2518,7 @@ _outPartitionedChildRelInfo(StringInfo str, const PartitionedChildRelInfo *node)
 
 	WRITE_UINT_FIELD(parent_relid);
 	WRITE_NODE_FIELD(child_rels);
+	WRITE_BITMAPSET_FIELD(all_part_cols);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 86c811d..949053c 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1562,6 +1562,7 @@ _readModifyTable(void)
 	READ_BOOL_FIELD(canSetTag);
 	READ_UINT_FIELD(nominalRelation);
 	READ_NODE_FIELD(partitioned_rels);
+	READ_BOOL_FIELD(part_cols_updated);
 	READ_NODE_FIELD(resultRelations);
 	READ_INT_FIELD(resultRelIndex);
 	READ_INT_FIELD(rootResultRelIndex);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index f087ddb..064af0f 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1291,7 +1291,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	rte = planner_rt_fetch(rel->relid, root);
 	if (rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		partitioned_rels = get_partitioned_child_rels(root, rel->relid);
+		partitioned_rels = get_partitioned_child_rels(root, rel->relid, NULL);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
 	}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 5c934f2..fa270f8 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -277,6 +277,7 @@ static ProjectSet *make_project_set(List *tlist, Plan *subplan);
 static ModifyTable *make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
@@ -2357,6 +2358,7 @@ create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 							best_path->canSetTag,
 							best_path->nominalRelation,
 							best_path->partitioned_rels,
+							best_path->part_cols_updated,
 							best_path->resultRelations,
 							subplans,
 							best_path->withCheckOptionLists,
@@ -6399,6 +6401,7 @@ static ModifyTable *
 make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam)
@@ -6425,6 +6428,7 @@ make_modifytable(PlannerInfo *root,
 	node->canSetTag = canSetTag;
 	node->nominalRelation = nominalRelation;
 	node->partitioned_rels = partitioned_rels;
+	node->part_cols_updated = part_cols_updated;
 	node->resultRelations = resultRelations;
 	node->resultRelIndex = -1;	/* will be set correctly in setrefs.c */
 	node->rootResultRelIndex = -1;	/* will be set correctly in setrefs.c */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 2988c11..cf91907 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1042,6 +1042,7 @@ inheritance_planner(PlannerInfo *root)
 	Index		rti;
 	RangeTblEntry *parent_rte;
 	List	   *partitioned_rels = NIL;
+	bool		part_cols_updated = false;
 
 	Assert(parse->commandType != CMD_INSERT);
 
@@ -1356,9 +1357,15 @@ inheritance_planner(PlannerInfo *root)
 
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		partitioned_rels = get_partitioned_child_rels(root, parentRTindex);
+		Bitmapset  *all_part_cols = NULL;
+
+		partitioned_rels = get_partitioned_child_rels(root, parentRTindex,
+													  &all_part_cols);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
+
+		if (bms_overlap(all_part_cols, parent_rte->updatedCols))
+			part_cols_updated = true;
 	}
 
 	/* Result path must go into outer query's FINAL upperrel */
@@ -1415,6 +1422,7 @@ inheritance_planner(PlannerInfo *root)
 									 parse->canSetTag,
 									 nominalRelation,
 									 partitioned_rels,
+									 part_cols_updated,
 									 resultRelations,
 									 subpaths,
 									 subroots,
@@ -2032,6 +2040,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 										parse->canSetTag,
 										parse->resultRelation,
 										NIL,
+										false,
 										list_make1_int(parse->resultRelation),
 										list_make1(path),
 										list_make1(root),
@@ -6062,10 +6071,15 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
  *		Returns a list of the RT indexes of the partitioned child relations
  *		with rti as the root parent RT index.
  *
+ * If all_part_cols_p is non-NULL, *all_part_cols_p is set to a bitmapset
+ * of all partitioning columns used by the partitioned table or any
+ * descendant.
+ *
  * Note: Only call this function on RTEs known to be partitioned tables.
  */
 List *
-get_partitioned_child_rels(PlannerInfo *root, Index rti)
+get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   Bitmapset **all_part_cols_p)
 {
 	List	   *result = NIL;
 	ListCell   *l;
@@ -6077,6 +6091,8 @@ get_partitioned_child_rels(PlannerInfo *root, Index rti)
 		if (pc->parent_relid == rti)
 		{
 			result = pc->child_rels;
+			if (all_part_cols_p)
+				*all_part_cols_p = pc->all_part_cols;
 			break;
 		}
 	}
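
The planner-side decision reduces to a bitmapset overlap test between the updated columns and the partitioning columns collected from the whole tree. A minimal sketch (hypothetical names; 64-bit masks stand in for Bitmapsets) of how part_cols_updated gets decided:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Analogue of bms_overlap(all_part_cols, parent_rte->updatedCols): true
 * when the UPDATE touches at least one partitioning column of the table
 * or of any descendant partitioned table.
 */
static bool
part_cols_updated(uint64_t all_part_cols, uint64_t updated_cols)
{
	return (all_part_cols & updated_cols) != 0;
}
```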
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index cf46b74..2e6fde7 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -33,6 +33,7 @@
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
+#include "catalog/partition.h"
 #include "catalog/pg_inherits_fn.h"
 #include "catalog/pg_type.h"
 #include "miscadmin.h"
@@ -99,6 +100,8 @@ static List *generate_append_tlist(List *colTypes, List *colCollations,
 static List *generate_setop_grouplist(SetOperationStmt *op, List *targetlist);
 static void expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte,
 						 Index rti);
+static Relation get_next_child(Relation oldrelation, ListCell **cell,
+						PartitionWalker *walker);
 static void make_inh_translation_list(Relation oldrelation,
 						  Relation newrelation,
 						  Index newvarno,
@@ -1370,13 +1373,16 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	Oid			parentOID;
 	PlanRowMark *oldrc;
 	Relation	oldrelation;
+	Relation	newrelation;
 	LOCKMODE	lockmode;
 	List	   *inhOIDs;
 	List	   *appinfos;
-	ListCell   *l;
+	ListCell   *oids_cell;
 	bool		need_append;
 	PartitionedChildRelInfo *pcinfo;
+	PartitionWalker walker;
 	List	   *partitioned_child_rels = NIL;
+	Bitmapset  *all_part_cols = NULL;
 
 	/* Does RT entry allow inheritance? */
 	if (!rte->inh)
@@ -1449,20 +1455,41 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	/* Scan the inheritance set and expand it */
 	appinfos = NIL;
 	need_append = false;
-	foreach(l, inhOIDs)
+	newrelation = oldrelation;
+
+	/* For non-partitioned result-rels, open the first child from inhOIDs */
+	if (oldrelation->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+	{
+		oids_cell = list_head(inhOIDs);
+		newrelation = get_next_child(oldrelation, &oids_cell, &walker);
+	}
+	else
 	{
-		Oid			childOID = lfirst_oid(l);
-		Relation	newrelation;
+		/*
+		 * For partitioned result rels, we don't need the inhOIDs list
+		 * itself, because we traverse the partition tree in canonical order
+		 * anyway; but we do want to lock all the children in a consistent
+		 * order (see find_inheritance_children), so as to avoid unnecessary
+		 * deadlocks; hence the call to find_all_inheritors() above. The aim
+		 * is to generate the appinfos in canonical order, so that the
+		 * result rels generated later are in the same order as the leaf
+		 * partitions maintained during insert/update tuple routing. Keeping
+		 * the same order speeds up searching for a given leaf partition in
+		 * those result rels.
+		 */
+		list_free(inhOIDs);
+		inhOIDs = NIL;
+		partition_walker_init(&walker, oldrelation);
+	}
+
+	for (; newrelation != NULL;
+		   newrelation = get_next_child(oldrelation, &oids_cell, &walker))
+	{
+		Oid			childOID = RelationGetRelid(newrelation);
 		RangeTblEntry *childrte;
 		Index		childRTindex;
 		AppendRelInfo *appinfo;
 
-		/* Open rel if needed; we already have required locks */
-		if (childOID != parentOID)
-			newrelation = heap_open(childOID, NoLock);
-		else
-			newrelation = oldrelation;
-
 		/*
 		 * It is possible that the parent table has children that are temp
 		 * tables of other backends.  We cannot safely access such tables
@@ -1535,8 +1562,12 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 			}
 		}
 		else
+		{
 			partitioned_child_rels = lappend_int(partitioned_child_rels,
 												 childRTindex);
+			pull_child_partition_columns(&all_part_cols, newrelation,
+										 oldrelation);
+		}
 
 		/*
 		 * Build a PlanRowMark if parent is marked FOR UPDATE/SHARE.
@@ -1604,6 +1635,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 		Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
 		pcinfo->parent_relid = rti;
 		pcinfo->child_rels = partitioned_child_rels;
+		pcinfo->all_part_cols = all_part_cols;
 		root->pcinfo_list = lappend(root->pcinfo_list, pcinfo);
 	}
 
@@ -1612,6 +1644,44 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 }
 
 /*
+ * Get the next child in an inheritance tree.
+ *
+ * This function handles two different kinds of traversal. If oldrelation is
+ * a partitioned table, 'walker' is valid and the partitions are visited in
+ * canonical order. Otherwise, 'cell' points to the current position in the
+ * already-prepared list of inheritance children, which is ordered as per the
+ * pg_inherits scan.
+ *
+ * oldrelation is the root relation of the inheritance tree; it also
+ * determines which traversal mode applies.
+ */
+static Relation
+get_next_child(Relation oldrelation, ListCell **cell, PartitionWalker *walker)
+{
+	if (oldrelation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		return partition_walker_next(walker, NULL);
+	else
+	{
+		Oid		childOID;
+
+		if (!*cell)
+			return NULL; /* We are done with the list */
+
+		childOID = lfirst_oid(*cell);
+
+		/* Prepare to get the next child. */
+		*cell = lnext(*cell);
+
+		/* If it's the root relation, it is already open */
+		if (childOID != RelationGetRelid(oldrelation))
+			return heap_open(childOID, NoLock);
+		else
+			return oldrelation;
+	}
+}
+
+/*
  * make_inh_translation_list
  *	  Build the list of translations from parent Vars to child Vars for
  *	  an inheritance child.
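
As a rough illustration of what get_next_child() unifies — walker-driven canonical order for partitioned tables versus the prepared inhOIDs list for plain inheritance — here is a hedged sketch (hypothetical names; ints stand in for OIDs, and both modes collapse to consuming a precomputed array, since the real partition_walker_next() is not reproduced here):

```c
#include <assert.h>

typedef struct
{
	const int  *oids;			/* canonical-order partitions, or inhOIDs */
	int			noids;
	int			pos;			/* cursor, analogous to 'cell'/'walker' */
} child_iter;

/*
 * Returns the next child OID, or 0 once the traversal is exhausted.  The
 * calling loop is identical for both traversal modes, which is the point
 * of folding them behind one accessor.
 */
static int
next_child(child_iter *it)
{
	if (it->pos >= it->noids)
		return 0;
	return it->oids[it->pos++];
}
```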
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index f2d6385..f63edf4 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3161,6 +3161,8 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partitioned_rels' is an integer list of RT indexes of non-leaf tables in
  *		the partition tree, if this is an UPDATE/DELETE to a partitioned table.
  *		Otherwise NIL.
+ * 'part_cols_updated' is true if any partitioning columns are being updated,
+ *		either in the named relation or in a descendant partitioned table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
  * 'subpaths' is a list of Path(s) producing source data (one per rel)
  * 'subroots' is a list of PlannerInfo structs (one per rel)
@@ -3174,6 +3176,7 @@ ModifyTablePath *
 create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
@@ -3241,6 +3244,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->canSetTag = canSetTag;
 	pathnode->nominalRelation = nominalRelation;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
+	pathnode->part_cols_updated = part_cols_updated;
 	pathnode->resultRelations = resultRelations;
 	pathnode->subpaths = subpaths;
 	pathnode->subroots = subroots;
diff --git a/src/backend/rewrite/rewriteManip.c b/src/backend/rewrite/rewriteManip.c
index ba706b2..ab72b36 100644
--- a/src/backend/rewrite/rewriteManip.c
+++ b/src/backend/rewrite/rewriteManip.c
@@ -1224,6 +1224,7 @@ typedef struct
 	/* Target type when converting whole-row vars */
 	Oid			to_rowtype;
 	bool	   *found_whole_row;	/* output flag */
+	bool		coerced_var;	/* var is under ConvertRowtypeExpr */
 } map_variable_attnos_context;
 
 static Node *
@@ -1267,22 +1268,29 @@ map_variable_attnos_mutator(Node *node,
 					/* Don't convert unless necessary. */
 					if (context->to_rowtype != var->vartype)
 					{
-						ConvertRowtypeExpr *r;
-
 						/* Var itself is converted to the requested type. */
 						newvar->vartype = context->to_rowtype;
 
 						/*
-						 * And a conversion node on top to convert back to the
-						 * original type.
+						 * If this var is already under a ConvertRowtypeExpr,
+						 * we don't have to add another one.
 						 */
-						r = makeNode(ConvertRowtypeExpr);
-						r->arg = (Expr *) newvar;
-						r->resulttype = var->vartype;
-						r->convertformat = COERCE_IMPLICIT_CAST;
-						r->location = -1;
-
-						return (Node *) r;
+						if (!context->coerced_var)
+						{
+							ConvertRowtypeExpr *r;
+
+							/*
+							 * And a conversion node on top to convert back to
+							 * the original type.
+							 */
+							r = makeNode(ConvertRowtypeExpr);
+							r->arg = (Expr *) newvar;
+							r->resulttype = var->vartype;
+							r->convertformat = COERCE_IMPLICIT_CAST;
+							r->location = -1;
+
+							return (Node *) r;
+						}
 					}
 				}
 			}
@@ -1290,6 +1298,28 @@ map_variable_attnos_mutator(Node *node,
 		}
 		/* otherwise fall through to copy the var normally */
 	}
+	else if (IsA(node, ConvertRowtypeExpr))
+	{
+		ConvertRowtypeExpr *r = (ConvertRowtypeExpr *) node;
+
+		/*
+		 * If this is coercing a var (which is typical), convert only the var,
+		 * rather than adding another ConvertRowtypeExpr on top of it.
+		 */
+		if (IsA(r->arg, Var))
+		{
+			ConvertRowtypeExpr *newnode;
+
+			newnode = (ConvertRowtypeExpr *) palloc(sizeof(ConvertRowtypeExpr));
+			*newnode = *r;
+			context->coerced_var = true;
+			newnode->arg = (Expr *) map_variable_attnos_mutator((Node *) r->arg, context);
+			context->coerced_var = false;
+
+			return (Node *) newnode;
+		}
+		/* Else fall through to the expression tree mutator */
+	}
 	else if (IsA(node, Query))
 	{
 		/* Recurse into RTE subquery or not-yet-planned sublink subquery */
@@ -1321,6 +1351,7 @@ map_variable_attnos(Node *node,
 	context.map_length = map_length;
 	context.to_rowtype = to_rowtype;
 	context.found_whole_row = found_whole_row;
+	context.coerced_var = false;
 
 	*found_whole_row = false;
 
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 434ded3..e86c681 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -68,6 +68,12 @@ typedef struct PartitionDispatchData
 	int		   *indexes;
 } PartitionDispatchData;
 
+typedef struct PartitionWalker
+{
+	List	   *rels_list;
+	ListCell   *cur_cell;
+} PartitionWalker;
+
 typedef struct PartitionDispatchData *PartitionDispatch;
 
 extern void RelationBuildPartitionDesc(Relation relation);
@@ -79,12 +85,16 @@ extern void check_new_partition_bound(char *relname, Relation parent,
 extern Oid	get_partition_parent(Oid relid);
 extern List *get_qual_from_partbound(Relation rel, Relation parent,
 						PartitionBoundSpec *spec);
-extern List *map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+extern List *map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
 
+extern void partition_walker_init(PartitionWalker *walker, Relation rel);
+extern Relation partition_walker_next(PartitionWalker *walker,
+									  Relation *parent);
+
 /* For tuple routing */
 extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
 								 int lockmode, int *num_parted,
@@ -99,4 +109,8 @@ extern int get_partition_for_tuple(PartitionDispatch *pd,
 						EState *estate,
 						PartitionDispatchData **failed_at,
 						TupleTableSlot **failed_slot);
+extern void pull_child_partition_columns(Bitmapset **bitmapset,
+							 Relation rel,
+							 Relation parent);
+
 #endif							/* PARTITION_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 60326f9..6c58694 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,7 +187,10 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
@@ -207,9 +210,11 @@ extern void EvalPlanQualSetTuple(EPQState *epqstate, Index rti,
 					 HeapTuple tuple);
 extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
 extern void ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   PartitionDispatch **pd,
-							   ResultRelInfo **partitions,
+							   ResultRelInfo ***partitions,
 							   TupleConversionMap ***tup_conv_maps,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions);
@@ -217,6 +222,8 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot, EState *estate);
 
 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
 extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 35c28a6..666abec 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -508,6 +508,11 @@ typedef struct EState
 	struct dsa_area *es_query_dsa;
 } EState;
 
+/* For a given result relation, get its columns being inserted/updated. */
+#define GetInsertedColumns(relinfo, estate) \
+	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->insertedCols)
+#define GetUpdatedColumns(relinfo, estate) \
+	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)
 
 /*
  * ExecRowMark -
@@ -975,14 +980,31 @@ typedef struct ModifyTableState
 	int			mt_num_dispatch;	/* Number of entries in the above array */
 	int			mt_num_partitions;	/* Number of members in the following
 									 * arrays */
-	ResultRelInfo *mt_partitions;	/* Per partition result relation */
-	TupleConversionMap **mt_partition_tupconv_maps;
-	/* Per partition tuple conversion map */
+	ResultRelInfo **mt_partitions;	/* Per partition result relation pointers */
+
+	/*
+	 * Per partition conversion map to convert tuples from root to leaf
+	 * partition
+	 */
+	TupleConversionMap **mt_perleaf_parentchild_maps;
+
+	/*
+	 * Per partition conversion map to convert tuples from leaf partition to
+	 * root
+	 */
+	TupleConversionMap **mt_perleaf_childparent_maps;
+
+	/*
+	 * Per subplan conversion map to convert tuples from leaf partition to
+	 * root partitioned table
+	 */
+	TupleConversionMap **mt_persubplan_childparent_maps;
+
 	TupleTableSlot *mt_partition_tuple_slot;
+	TupleTableSlot *mt_rootpartition_tuple_slot;
+
 	struct TransitionCaptureState *mt_transition_capture;
 									/* controls transition table population */
-	TupleConversionMap **mt_transition_tupconv_maps;
-									/* Per plan/partition tuple conversion */
 } ModifyTableState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 7c51e7f..14dcd7d 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -219,6 +219,7 @@ typedef struct ModifyTable
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	int			resultRelIndex; /* index of first resultRel in plan's list */
 	int			rootResultRelIndex; /* index of the partitioned table root */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 9bae3c6..3013964 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1579,6 +1579,7 @@ typedef struct ModifyTablePath
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	List	   *subpaths;		/* Path(s) producing source data */
 	List	   *subroots;		/* per-target-table PlannerInfos */
@@ -2019,6 +2020,10 @@ typedef struct AppendRelInfo
  * The child_rels list must contain at least one element, because the parent
  * partitioned table is itself counted as a child.
  *
+ * all_part_cols contains all attribute numbers from the parent that are
+ * used as partitioning columns by the parent or some descendant that is
+ * itself partitioned.
+ *
  * These structs are kept in the PlannerInfo node's pcinfo_list.
  */
 typedef struct PartitionedChildRelInfo
@@ -2027,6 +2032,7 @@ typedef struct PartitionedChildRelInfo
 
 	Index		parent_relid;
 	List	   *child_rels;
+	Bitmapset  *all_part_cols;
 } PartitionedChildRelInfo;
 
 /*
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 0c0549d..d35f448 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -235,6 +235,7 @@ extern ModifyTablePath *create_modifytable_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index 2a4cf71..c6c15c5 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -57,6 +57,7 @@ extern Expr *preprocess_phv_expression(PlannerInfo *root, Expr *expr);
 
 extern bool plan_cluster_use_sort(Oid tableOid, Oid indexOid);
 
-extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti);
+extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti,
+										Bitmapset **all_part_cols_p);
 
 #endif							/* PLANNER_H */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 9366f04..067eee6 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,25 +198,430 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+---------------------------
+-- UPDATE with row movement
+---------------------------
+-- update to a partition should check the partition bound constraint for the
+-- new tuple. If the partition key is updated, the row should be moved to the
+-- appropriate partition. Updatable views using partitions should enforce the
+-- check options for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c numeric
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+-- Create partitions intentionally in descending bound order, so as to test
+-- that the subplans are ordered in ascending bound order rather than by OID.
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (c numeric, a text, b int);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+create table part_c_1_100 (b int, c numeric, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+insert into range_parted values ('a', 1, NULL), ('a', 10, 200), ('b', 12, 96), ('b', 13, 97), ('b', 15, 105), ('b', 17, 105);
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 105
+ part_c_100_200 | b | 17 | 105
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+(6 rows)
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+             QUERY PLAN              
+-------------------------------------
+ Update on range_parted
+   Update on part_a_1_a_10
+   Update on part_a_10_a_20
+   Update on part_b_1_b_10
+   Update on part_c_1_100
+   Update on part_c_100_200
+   ->  Seq Scan on part_a_1_a_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_a_10_a_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_1_b_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_1_100
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_100_200
+         Filter: (c > '97'::numeric)
+(16 rows)
+
+-- fail (row movement happens only within the partition subtree):
+update part_c_1_100 set c = c + 20 where c = 96;
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (12, 116, b).
+-- No row found:
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+select a, b, c from part_c_1_100 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select a, b, c from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 12 | 116
+ b | 13 | 117
+ b | 15 | 125
+ b | 17 | 125
+(4 rows)
+
+-- fail (row movement happens only within the partition subtree):
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (117, b, 7).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_1_a_10  | a |  1 |    
+ part_a_1_a_10  | a |  4 | 200
+ part_b_1_b_10  | b |  7 | 117
+ part_b_1_b_10  | b |  9 | 125
+ part_c_100_200 | b | 11 | 125
+ part_c_100_200 | b | 12 | 116
+(6 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120).
+-- succeeds, row movement, check option passes
+update upview set a = 'b', b = 15 where b = 4;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_1_a_10  | a |  1 |    
+ part_b_1_b_10  | b |  7 | 117
+ part_b_1_b_10  | b |  9 | 125
+ part_c_100_200 | b | 11 | 125
+ part_c_100_200 | b | 12 | 116
+ part_c_100_200 | b | 15 | 199
+(6 rows)
+
 -- cleanup
-drop table range_parted;
+drop view upview;
+-- RETURNING with whole-row vars.
+----------------------------------
+truncate range_parted;
+insert into range_parted values ('a', 1, NULL), ('a', 10, 200), ('b', 12, 96), ('b', 13, 97), ('b', 15, 105), ('b', 17, 105);
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+ range_parted | a | b  | c  
+--------------+---+----+----
+ (b,15,95)    | b | 15 | 95
+ (b,17,95)    | b | 17 | 95
+(2 rows)
+
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+ part_c_1_100   | b | 15 |  95
+ part_c_1_100   | b | 17 |  95
+(6 rows)
+
+-- Transition tables with update row movement
+---------------------------------------------
+truncate range_parted;
+insert into range_parted values ('a', 1, NULL), ('a', 10, 200), ('b', 12, 96), ('b', 13, 97), ('b', 15, 105), ('b', 17, 105);
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 105
+ part_c_100_200 | b | 17 | 105
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+(6 rows)
+
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96), (b,13,97), (b,15,105), (b,17,105), new table = (b,12,110), (b,13,98), (b,15,106), (b,17,106)
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 12 | 110
+ part_c_100_200 | b | 15 | 106
+ part_c_100_200 | b | 17 | 106
+ part_c_1_100   | b | 13 |  98
+(6 rows)
+
+truncate range_parted ;
+insert into range_parted values ('a', 1, NULL), ('a', 10, 200), ('b', 12, 96), ('b', 13, 97), ('b', 15, 105), ('b', 17, 105);
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96), (b,13,97), (b,15,105), (b,17,105), new table = (b,12,146), (b,13,147), (b,15,155), (b,17,155)
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 12 | 146
+ part_c_100_200 | b | 13 | 147
+ part_c_100_200 | b | 15 | 155
+ part_c_100_200 | b | 17 | 155
+(6 rows)
+
+drop trigger trans_updatetrig ON range_parted;
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_c100_200 before update or insert on part_c_100_200
+   for each row execute procedure func_parted_mod_b();
+truncate range_parted;
+insert into range_parted values ('a', 1, NULL), ('a', 10, 200), ('b', 12, 96), ('b', 13, 97), ('b', 15, 105), ('b', 17, 105);
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 110
+ part_c_100_200 | b | 17 | 106
+ part_c_100_200 | b | 19 | 106
+ part_c_1_100   | b | 15 |  98
+(6 rows)
+
+truncate range_parted ;
+insert into range_parted values ('a', 1, NULL), ('a', 10, 200), ('b', 12, 96), ('b', 13, 97), ('b', 15, 105), ('b', 17, 105);
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 146
+ part_c_100_200 | b | 16 | 147
+ part_c_100_200 | b | 17 | 155
+ part_c_100_200 | b | 19 | 155
+(6 rows)
+
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_c100_200 ON part_c_100_200;
+drop function func_parted_mod_b();
+-- statement triggers with update row movement
+---------------------------------------------------
+truncate range_parted;
+insert into range_parted values ('a', 1, NULL), ('a', 10, 200), ('b', 12, 96), ('b', 13, 97), ('b', 15, 105), ('b', 17, 105);
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 105
+ part_c_100_200 | b | 17 | 105
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+(6 rows)
+
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_100_200
+create trigger c100_delete_trig
+  after delete on part_c_100_200 for each statement execute procedure trigfunc();
+create trigger c100_update_trig
+  after update on part_c_100_200 for each statement execute procedure trigfunc();
+create trigger c100_insert_trig
+  after insert on part_c_100_200 for each statement execute procedure trigfunc();
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+NOTICE:  trigger = parent_update_trig fired on table range_parted during UPDATE
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 150
+ part_a_1_a_10  | a |  1 |    
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+ part_c_1_100   | b | 15 |  55
+ part_c_1_100   | b | 17 |  55
+(6 rows)
+
+drop table mintab, range_parted CASCADE;
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+ERROR:  new row for relation "sub_part2" violates partition constraint
+DETAIL:  Failing row contains (2, 10, 2).
+-- UPDATE that does not modify the partition key of the partitions chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Triggers can cause UPDATE row movement if they modify the partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+drop trigger parted_mod_b ON sub_part1 ;
+-- If a BR DELETE trigger prevents the DELETE from happening, we should also
+-- skip the INSERT when that DELETE is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+drop trigger trig_skip_delete ON sub_part1 ;
+-- UPDATE the partition key with a FROM clause. If the join produces multiple
+-- output rows for the same row to be modified, we should tuple-route the row
+-- only once. There should not be any duplicate rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 |  1 | 70
+ list_part1 | 2 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop table non_parted;
+drop function func_parted_mod_b ( ) ;
+drop table list_parted;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 6637119..7e1aaf7 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,23 +107,258 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+
+---------------------------
+-- UPDATE with row movement
+---------------------------
+
+-- update to a partition should check the partition bound constraint for the
+-- new tuple. If the partition key is updated, the row should be moved to the
+-- appropriate partition. Updatable views using partitions should enforce the
+-- check options for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c numeric
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+
+-- Create partitions intentionally in descending bound order, so as to test
+-- that the subplans are ordered in ascending bound order rather than by OID.
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (c numeric, a text, b int);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+create table part_c_1_100 (b int, c numeric, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+
+insert into range_parted values ('a', 1, NULL), ('a', 10, 200), ('b', 12, 96), ('b', 13, 97), ('b', 15, 105), ('b', 17, 105);
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96;
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+select a, b, c from part_c_1_100 order by 1, 2, 3;
+select a, b, c from part_c_100_200 order by 1, 2, 3;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with a subset of rows moved into a different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement, check option passes
+update upview set a = 'b', b = 15 where b = 4;
+
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
 
 -- cleanup
-drop table range_parted;
+drop view upview;
+
+-- RETURNING having whole-row vars.
+----------------------------------
+truncate range_parted;
+insert into range_parted values ('a', 1, NULL), ('a', 10, 200), ('b', 12, 96), ('b', 13, 97), ('b', 15, 105), ('b', 17, 105);
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+
+-- Transition tables with update row movement
+---------------------------------------------
+truncate range_parted;
+insert into range_parted values ('a', 1, NULL), ('a', 10, 200), ('b', 12, 96), ('b', 13, 97), ('b', 15, 105), ('b', 17, 105);
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+truncate range_parted ;
+insert into range_parted values ('a', 1, NULL), ('a', 10, 200), ('b', 12, 96), ('b', 13, 97), ('b', 15, 105), ('b', 17, 105);
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+drop trigger trans_updatetrig ON range_parted;
+
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_c100_200 before update or insert on part_c_100_200
+   for each row execute procedure func_parted_mod_b();
+truncate range_parted;
+insert into range_parted values ('a', 1, NULL), ('a', 10, 200), ('b', 12, 96), ('b', 13, 97), ('b', 15, 105), ('b', 17, 105);
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+truncate range_parted ;
+insert into range_parted values ('a', 1, NULL), ('a', 10, 200), ('b', 12, 96), ('b', 13, 97), ('b', 15, 105), ('b', 17, 105);
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_c100_200 ON part_c_100_200;
+drop function func_parted_mod_b();
+
+
+-- statement triggers with update row movement
+---------------------------------------------------
+
+truncate range_parted;
+insert into range_parted values ('a', 1, NULL), ('a', 10, 200), ('b', 12, 96), ('b', 13, 97), ('b', 15, 105), ('b', 17, 105);
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_100_200
+create trigger c100_delete_trig
+  after delete on part_c_100_200 for each statement execute procedure trigfunc();
+create trigger c100_update_trig
+  after update on part_c_100_200 for each statement execute procedure trigfunc();
+create trigger c100_insert_trig
+  after insert on part_c_100_200 for each statement execute procedure trigfunc();
+
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+drop table mintab, range_parted CASCADE;
+
+
+
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+
+
+-----------
+-- Triggers can cause UPDATE row movement if they modify the partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger parted_mod_b ON sub_part1 ;
+
+-- If a BR DELETE trigger prevents the DELETE from happening, we should also
+-- skip the INSERT when that DELETE is part of UPDATE => DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger trig_skip_delete ON sub_part1 ;
+
+-- UPDATE of partition key with a FROM clause. If the join produces multiple
+-- output rows for the same row to be modified, we should tuple-route the row
+-- only once, so no duplicate rows should be inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+drop table non_parted;
+
+drop function func_parted_mod_b ( ) ;
+drop table list_parted;
#142Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Khandekar (#141)
Re: UPDATE of partition key

On Fri, Aug 11, 2017 at 10:44 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 4 August 2017 at 22:28, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

I am planning to review and test this patch, Seems like this patch
needs to be rebased.

[dilip@localhost postgresql]$ patch -p1 <
../patches/update-partition-key_v15.patch
patching file doc/src/sgml/ddl.sgml
patching file doc/src/sgml/ref/update.sgml
patching file doc/src/sgml/trigger.sgml
patching file src/backend/catalog/partition.c
Hunk #3 succeeded at 910 (offset -1 lines).
Hunk #4 succeeded at 924 (offset -1 lines).
Hunk #5 succeeded at 934 (offset -1 lines).
Hunk #6 succeeded at 994 (offset -1 lines).
Hunk #7 succeeded at 1009 with fuzz 1 (offset 3 lines).
Hunk #8 FAILED at 1023.
Hunk #9 succeeded at 1059 with fuzz 2 (offset 10 lines).
Hunk #10 succeeded at 2069 (offset 2 lines).
Hunk #11 succeeded at 2406 (offset 2 lines).
1 out of 11 hunks FAILED -- saving rejects to file
src/backend/catalog/partition.c.rej
patching file src/backend/commands/copy.c
Hunk #2 FAILED at 1426.
Hunk #3 FAILED at 1462.
Hunk #4 succeeded at 2616 (offset 7 lines).
Hunk #5 succeeded at 2726 (offset 8 lines).
Hunk #6 succeeded at 2846 (offset 8 lines).
2 out of 6 hunks FAILED -- saving rejects to file
src/backend/commands/copy.c.rej
patching file src/backend/commands/trigger.c
Hunk #4 succeeded at 5261 with fuzz 2.
patching file src/backend/executor/execMain.c
Hunk #1 succeeded at 65 (offset 1 line).
Hunk #2 succeeded at 103 (offset 1 line).
Hunk #3 succeeded at 1829 (offset 20 lines).
Hunk #4 succeeded at 1860 (offset 20 lines).
Hunk #5 succeeded at 1927 (offset 20 lines).
Hunk #6 succeeded at 2044 (offset 21 lines).
Hunk #7 FAILED at 3210.
Hunk #8 FAILED at 3244.
Hunk #9 succeeded at 3289 (offset 26 lines).
Hunk #10 FAILED at 3340.
Hunk #11 succeeded at 3387 (offset 29 lines).
Hunk #12 succeeded at 3424 (offset 29 lines).
3 out of 12 hunks FAILED -- saving rejects to file
src/backend/executor/execMain.c.rej
patching file src/backend/executor/execReplication.c
patching file src/backend/executor/nodeModifyTable.c

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#143Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Dilip Kumar (#142)
Re: UPDATE of partition key

Thanks Dilip. I am working on rebasing the patch. Particularly, the
partition walker in my patch depended on the fact that all the tables
get opened (and then closed) while creating the tuple routing info.
But in HEAD, now only the partitioned tables get opened. So need some
changes in my patch.

The partition walker related changes are going to be inapplicable once
the other thread [1] commits the changes for expansion of inheritance
in bound-order, but till then I would have to rebase the partition
walker changes over HEAD.

[1]: /messages/by-id/0118a1f2-84bb-19a7-b906-dec040a206f2@lab.ntt.co.jp
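
To make the bound-ordered expansion concrete, here is a minimal standalone
sketch (plain Python with invented names — not PostgreSQL code) of a
depth-first walk that descends only into partitioned tables and so emits the
leaf partitions in bound order, using the partition tree from the update.sql
tests earlier in this thread:

```python
def expand_partition_tree(rel, children):
    """Collect leaf partitions depth-first.

    `children` maps each partitioned table to its partitions already
    sorted in partition-bound order; leaf partitions are simply absent
    from the map.
    """
    out = []
    for child in children.get(rel, []):
        if child in children:
            # Partitioned table: descend into it, do not emit it as a leaf.
            out.extend(expand_partition_tree(child, children))
        else:
            out.append(child)
    return out

# Partition tree built by the regression tests above: range_parted has
# three leaf children plus part_b_10_b_20, itself partitioned by (c).
tree = {
    "range_parted": ["part_a_1_a_10", "part_a_10_a_20",
                     "part_b_1_b_10", "part_b_10_b_20"],
    "part_b_10_b_20": ["part_c_1_100", "part_c_100_200"],
}

assert expand_partition_tree("range_parted", tree) == [
    "part_a_1_a_10", "part_a_10_a_20", "part_b_1_b_10",
    "part_c_1_100", "part_c_100_200",
]
```

Note that the intermediate partitioned table never appears in the result;
only leaf partitions do, which mirrors why a walker that relied on every
table being opened needs rework once only partitioned tables are opened.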

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


#144Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#143)
Re: UPDATE of partition key

On 31 August 2017 at 14:15, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Thanks Dilip. I am working on rebasing the patch. Particularly, the
partition walker in my patch depended on the fact that all the tables
get opened (and then closed) while creating the tuple routing info.
But in HEAD, now only the partitioned tables get opened. So need some
changes in my patch.

The partition walker related changes are going to be inapplicable once
the other thread [1] commits the changes for expansion of inheritance
in bound-order, but till then I would have to rebase the partition
walker changes over HEAD.

[1] /messages/by-id/0118a1f2-84bb-19a7-b906-dec040a206f2@lab.ntt.co.jp

After recent commit 30833ba154, now the partitions are expanded in
depth-first order. It didn't seem worthwhile rebasing my partition
walker changes onto the latest code. So in the attached patch, I have
removed all the partition walker changes. But
RelationGetPartitionDispatchInfo() traverses in breadth-first order,
which is different from the update result rels order (because
inheritance expansion order is depth-first). So, in order to make the
tuple-routing-related leaf partitions in the same order as that of the
update result rels, we would have to make changes in
RelationGetPartitionDispatchInfo(), which I am not sure whether it is
going to be done as part of the thread "expanding inheritance in
partition bound order" [1]. For now, in the attached patch, I have
reverted back to the hash table method to find the leaf partitions in
the update result rels.

[1]: /messages/by-id/CAJ3gD9eyudCNU6V-veMme+eyzfX_ey+gEzULMzOw26c3f9rzdg@mail.gmail.com
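
As a rough illustration of the ordering mismatch and of the hash table
workaround (a standalone Python sketch with invented names — not the patch
itself), compare depth-first inheritance expansion with breadth-first
dispatch over the same tree:

```python
from collections import deque

# Toy partition tree: the partitioned child deliberately precedes a leaf
# sibling in bound order, which is what makes the two traversals disagree.
TREE = {
    "range_parted": ["part_b_10_b_20", "part_a_1_a_10"],
    "part_b_10_b_20": ["part_c_1_100", "part_c_100_200"],
}

def leaves_depth_first(root):
    """Leaf partitions in depth-first order (inheritance expansion)."""
    out = []
    for child in TREE.get(root, []):
        if child in TREE:
            out.extend(leaves_depth_first(child))
        else:
            out.append(child)
    return out

def leaves_breadth_first(root):
    """Leaf partitions in breadth-first order (tuple-routing dispatch)."""
    out = []
    queue = deque([root])
    while queue:
        for child in TREE.get(queue.popleft(), []):
            if child in TREE:
                queue.append(child)
            else:
                out.append(child)
    return out

update_rels = leaves_depth_first("range_parted")    # UPDATE result rel order
routed_rels = leaves_breadth_first("range_parted")  # tuple-routing order
assert update_rels != routed_rels                   # the two orders disagree

# A hash keyed on relation identity lets each routed leaf find its
# already-built UPDATE result rel regardless of the ordering mismatch.
resultrel_index = {rel: i for i, rel in enumerate(update_rels)}
assert all(update_rels[resultrel_index[r]] == r for r in routed_rels)
```

The dictionary plays the role of the hash table mentioned above: an O(1)
lookup from a routed leaf partition to its result rel, sidestepping any
assumption that the two lists share an ordering.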

Thanks
-Amit Khandekar


#145Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Khandekar (#144)
Re: UPDATE of partition key

On Sun, Sep 3, 2017 at 5:10 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

After recent commit 30833ba154, now the partitions are expanded in
depth-first order. It didn't seem worthwhile rebasing my partition
walker changes onto the latest code. So in the attached patch, I have
removed all the partition walker changes.

It seems you have forgotten to attach the patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#146Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Kapila (#145)
1 attachment(s)
Re: UPDATE of partition key

On 4 September 2017 at 07:43, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Sep 3, 2017 at 5:10 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

After recent commit 30833ba154, now the partitions are expanded in
depth-first order. It didn't seem worthwhile rebasing my partition
walker changes onto the latest code. So in the attached patch, I have
removed all the partition walker changes.

It seems you have forgotten to attach the patch.

Oops sorry. Now attached.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

update-partition-key_v16.patch
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index b05a9c2..5a436a1 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -2993,6 +2993,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition, one whose partition constraint the new row satisfies.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3285,9 +3290,20 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</> causes a row to move from one partition to
+       another, there is a chance that another concurrent <command>UPDATE</> or
+       <command>DELETE</> misses this row. Suppose that, during the row
+       movement, the row is still visible to the concurrent session, which is
+       about to do an <command>UPDATE</> or <command>DELETE</> operation on
+       the same row. This DML operation can silently miss the row if it gets
+       deleted from the partition by the first session as part of its
+       <command>UPDATE</> row movement. In such a case, the concurrent
+       <command>UPDATE</>/<command>DELETE</>, being unaware of the row
+       movement, concludes that the row has just been deleted, so there is
+       nothing to be done for this row. Whereas, in the usual case where the
+       table is not partitioned, or where there is no row movement, the second
+       session would have identified the newly updated row and carried out the
+       <command>UPDATE</>/<command>DELETE</> on this new row version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 8a1619f..28cfc1a 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there is no such partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</> and
+   <command>INSERT</> operation. However, a concurrent
+   <command>UPDATE</> or <command>DELETE</> on the same row may miss
+   this row. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index 950245d..72300a0 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -160,6 +160,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it will be performed as a
+    <command>DELETE</command> from the original partition followed by
+    <command>INSERT</command> into the new partition. In this case, all
+    row-level <literal>BEFORE</> <command>UPDATE</command> triggers and all
+    row-level <literal>BEFORE</> <command>DELETE</command> triggers are fired
+    on the original partition. Then all row-level <literal>BEFORE</>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</>
+    triggers are concerned, <literal>AFTER</> <command>DELETE</command> and
+    <literal>AFTER</> <command>INSERT</command> triggers are applied; but
+    <literal>AFTER</> <command>UPDATE</command> triggers are not applied
+    because the <command>UPDATE</command> has been converted to a
+    <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 5016263..a1004a9 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -878,7 +878,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
 
 /*
  * map_partition_varattnos - maps varattno of any Vars in expr from the
- * parent attno to partition attno.
+ * attno's of 'from_rel' partition to the attno's of 'to_rel' partition.
+ * Either rel can be a leaf partition or a partitioned table.
  *
  * We must allow for cases where physical attnos of a partition can be
  * different from the parent's.
@@ -891,8 +892,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
  * are working on Lists, so it's less messy to do the casts internally.
  */
 List *
-map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row)
 {
 	AttrNumber *part_attnos;
@@ -901,14 +902,14 @@ map_partition_varattnos(List *expr, int target_varno,
 	if (expr == NIL)
 		return NIL;
 
-	part_attnos = convert_tuples_by_name_map(RelationGetDescr(partrel),
-											 RelationGetDescr(parent),
+	part_attnos = convert_tuples_by_name_map(RelationGetDescr(to_rel),
+											 RelationGetDescr(from_rel),
 											 gettext_noop("could not convert row type"));
 	expr = (List *) map_variable_attnos((Node *) expr,
-										target_varno, 0,
+										fromrel_varno, 0,
 										part_attnos,
-										RelationGetDescr(parent)->natts,
-										RelationGetForm(partrel)->reltype,
+										RelationGetDescr(from_rel)->natts,
+										RelationGetForm(to_rel)->reltype,
 										&my_found_whole_row);
 	if (found_whole_row)
 		*found_whole_row = my_found_whole_row;
@@ -2050,6 +2051,77 @@ error_exit:
 }
 
 /*
+ * For each column of rel which is in the partition key or which appears
+ * in an expression which is in the partition key, translate the attribute
+ * number of that column according to the given parent, and add the resulting
+ * column number to the bitmapset, offset as we frequently do by
+ * FirstLowInvalidHeapAttributeNumber.
+ */
+void
+pull_child_partition_columns(Bitmapset **bitmapset,
+							 Relation rel,
+							 Relation parent)
+{
+	PartitionKey key = RelationGetPartitionKey(rel);
+	int16		partnatts = get_partition_natts(key);
+	List	   *partexprs = get_partition_exprs(key);
+	ListCell   *lc;
+	Bitmapset  *child_keycols = NULL;
+	int			i;
+	AttrNumber *map;
+	int			child_keycol = -1;
+
+	/*
+	 * First, compute the complete set of partition columns for this rel. For
+	 * compatibility with the API exposed by pull_varattnos, we offset the
+	 * column numbers by FirstLowInvalidHeapAttributeNumber.
+	 */
+	for (i = 0; i < partnatts; i++)
+	{
+		AttrNumber	partattno = get_partition_col_attnum(key, i);
+
+		if (partattno != 0)
+			child_keycols =
+				bms_add_member(child_keycols,
+							   partattno - FirstLowInvalidHeapAttributeNumber);
+	}
+	foreach(lc, partexprs)
+	{
+		Node	   *expr = (Node *) lfirst(lc);
+
+		pull_varattnos(expr, 1, &child_keycols);
+	}
+
+	/*
+	 * Next, work out how to convert from the attribute numbers for the child
+	 * to the attribute numbers for the parent.
+	 */
+	map =
+		convert_tuples_by_name_map(RelationGetDescr(parent),
+								   RelationGetDescr(rel),
+								   gettext_noop("could not convert row type"));
+
+	/*
+	 * For each child key column we have identified, translate to the
+	 * corresponding parent key column.  Entry 0 in the map array corresponds
+	 * to attribute number 1, which corresponds to a bitmapset entry for 1 -
+	 * FirstLowInvalidHeapAttributeNumber.
+	 */
+	while ((child_keycol = bms_next_member(child_keycols, child_keycol)) >= 0)
+	{
+		int			kc = child_keycol + FirstLowInvalidHeapAttributeNumber;
+
+		Assert(kc > 0 && kc <= RelationGetNumberOfAttributes(rel));
+		*bitmapset =
+			bms_add_member(*bitmapset,
+						   map[kc - 1] - FirstLowInvalidHeapAttributeNumber);
+	}
+
+	/* Release memory. */
+	pfree(map);
+}
+
+/*
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfa3f05..4ac5bd6 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -168,7 +168,7 @@ typedef struct CopyStateData
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;	/* Number of entries in the above array */
 	int			num_partitions; /* Number of members in the following arrays */
-	ResultRelInfo *partitions;	/* Per partition result relation */
+	ResultRelInfo **partitions;	/* Per partition result relation pointers */
 	TupleConversionMap **partition_tupconv_maps;
 	TupleTableSlot *partition_tuple_slot;
 	TransitionCaptureState *transition_capture;
@@ -2446,13 +2446,15 @@ CopyFrom(CopyState cstate)
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo *partitions;
+		ResultRelInfo **partitions;
 		TupleConversionMap **partition_tupconv_maps;
 		TupleTableSlot *partition_tuple_slot;
 		int			num_parted,
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(cstate->rel,
+									   NULL,
+									   0,
 									   1,
 									   estate,
 									   &partition_dispatch_info,
@@ -2482,7 +2484,7 @@ CopyFrom(CopyState cstate)
 			for (i = 0; i < cstate->num_partitions; ++i)
 			{
 				cstate->transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(cstate->partitions[i].ri_RelationDesc),
+					convert_tuples_by_name(RelationGetDescr(cstate->partitions[i]->ri_RelationDesc),
 										   RelationGetDescr(cstate->rel),
 										   gettext_noop("could not convert row type"));
 			}
@@ -2616,7 +2618,7 @@ CopyFrom(CopyState cstate)
 			 * to the selected partition.
 			 */
 			saved_resultRelInfo = resultRelInfo;
-			resultRelInfo = cstate->partitions + leaf_part_index;
+			resultRelInfo = cstate->partitions[leaf_part_index];
 
 			/* We do not yet have a way to insert into a foreign partition */
 			if (resultRelInfo->ri_FdwRoutine)
@@ -2726,7 +2728,7 @@ CopyFrom(CopyState cstate)
 
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr || check_partition_constr)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
@@ -2846,7 +2848,7 @@ CopyFrom(CopyState cstate)
 		}
 		for (i = 0; i < cstate->num_partitions; i++)
 		{
-			ResultRelInfo *resultRelInfo = cstate->partitions + i;
+			ResultRelInfo *resultRelInfo = cstate->partitions[i];
 
 			ExecCloseIndices(resultRelInfo);
 			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index da0850b..b0fec14 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -67,15 +67,6 @@ int			SessionReplicationRole = SESSION_REPLICATION_ROLE_ORIGIN;
 /* How many levels deep into trigger execution are we? */
 static int	MyTriggerDepth = 0;
 
-/*
- * Note that similar macros also exist in executor/execMain.c.  There does not
- * appear to be any good header to put them into, given the structures that
- * they use, so we let them be duplicated.  Be sure to update all if one needs
- * to be changed, however.
- */
-#define GetUpdatedColumns(relinfo, estate) \
-	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)
-
 /* Local function prototypes */
 static void ConvertTriggerToFK(CreateTrigStmt *stmt, Oid funcoid);
 static void SetTriggerFlags(TriggerDesc *trigdesc, Trigger *trigger);
@@ -2903,8 +2894,13 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 	{
 		HeapTuple	trigtuple;
 
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
+		/*
+		 * Note: if the UPDATE is converted into a DELETE+INSERT as part of
+		 * an update-partition-key operation, this function is also called
+		 * separately for the DELETE and the INSERT, to capture transition
+		 * table rows.  In such a case, either the old tuple or the new
+		 * tuple can be NULL.
+		 */
+		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
 			trigtuple = GetTupleForTrigger(estate,
 										   NULL,
 										   relinfo,
@@ -5211,7 +5207,12 @@ AfterTriggerPendingOnRel(Oid relid)
  *	triggers actually need to be queued.  It is also called after each row,
  *	even if there are no triggers for that event, if there are any AFTER
  *	STATEMENT triggers for the statement which use transition tables, so that
- *	the transition tuplestores can be built.
+ *	the transition tuplestores can be built.  Furthermore, if transition
+ *	capture is happening for UPDATEd rows being moved to another partition
+ *	due to a partition-key change, then this function is called once when
+ *	the row is deleted (to capture the OLD row), and once when the row is
+ *	inserted into another partition (to capture the NEW row).  This is done
+ *	separately because the DELETE and the INSERT happen on different tables.
  *
  *	Transition tuplestores are built now, rather than when events are pulled
  *	off of the queue because AFTER ROW triggers are allowed to select from the
@@ -5260,12 +5261,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		bool		update_new_table = transition_capture->tcs_update_new_table;
+		bool		insert_new_table = transition_capture->tcs_insert_new_table;
 
-		if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_old_table))
+		if (oldtup != NULL &&
+			((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_old_table)))
 		{
 			Tuplestorestate *old_tuplestore;
 
-			Assert(oldtup != NULL);
 			old_tuplestore = transition_capture->tcs_old_tuplestore;
 
 			if (map != NULL)
@@ -5278,12 +5279,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			else
 				tuplestore_puttuple(old_tuplestore, oldtup);
 		}
-		if ((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_new_table))
+		if (newtup != NULL &&
+			((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_new_table)))
 		{
 			Tuplestorestate *new_tuplestore;
 
-			Assert(newtup != NULL);
 			if (event == TRIGGER_EVENT_INSERT)
 				new_tuplestore = transition_capture->tcs_insert_tuplestore;
 			else
@@ -5306,7 +5307,8 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		if (trigdesc == NULL ||
 			(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
 			(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+			(event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup == NULL)))
 			return;
 	}
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 2946a0e..7cbc4cb 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -65,6 +65,18 @@
 #include "utils/snapmgr.h"
 #include "utils/tqual.h"
 
+/*
+ * Entry of a temporary hash table. During UPDATE tuple routing, we want to
+ * know which of the leaf partitions are present in the UPDATE per-subplan
+ * resultRelInfo array (ModifyTableState->resultRelInfo[]). This hash table
+ * is searchable by the oids of the subplan result rels.
+ */
+typedef struct ResultRelOidsEntry
+{
+	Oid			rel_oid;
+	ResultRelInfo *resultRelInfo;
+} ResultRelOidsEntry;
+
 
 /* Hooks for plugins to get control in ExecutorStart/Run/Finish/End */
 ExecutorStart_hook_type ExecutorStart_hook = NULL;
@@ -104,19 +116,6 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
 									 int maxfieldlen);
 static void EvalPlanQualStart(EPQState *epqstate, EState *parentestate,
 				  Plan *planTree);
-static void ExecPartitionCheck(ResultRelInfo *resultRelInfo,
-				   TupleTableSlot *slot, EState *estate);
-
-/*
- * Note that GetUpdatedColumns() also exists in commands/trigger.c.  There does
- * not appear to be any good header to put it into, given the structures that
- * it uses, so we let them be duplicated.  Be sure to update both if one needs
- * to be changed, however.
- */
-#define GetInsertedColumns(relinfo, estate) \
-	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->insertedCols)
-#define GetUpdatedColumns(relinfo, estate) \
-	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)
 
 /* end of local decls */
 
@@ -1843,15 +1842,10 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
 /*
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  */
-static void
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
-	Relation	rel = resultRelInfo->ri_RelationDesc;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	Bitmapset  *modifiedCols;
-	Bitmapset  *insertedCols;
-	Bitmapset  *updatedCols;
 	ExprContext *econtext;
 
 	/*
@@ -1879,52 +1873,66 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
 	 */
-	if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+	return ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext);
+}
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
+
+	/* See the comments in ExecConstraints. */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
+									 gettext_noop("could not convert row type"));
+		if (map != NULL)
 		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
-										 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecSetSlotDescriptor(slot, tupdesc);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+			tuple = do_convert_tuple(tuple, map);
+			ExecSetSlotDescriptor(slot, tupdesc);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("new row for relation \"%s\" violates partition constraint",
-						RelationGetRelationName(orig_rel)),
-				 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+			 errmsg("new row for relation \"%s\" violates partition constraint",
+					RelationGetRelationName(orig_rel)),
+			 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if requested,
+ * checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1932,7 +1940,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2048,8 +2057,9 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 }
 
 
@@ -3235,34 +3245,40 @@ EvalPlanQualEnd(EPQState *epqstate)
  * ExecSetupPartitionTupleRouting - set up information needed during
  * tuple routing for partitioned tables
  *
+ * 'update_rri' contains the UPDATE per-subplan result rels.
+ * 'num_update_rri' is the number of UPDATE per-subplan result rels; for
+ *		INSERT, this is 0.
+ *
  * Output arguments:
  * 'pd' receives an array of PartitionDispatch objects with one entry for
  *		every partitioned table in the partition tree
- * 'partitions' receives an array of ResultRelInfo objects with one entry for
+ * 'partitions' receives an array of ResultRelInfo* objects with one entry for
  *		every leaf partition in the partition tree
- * 'tup_conv_maps' receives an array of TupleConversionMap objects with one
- *		entry for every leaf partition (required to convert input tuple based
- *		on the root table's rowtype to a leaf partition's rowtype after tuple
- *		routing is done)
+ * 'perleaf_parentchild_maps' receives an array of TupleConversionMap objects
+ *		with one entry for every leaf partition (required to convert an input
+ *		tuple based on the root table's rowtype to a leaf partition's rowtype
+ *		after tuple routing is done)
  * 'partition_tuple_slot' receives a standalone TupleTableSlot to be used
  *		to manipulate any given leaf partition's rowtype after that partition
  *		is chosen by tuple-routing.
  * 'num_parted' receives the number of partitioned tables in the partition
  *		tree (= the number of entries in the 'pd' output array)
  * 'num_partitions' receives the number of leaf partitions in the partition
- *		tree (= the number of entries in the 'partitions' and 'tup_conv_maps'
- *		output arrays
+ *		tree (= the number of entries in the 'partitions' and
+ *		'perleaf_parentchild_maps' output arrays
  *
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  */
 void
 ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionDispatch **pd,
-							   ResultRelInfo **partitions,
-							   TupleConversionMap ***tup_conv_maps,
+							   ResultRelInfo ***partitions,
+							   TupleConversionMap ***perleaf_parentchild_maps,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions)
 {
@@ -3270,7 +3286,10 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
+	HTAB	   *result_rel_oids = NULL;
+	HASHCTL		ctl;
+	ResultRelOidsEntry *hash_entry;
+	ResultRelInfo *leaf_part_arr;
 
 	/*
 	 * Get the information about the partition tree after locking all the
@@ -3279,10 +3298,50 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
 	*pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
 	*num_partitions = list_length(leaf_parts);
-	*partitions = (ResultRelInfo *) palloc(*num_partitions *
-										   sizeof(ResultRelInfo));
-	*tup_conv_maps = (TupleConversionMap **) palloc0(*num_partitions *
-													 sizeof(TupleConversionMap *));
+	*partitions = (ResultRelInfo **) palloc(*num_partitions *
+											sizeof(ResultRelInfo *));
+	*perleaf_parentchild_maps = (TupleConversionMap **) palloc0(*num_partitions *
+																sizeof(TupleConversionMap *));
+
+	/*
+	 * For UPDATEs, if the leaf partition is already present in the
+	 * per-subplan result rels, we re-use it rather than initialize a new
+	 * result rel.  To find out whether a given leaf partition already has a
+	 * result rel, we build a hash table over the per-subplan result rels,
+	 * keyed by relation OID.
+	 */
+	if (num_update_rri != 0)
+	{
+		ResultRelInfo	   *resultRelInfo;
+
+		memset(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(Oid);
+		ctl.entrysize = sizeof(ResultRelOidsEntry);
+		ctl.hcxt = CurrentMemoryContext;
+		result_rel_oids = hash_create("result_rel_oids temporary hash",
+								32, /* start small and extend */
+								&ctl,
+								HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+		resultRelInfo = update_rri;
+		for (i = 0; i < num_update_rri; i++, resultRelInfo++)
+		{
+			Oid reloid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+
+			hash_entry = hash_search(result_rel_oids, &reloid,
+									 HASH_ENTER, NULL);
+			hash_entry->resultRelInfo = resultRelInfo;
+		}
+	}
+	else
+	{
+		/*
+		 * For inserts, we need to create all new result rels, so avoid
+		 * repeated pallocs by allocating memory for all the result rels in
+		 * bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(*num_partitions *
+												  sizeof(ResultRelInfo));
+	}
 
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
@@ -3292,23 +3351,72 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	 */
 	*partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = *partitions;
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (num_update_rri != 0)
+		{
+			/*
+			 * If this leaf partition is already present in the per-subplan
+			 * resultRelInfos, re-use that resultRelInfo along with its
+			 * already-opened relation; otherwise create a new result rel.
+			 */
+			hash_entry = hash_search(result_rel_oids, &leaf_oid,
+									 HASH_FIND, NULL);
+			if (hash_entry != NULL)
+			{
+				leaf_part_rri = hash_entry->resultRelInfo;
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required when converting the tuple as per the root
+				 * partition's tuple descriptor; it was not set when the
+				 * UPDATE plans were generated.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = leaf_part_arr + i;
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in *partitions are eventually
-		 * closed by the caller.
+		 * If we didn't open the partition rel above, then we haven't
+		 * initialized the result rel either.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above including the leaf
+			 * partitions. Note that each of the newly opened relations in
+			 * *partitions are eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri,
+							  partrel,
+							  resultRTindex,
+							  rel,
+							  estate->es_instrument);
+		}
+
+		estate->es_leaf_result_relations =
+			lappend(estate->es_leaf_result_relations, leaf_part_rri);
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
-		 * Verify result relation is a valid target for the current operation.
+		 * Verify that the result relation is a valid target for an INSERT.
+		 * Even for an UPDATE, tuple routing effectively performs an INSERT
+		 * into the chosen partition, so we check INSERT validity here too.
 		 */
 		CheckValidResultRel(partrel, CMD_INSERT);
 
@@ -3316,17 +3424,8 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		 * Save a tuple conversion map to convert a tuple routed to this
 		 * partition from the parent's type to the partition's.
 		 */
-		(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
-													 gettext_noop("could not convert row type"));
-
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  resultRTindex,
-						  rel,
-						  estate->es_instrument);
-
-		estate->es_leaf_result_relations =
-			lappend(estate->es_leaf_result_relations, leaf_part_rri);
+		(*perleaf_parentchild_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
+																gettext_noop("could not convert row type"));
 
 		/*
 		 * Open partition indices (remember we do not support ON CONFLICT in
@@ -3337,9 +3436,12 @@ ExecSetupPartitionTupleRouting(Relation rel,
 			leaf_part_rri->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(leaf_part_rri, false);
 
-		leaf_part_rri++;
+		(*partitions)[i] = leaf_part_rri;
 		i++;
 	}
+
+	if (result_rel_oids != NULL)
+		hash_destroy(result_rel_oids);
 }
 
 /*
@@ -3365,8 +3467,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 	result = get_partition_for_tuple(pd, slot, estate,
 									 &failed_at, &failed_slot);
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index fbb8108..47afe09 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -401,7 +401,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -466,7 +466,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index e12721a..5bc4762 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -45,6 +45,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -53,7 +54,6 @@
 #include "utils/rel.h"
 #include "utils/tqual.h"
 
-
 static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 ResultRelInfo *resultRelInfo,
 					 ItemPointer conflictTid,
@@ -240,6 +240,36 @@ ExecCheckTIDVisible(EState *estate,
 	ReleaseBuffer(buffer);
 }
 
+/*
+ * ConvertPartitionTupleSlot -- convenience function for converting a tuple
+ * and storing it into the tuple slot provided through 'new_slot', which
+ * should typically be one of the dedicated partition tuple slots.  The
+ * partition tuple slot is passed back through the output parameter
+ * 'p_old_slot'.  If no mapping is present, 'p_old_slot' is left unchanged.
+ *
+ * Returns the converted tuple.
+ */
+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate, TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot, TupleTableSlot **p_old_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor, as per converted tuple.
+	 */
+	*p_old_slot = new_slot;
+	Assert(new_slot != NULL);
+	ExecSetSlotDescriptor(new_slot, map->outdesc);
+	ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
 /* ----------------------------------------------------------------
  *		ExecInsert
  *
@@ -281,17 +311,50 @@ ExecInsert(ModifyTableState *mtstate,
 	if (mtstate->mt_partition_dispatch_info)
 	{
 		int			leaf_part_index;
-		TupleConversionMap *map;
+		ResultRelInfo *rootResultRelInfo;
+
+		/*
+		 * If the original operation is UPDATE, the root partition rel needs
+		 * to be fetched from mtstate->rootResultRelInfo.
+		 */
+		rootResultRelInfo = (mtstate->rootResultRelInfo ?
+							 mtstate->rootResultRelInfo : resultRelInfo);
+
+		/*
+		 * If the resultRelInfo is not the root partition (which happens for
+		 * UPDATE), we should convert the tuple into the root partition's
+		 * tuple descriptor, since ExecFindPartition() starts the search from
+		 * the root.  The tuple conversion map list is in the order of
+		 * mtstate->resultRelInfo[], so to retrieve the map for this
+		 * resultRel, we need to know its position in
+		 * mtstate->resultRelInfo[].  Note: we assume that if the
+		 * resultRelInfo does not belong to the subplans, then it already
+		 * matches the root tuple descriptor, although no such scenario is
+		 * currently known.
+		 */
+		if (rootResultRelInfo != resultRelInfo &&
+			mtstate->mt_persubplan_childparent_maps != NULL &&
+			resultRelInfo >= mtstate->resultRelInfo &&
+			resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1)
+		{
+			int			map_index = resultRelInfo - mtstate->resultRelInfo;
+
+			tuple = ConvertPartitionTupleSlot(mtstate,
+											  mtstate->mt_persubplan_childparent_maps[map_index],
+											  tuple,
+											  mtstate->mt_rootpartition_tuple_slot,
+											  &slot);
+		}
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
 		 * ExecFindPartition() does not return and errors out instead.
 		 * Otherwise, the returned value is to be used as an index into arrays
-		 * mt_partitions[] and mt_partition_tupconv_maps[] that will get us
+		 * mt_partitions[] and mt_perleaf_parentchild_maps[] that will get us
 		 * the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
-		leaf_part_index = ExecFindPartition(resultRelInfo,
+		leaf_part_index = ExecFindPartition(rootResultRelInfo,
 											mtstate->mt_partition_dispatch_info,
 											slot,
 											estate);
@@ -303,7 +366,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * the selected partition.
 		 */
 		saved_resultRelInfo = resultRelInfo;
-		resultRelInfo = mtstate->mt_partitions + leaf_part_index;
+		resultRelInfo = mtstate->mt_partitions[leaf_part_index];
 
 		/* We do not yet have a way to insert into a foreign partition */
 		if (resultRelInfo->ri_FdwRoutine)
@@ -331,7 +394,7 @@ ExecInsert(ModifyTableState *mtstate,
 				 */
 				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
 				mtstate->mt_transition_capture->tcs_map =
-					mtstate->mt_transition_tupconv_maps[leaf_part_index];
+					mtstate->mt_perleaf_childparent_maps[leaf_part_index];
 			}
 			else
 			{
@@ -348,23 +411,11 @@ ExecInsert(ModifyTableState *mtstate,
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = mtstate->mt_partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = mtstate->mt_partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(mtstate,
+										  mtstate->mt_perleaf_parentchild_maps[leaf_part_index],
+										  tuple,
+										  mtstate->mt_partition_tuple_slot,
+										  &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -482,7 +533,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 		/* Check the constraints of the tuple */
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -622,6 +673,19 @@ ExecInsert(ModifyTableState *mtstate,
 	ExecARInsertTriggers(estate, resultRelInfo, tuple, recheckIndexes,
 						 mtstate->mt_transition_capture);
 
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition NEW TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
+		ExecARUpdateTriggers(estate, resultRelInfo, InvalidOid,
+							 NULL,
+							 tuple,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
 	list_free(recheckIndexes);
 
 	/*
@@ -674,6 +738,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool *delete_skipped,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -682,6 +748,9 @@ ExecDelete(ModifyTableState *mtstate,
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
 
+	if (delete_skipped)
+		*delete_skipped = true;
+
 	/*
 	 * get information on the (current) result relation
 	 */
@@ -845,12 +914,29 @@ ldelete:;
 	if (canSetTag)
 		(estate->es_processed)++;
 
+	/* The DELETE has actually happened, so tell the caller */
+	if (delete_skipped)
+		*delete_skipped = false;
+
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
 						 mtstate->mt_transition_capture);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition OLD TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
+		ExecARUpdateTriggers(estate, resultRelInfo, tupleid,
+							 oldtuple,
+							 NULL,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -943,6 +1029,8 @@ ExecUpdate(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	TupleConversionMap *saved_tcs_map = NULL;
 
 	/*
 	 * abort the operation if not running transactions
@@ -1039,12 +1127,82 @@ lreplace:;
 								 resultRelInfo, slot, estate);
 
 		/*
+		 * If a partition check fails, try to move the row into the right
+		 * partition.
+		 */
+		if (resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate))
+		{
+			bool		delete_skipped;
+			TupleTableSlot *ret_slot;
+
+			/*
+			 * When an UPDATE is run directly on a leaf partition, partition
+			 * tuple routing is not set up.  In that case, fail with a
+			 * partition constraint violation error.
+			 */
+			if (mtstate->mt_partition_dispatch_info == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for the DELETE; we want to return
+			 * rows from the INSERT instead.
+			 */
+			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
+					   &delete_skipped, false, false);
+
+			/*
+			 * If for some reason the DELETE didn't happen (e.g. a trigger
+			 * prevented it, or the row was already deleted by self, or it
+			 * was concurrently deleted by another transaction), then we
+			 * should skip the INSERT as well; otherwise, one new row would
+			 * effectively be inserted.
+			 *
+			 * For a normal UPDATE, the case where the tuple has been the
+			 * subject of a concurrent UPDATE or DELETE would be handled by
+			 * the EvalPlanQual machinery, but for an UPDATE that we've
+			 * translated into a DELETE from this partition and an INSERT into
+			 * some other partition, that's not available, because CTID chains
+			 * can't span relation boundaries.  We mimic the semantics to a
+			 * limited extent by skipping the INSERT if the DELETE fails to
+			 * find a tuple. This ensures that two concurrent attempts to
+			 * UPDATE the same tuple at the same time can't turn one tuple
+			 * into two, and that an UPDATE of a just-deleted tuple can't
+			 * resurrect it.
+			 */
+			if (delete_skipped)
+				return NULL;
+
+			if (mtstate->mt_transition_capture)
+				saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
+
+			ret_slot = ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+
+			if (mtstate->mt_transition_capture)
+			{
+				/*
+				 * Revert to the transition capture map created for UPDATE;
+				 * otherwise the next UPDATE will incorrectly use the one
+				 * created for INSERT.
+				 */
+				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+				mtstate->mt_transition_capture->tcs_map = saved_tcs_map;
+			}
+			return ret_slot;
+		}
+
+		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We have already checked partition constraints above, so skip them
+		 * below.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1463,6 +1621,45 @@ fireASTriggers(ModifyTableState *node)
 }
 
 /*
+ * Set up per subplan tuple conversion map from child partition to root
+ * partitioned table. The map is needed for collecting transition tuples for
+ * AFTER triggers, and for UPDATE row movement.
+ */
+static void
+ExecSetupPerSubplanChildParentMap(ModifyTableState *mtstate)
+{
+	TupleConversionMap **tup_conv_maps;
+	TupleDesc	outdesc;
+	ResultRelInfo *resultRelInfo;
+	ResultRelInfo *rootRelInfo;
+	int			nplans = mtstate->mt_nplans;
+	int			i;
+
+	Assert(mtstate->operation != CMD_INSERT);
+
+	if (mtstate->mt_persubplan_childparent_maps != NULL)
+		return;
+
+	rootRelInfo = getASTriggerResultRelInfo(mtstate);
+
+	mtstate->mt_persubplan_childparent_maps =
+		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) * nplans);
+
+	/* Get tuple descriptor of the root partition. */
+	outdesc = RelationGetDescr(rootRelInfo->ri_RelationDesc);
+
+	resultRelInfo = mtstate->resultRelInfo;
+	tup_conv_maps = mtstate->mt_persubplan_childparent_maps;
+	for (i = 0; i < nplans; i++)
+	{
+		TupleDesc	indesc = RelationGetDescr(resultRelInfo[i].ri_RelationDesc);
+
+		tup_conv_maps[i] = convert_tuples_by_name(indesc, outdesc,
+												  gettext_noop("could not convert row type"));
+	}
+}
+
+/*
  * Set up the state needed for collecting transition tuples for AFTER
  * triggers.
  */
@@ -1470,63 +1667,115 @@ static void
 ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 {
 	ResultRelInfo *targetRelInfo = getASTriggerResultRelInfo(mtstate);
+	ResultRelInfo **resultRelInfos;
+	int			numResultRelInfos;
+	int			update_rri_index = -1;
+	ResultRelInfo *update_rri = mtstate->resultRelInfo;
+	Oid			cur_reloid = InvalidOid;
 	int			i;
 
 	/* Check for transition tables on the directly targeted relation. */
 	mtstate->mt_transition_capture =
 		MakeTransitionCaptureState(targetRelInfo->ri_TrigDesc);
 
+	if (mtstate->mt_transition_capture == NULL)
+		return;
+
 	/*
-	 * If we found that we need to collect transition tuples then we may also
+	 * Now that we know that we need to collect transition tuples, we may also
 	 * need tuple conversion maps for any children that have TupleDescs that
 	 * aren't compatible with the tuplestores.
 	 */
-	if (mtstate->mt_transition_capture != NULL)
+
+	/* Make sure per-subplan mapping is there. */
+	if (mtstate->operation != CMD_INSERT)
+		ExecSetupPerSubplanChildParentMap(mtstate);
+
+	/*
+	 * Install the conversion map for the first plan for UPDATE and DELETE
+	 * operations.  It will be advanced each time we switch to the next plan.
+	 * (INSERT operations set it every time.)
+	 */
+	if (mtstate->mt_persubplan_childparent_maps)
+	{
+		mtstate->mt_transition_capture->tcs_map =
+			mtstate->mt_persubplan_childparent_maps[0];
+	}
+
+	/* If no tuple routing, return without setting up per-leaf-partition map */
+	if (mtstate->mt_partition_dispatch_info == NULL)
+		return;
+
+	numResultRelInfos = mtstate->mt_num_partitions;
+	resultRelInfos = mtstate->mt_partitions;
+
+	/*
+	 * Build array of conversion maps from each child's TupleDesc to the one
+	 * used in the tuplestore.  The map pointers may be NULL when no
+	 * conversion is necessary, which is hopefully a common case for
+	 * partitions.
+	 */
+	mtstate->mt_perleaf_childparent_maps = (TupleConversionMap **)
+		palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+
+	/* For Inserts, just create all new map elements. */
+	if (mtstate->operation == CMD_INSERT)
 	{
-		ResultRelInfo *resultRelInfos;
-		int			numResultRelInfos;
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_perleaf_childparent_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
+									   RelationGetDescr(targetRelInfo->ri_RelationDesc),
+									   gettext_noop("could not convert row type"));
+		}
+		return;
+	}
 
-		/* Find the set of partitions so that we can find their TupleDescs. */
-		if (mtstate->mt_partition_dispatch_info != NULL)
+	/*
+	 * But for Updates, we can share the per-subplan maps with the per-leaf
+	 * maps.
+	 */
+	update_rri_index = 0;
+	update_rri = mtstate->resultRelInfo;
+	if (mtstate->mt_nplans > 0)
+		cur_reloid = RelationGetRelid(update_rri[0].ri_RelationDesc);
+
+	for (i = 0; i < numResultRelInfos; ++i)
+	{
+		ResultRelInfo *resultRelInfo = mtstate->mt_partitions[i];
+
+		/* Is this leaf partition present in the UPDATE result rels? */
+		if (cur_reloid == RelationGetRelid(resultRelInfo->ri_RelationDesc))
 		{
+			Assert(update_rri_index < mtstate->mt_nplans);
+
+			mtstate->mt_perleaf_childparent_maps[i] =
+				mtstate->mt_persubplan_childparent_maps[update_rri_index];
+			update_rri_index++;
+
 			/*
-			 * For INSERT via partitioned table, so we need TupleDescs based
-			 * on the partition routing table.
+			 * If this was the last UPDATE resultrel, indicate that by
+			 * invalidating the cur_reloid.
 			 */
-			resultRelInfos = mtstate->mt_partitions;
-			numResultRelInfos = mtstate->mt_num_partitions;
+			if (update_rri_index == mtstate->mt_nplans)
+				cur_reloid = InvalidOid;
+			else
+				cur_reloid = RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc);
 		}
 		else
 		{
-			/* Otherwise we need the ResultRelInfo for each subplan. */
-			resultRelInfos = mtstate->resultRelInfo;
-			numResultRelInfos = mtstate->mt_nplans;
-		}
-
-		/*
-		 * Build array of conversion maps from each child's TupleDesc to the
-		 * one used in the tuplestore.  The map pointers may be NULL when no
-		 * conversion is necessary, which is hopefully a common case for
-		 * partitions.
-		 */
-		mtstate->mt_transition_tupconv_maps = (TupleConversionMap **)
-			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
-		for (i = 0; i < numResultRelInfos; ++i)
-		{
-			mtstate->mt_transition_tupconv_maps[i] =
-				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
+			mtstate->mt_perleaf_childparent_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfo->ri_RelationDesc),
 									   RelationGetDescr(targetRelInfo->ri_RelationDesc),
 									   gettext_noop("could not convert row type"));
 		}
-
-		/*
-		 * Install the conversion map for the first plan for UPDATE and DELETE
-		 * operations.  It will be advanced each time we switch to the next
-		 * plan.  (INSERT operations set it every time.)
-		 */
-		mtstate->mt_transition_capture->tcs_map =
-			mtstate->mt_transition_tupconv_maps[0];
 	}
+
+	/*
+	 * We should have found all the per-subplan reloids in the leaf
+	 * partitions.
+	 */
+	Assert(update_rri_index == mtstate->mt_nplans);
 }
 
 /* ----------------------------------------------------------------
@@ -1632,9 +1881,9 @@ ExecModifyTable(PlanState *pstate)
 				if (node->mt_transition_capture != NULL)
 				{
 					/* Prepare to convert transition tuples from this child. */
-					Assert(node->mt_transition_tupconv_maps != NULL);
+					Assert(node->mt_persubplan_childparent_maps != NULL);
 					node->mt_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						node->mt_persubplan_childparent_maps[node->mt_whichplan];
 				}
 				continue;
 			}
@@ -1750,7 +1999,8 @@ ExecModifyTable(PlanState *pstate)
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1795,9 +2045,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ResultRelInfo *resultRelInfo;
 	TupleDesc	tupDesc;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	bool		update_tuple_routing_needed = node->part_cols_updated;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -1870,6 +2123,15 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(resultRelInfo, mtstate->mt_onconflict != ONCONFLICT_NONE);
 
+		/*
+		 * If this is an UPDATE and a BEFORE UPDATE trigger is present, we may
+		 * need to do update tuple routing.
+		 */
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+			operation == CMD_UPDATE)
+			update_tuple_routing_needed = true;
+
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
 		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
@@ -1907,33 +2169,63 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/* Decide whether we need to perform update tuple routing. */
+	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		update_tuple_routing_needed = false;
+
+	/*
+	 * Build state for tuple routing if it's an INSERT or if it's an UPDATE of
+	 * partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || update_tuple_routing_needed))
 	{
 		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo *partitions;
-		TupleConversionMap **partition_tupconv_maps;
+		ResultRelInfo **partitions;
+		TupleConversionMap **perleaf_parentchild_maps;
 		TupleTableSlot *partition_tuple_slot;
 		int			num_parted,
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(rel,
+									   (operation == CMD_UPDATE ?
+										mtstate->resultRelInfo : NULL),
+									   (operation == CMD_UPDATE ? nplans : 0),
 									   node->nominalRelation,
 									   estate,
 									   &partition_dispatch_info,
 									   &partitions,
-									   &partition_tupconv_maps,
+									   &perleaf_parentchild_maps,
 									   &partition_tuple_slot,
 									   &num_parted, &num_partitions);
 		mtstate->mt_partition_dispatch_info = partition_dispatch_info;
 		mtstate->mt_num_dispatch = num_parted;
 		mtstate->mt_partitions = partitions;
 		mtstate->mt_num_partitions = num_partitions;
-		mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
+		mtstate->mt_perleaf_parentchild_maps = perleaf_parentchild_maps;
 		mtstate->mt_partition_tuple_slot = partition_tuple_slot;
+		mtstate->mt_rootpartition_tuple_slot = MakeTupleTableSlot();
+
+		/*
+		 * These are required as reference objects for mapping partition
+		 * attnos in expressions such as WCO and RETURNING.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	}
 
+	/*
+	 * Construct mapping from each of the resultRelInfo attnos to the root
+	 * attno. This is required when, during UPDATE row movement, the tuple
+	 * descriptor of a source partition does not match the root partition's
+	 * descriptor. In such a case we need to convert tuples to the root
+	 * partition tuple descriptor, because the search for destination
+	 * partition starts from the root. Skip this setup if it's not a partition
+	 * key update.
+	 */
+	if (update_tuple_routing_needed)
+		ExecSetupPerSubplanChildParentMap(mtstate);
+
 	/* Build state for collecting transition tuples */
 	ExecSetupTransitionCaptureState(mtstate, estate);
 
@@ -1967,50 +2259,54 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE row
+	 * movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *firstWco;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition. Whereas for UPDATE, there are as many WCOs as there are
+		 * plans. So in either case, use the WCO expression of the first
+		 * resultRelInfo as a reference to calculate attno's for the WCO
+		 * expression of each of the partitions. We make a copy of the WCO
+		 * qual for each partition. Note that, if there are SubPlans in there,
+		 * they all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
-			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
-		resultRelInfo = mtstate->mt_partitions;
+		Assert(update_tuple_routing_needed ||
+			   (operation == CMD_INSERT &&
+				list_length(node->withCheckOptionLists) == 1 &&
+				mtstate->mt_nplans == 1));
+
+		firstWco = linitial(node->withCheckOptionLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-			List	   *mapped_wcoList;
+			Relation	partrel;
+			List	   *mappedWco;
 			List	   *wcoExprs = NIL;
 			ListCell   *ll;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel, NULL);
-			foreach(ll, mapped_wcoList)
+			resultRelInfo = mtstate->mt_partitions[i];
+
+			partrel = resultRelInfo->ri_RelationDesc;
+			mappedWco = map_partition_varattnos(firstWco,
+												firstVarno,
+												partrel, firstResultRel,
+												NULL);
+			foreach(ll, mappedWco)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
 
-			resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
+			resultRelInfo->ri_WithCheckOptions = mappedWco;
 			resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
-			resultRelInfo++;
 		}
 	}
 
@@ -2021,7 +2317,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -2058,20 +2354,26 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		resultRelInfo = mtstate->mt_partitions;
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
+			Relation	partrel;
 			List	   *rlist;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel, NULL);
+			resultRelInfo = mtstate->mt_partitions[i];
+			partrel = resultRelInfo->ri_RelationDesc;
+
+			/*
+			 * Use the returning expression of the first resultRelInfo as a
+			 * reference to calculate attno's for the returning expression of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList,
+											firstVarno,
+											partrel, firstResultRel, NULL);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
@@ -2317,6 +2619,7 @@ void
 ExecEndModifyTable(ModifyTableState *node)
 {
 	int			i;
+	CmdType		operation = node->operation;
 
 	/* Free transition tables */
 	if (node->mt_transition_capture != NULL)
@@ -2353,13 +2656,25 @@ ExecEndModifyTable(ModifyTableState *node)
 	}
 	for (i = 0; i < node->mt_num_partitions; i++)
 	{
-		ResultRelInfo *resultRelInfo = node->mt_partitions + i;
+		ResultRelInfo *resultRelInfo = node->mt_partitions[i];
+
+		/*
+		 * If this result rel is one of the subplan result rels, let
+		 * ExecEndPlan() close it. For INSERTs this does not apply, because
+		 * all leaf partition result rels are newly allocated anyway.
+		 */
+		if (operation == CMD_UPDATE &&
+			resultRelInfo >= node->resultRelInfo &&
+			resultRelInfo < node->resultRelInfo + node->mt_nplans)
+			continue;
 
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
 
-	/* Release the standalone partition tuple descriptor, if any */
+	/* Release the standalone partition tuple descriptors, if any */
+	if (node->mt_rootpartition_tuple_slot)
+		ExecDropSingleTupleTableSlot(node->mt_rootpartition_tuple_slot);
 	if (node->mt_partition_tuple_slot)
 		ExecDropSingleTupleTableSlot(node->mt_partition_tuple_slot);
 
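An aside on the per-leaf/per-subplan map sharing done in ExecSetupTransitionCaptureState above: because the UPDATE subplan result rels and the leaf partition array are assumed to appear in the same (partition-bound) order, a single linear scan can decide, for each leaf, whether to reuse the subplan's already-built conversion map or to build a fresh one. A toy C sketch of just that matching loop; the `Rel`/`Map` types and `share_maps()` are illustrative stand-ins, not the real executor structures:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;                 /* stand-in for the real Oid */
typedef struct Rel { Oid reloid; } Rel;   /* stand-in for a result rel */
typedef struct Map { int dummy; } Map;    /* stand-in for TupleConversionMap */

/*
 * For each leaf partition, reuse the corresponding subplan's map when the
 * relation OIDs match (both arrays are in the same order); otherwise a new
 * map would be built (represented here by a NULL placeholder).
 */
static void
share_maps(const Rel *leaves, int nleaves,
		   const Rel *subplan_rels, Map *subplan_maps, int nplans,
		   Map **perleaf_out)
{
	int			u = 0;
	int			i;

	for (i = 0; i < nleaves; i++)
	{
		if (u < nplans && leaves[i].reloid == subplan_rels[u].reloid)
			perleaf_out[i] = &subplan_maps[u++];	/* share existing map */
		else
			perleaf_out[i] = NULL;					/* would build new map */
	}
	/* Every subplan result rel must have appeared among the leaves. */
	assert(u == nplans);
}
```

This mirrors the patch's invariant check (`Assert(update_rri_index == mtstate->mt_nplans)`): if a subplan rel were not found among the leaves, something is wrong with the ordering assumption.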
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index f9ddf4e..f83fe7c 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -204,6 +204,7 @@ _copyModifyTable(const ModifyTable *from)
 	COPY_SCALAR_FIELD(canSetTag);
 	COPY_SCALAR_FIELD(nominalRelation);
 	COPY_NODE_FIELD(partitioned_rels);
+	COPY_SCALAR_FIELD(part_cols_updated);
 	COPY_NODE_FIELD(resultRelations);
 	COPY_SCALAR_FIELD(resultRelIndex);
 	COPY_SCALAR_FIELD(rootResultRelIndex);
@@ -2260,6 +2261,7 @@ _copyPartitionedChildRelInfo(const PartitionedChildRelInfo *from)
 
 	COPY_SCALAR_FIELD(parent_relid);
 	COPY_NODE_FIELD(child_rels);
+	COPY_BITMAPSET_FIELD(all_part_cols);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 8d92c03..f2df72b 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -909,6 +909,7 @@ _equalPartitionedChildRelInfo(const PartitionedChildRelInfo *a, const Partitione
 {
 	COMPARE_SCALAR_FIELD(parent_relid);
 	COMPARE_NODE_FIELD(child_rels);
+	COMPARE_BITMAPSET_FIELD(all_part_cols);
 
 	return true;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 9ee3e23..f642bf2 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -367,6 +367,7 @@ _outModifyTable(StringInfo str, const ModifyTable *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_INT_FIELD(resultRelIndex);
 	WRITE_INT_FIELD(rootResultRelIndex);
@@ -2096,6 +2097,7 @@ _outModifyTablePath(StringInfo str, const ModifyTablePath *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_NODE_FIELD(subroots);
@@ -2518,6 +2520,7 @@ _outPartitionedChildRelInfo(StringInfo str, const PartitionedChildRelInfo *node)
 
 	WRITE_UINT_FIELD(parent_relid);
 	WRITE_NODE_FIELD(child_rels);
+	WRITE_BITMAPSET_FIELD(all_part_cols);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 67b9e19..89dd3cf 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1562,6 +1562,7 @@ _readModifyTable(void)
 	READ_BOOL_FIELD(canSetTag);
 	READ_UINT_FIELD(nominalRelation);
 	READ_NODE_FIELD(partitioned_rels);
+	READ_BOOL_FIELD(part_cols_updated);
 	READ_NODE_FIELD(resultRelations);
 	READ_INT_FIELD(resultRelIndex);
 	READ_INT_FIELD(rootResultRelIndex);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 2d7e1d8..8c08d50 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1291,7 +1291,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	rte = planner_rt_fetch(rel->relid, root);
 	if (rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		partitioned_rels = get_partitioned_child_rels(root, rel->relid);
+		partitioned_rels = get_partitioned_child_rels(root, rel->relid, NULL);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
 	}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 2821662..85e3126 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -277,6 +277,7 @@ static ProjectSet *make_project_set(List *tlist, Plan *subplan);
 static ModifyTable *make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
@@ -2361,6 +2362,7 @@ create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 							best_path->canSetTag,
 							best_path->nominalRelation,
 							best_path->partitioned_rels,
+							best_path->part_cols_updated,
 							best_path->resultRelations,
 							subplans,
 							best_path->withCheckOptionLists,
@@ -6405,6 +6407,7 @@ static ModifyTable *
 make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam)
@@ -6431,6 +6434,7 @@ make_modifytable(PlannerInfo *root,
 	node->canSetTag = canSetTag;
 	node->nominalRelation = nominalRelation;
 	node->partitioned_rels = partitioned_rels;
+	node->part_cols_updated = part_cols_updated;
 	node->resultRelations = resultRelations;
 	node->resultRelIndex = -1;	/* will be set correctly in setrefs.c */
 	node->rootResultRelIndex = -1;	/* will be set correctly in setrefs.c */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 9662302..d498349 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1056,6 +1056,7 @@ inheritance_planner(PlannerInfo *root)
 	Index		rti;
 	RangeTblEntry *parent_rte;
 	List	   *partitioned_rels = NIL;
+	bool		part_cols_updated = false;
 
 	Assert(parse->commandType != CMD_INSERT);
 
@@ -1370,9 +1371,15 @@ inheritance_planner(PlannerInfo *root)
 
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		partitioned_rels = get_partitioned_child_rels(root, parentRTindex);
+		Bitmapset  *all_part_cols = NULL;
+
+		partitioned_rels = get_partitioned_child_rels(root, parentRTindex,
+													  &all_part_cols);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
+
+		if (bms_overlap(all_part_cols, parent_rte->updatedCols))
+			part_cols_updated = true;
 	}
 
 	/* Result path must go into outer query's FINAL upperrel */
@@ -1429,6 +1436,7 @@ inheritance_planner(PlannerInfo *root)
 									 parse->canSetTag,
 									 nominalRelation,
 									 partitioned_rels,
+									 part_cols_updated,
 									 resultRelations,
 									 subpaths,
 									 subroots,
@@ -2046,6 +2054,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 										parse->canSetTag,
 										parse->resultRelation,
 										NIL,
+										false,
 										list_make1_int(parse->resultRelation),
 										list_make1(path),
 										list_make1(root),
@@ -6076,10 +6085,15 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
  *		Returns a list of the RT indexes of the partitioned child relations
  *		with rti as the root parent RT index.
  *
+ * If all_part_cols_p is non-NULL, *all_part_cols_p is set to a bitmapset
+ * of all partitioning columns used by the partitioned table or any
+ * descendant.
+ *
  * Note: Only call this function on RTEs known to be partitioned tables.
  */
 List *
-get_partitioned_child_rels(PlannerInfo *root, Index rti)
+get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   Bitmapset **all_part_cols_p)
 {
 	List	   *result = NIL;
 	ListCell   *l;
@@ -6091,6 +6105,8 @@ get_partitioned_child_rels(PlannerInfo *root, Index rti)
 		if (pc->parent_relid == rti)
 		{
 			result = pc->child_rels;
+			if (all_part_cols_p)
+				*all_part_cols_p = pc->all_part_cols;
 			break;
 		}
 	}
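The planner-side decision above boils down to a set-overlap test: gather the partition key columns of the root table and of every partitioned child into `all_part_cols`, then check whether the UPDATE's `updatedCols` touches any of them. A toy sketch using a plain bitmask in place of `Bitmapset`; `part_cols_updated_check()` here is an illustrative stand-in for the `bms_overlap()` call in inheritance_planner(), not actual PostgreSQL code:

```c
#include <assert.h>
#include <stdint.h>

/* Attribute-number sets modeled as 64-bit masks (bit n = column n). */
typedef uint64_t AttrSet;

/*
 * Row movement machinery is needed only if some updated column is a
 * partition key column of the table or of any partitioned child.
 */
static int
part_cols_updated_check(AttrSet all_part_cols, AttrSet updated_cols)
{
	return (all_part_cols & updated_cols) != 0;
}
```

With partition keys on columns 1 and 4, an UPDATE touching only column 2 would keep `part_cols_updated` false, while one touching column 4 would set it and trigger the update-tuple-routing setup in ExecInitModifyTable.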
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index ccf2145..fc7c597 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -107,12 +107,14 @@ static void expand_partitioned_rtentry(PlannerInfo *root,
 						   PlanRowMark *parentrc, PartitionDesc partdesc,
 						   LOCKMODE lockmode,
 						   bool *has_child, List **appinfos,
+						   Bitmapset **all_part_cols,
 						   List **partitioned_child_rels);
 static void expand_single_inheritance_child(PlannerInfo *root,
 								RangeTblEntry *parentrte,
 								Index parentRTindex, Relation parentrel,
 								PlanRowMark *parentrc, Relation childrel,
 								bool *has_child, List **appinfos,
+								Bitmapset **all_part_cols,
 								List **partitioned_child_rels);
 static void make_inh_translation_list(Relation oldrelation,
 						  Relation newrelation,
@@ -1397,6 +1399,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	bool		has_child;
 	PartitionedChildRelInfo *pcinfo;
 	List	   *partitioned_child_rels = NIL;
+	Bitmapset  *all_part_cols = NULL;
 
 	/* Does RT entry allow inheritance? */
 	if (!rte->inh)
@@ -1479,11 +1482,13 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 		expand_single_inheritance_child(root, rte, rti, oldrelation, oldrc,
 										oldrelation,
 										&has_child, &appinfos,
+										&all_part_cols,
 										&partitioned_child_rels);
 		expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
 									  RelationGetPartitionDesc(oldrelation),
 									  lockmode,
 									  &has_child, &appinfos,
+									  &all_part_cols,
 									  &partitioned_child_rels);
 	}
 	else
@@ -1519,6 +1524,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 			expand_single_inheritance_child(root, rte, rti, oldrelation, oldrc,
 											newrelation,
 											&has_child, &appinfos,
+											&all_part_cols,
 											&partitioned_child_rels);
 
 			/* Close child relations, but keep locks */
@@ -1558,6 +1564,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 		Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
 		pcinfo->parent_relid = rti;
 		pcinfo->child_rels = partitioned_child_rels;
+		pcinfo->all_part_cols = all_part_cols;
 		root->pcinfo_list = lappend(root->pcinfo_list, pcinfo);
 	}
 
@@ -1571,6 +1578,7 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 						   PlanRowMark *parentrc, PartitionDesc partdesc,
 						   LOCKMODE lockmode,
 						   bool *has_child, List **appinfos,
+						   Bitmapset **all_part_cols,
 						   List **partitioned_child_rels)
 {
 	int			i;
@@ -1595,6 +1603,7 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 		expand_single_inheritance_child(root, parentrte, parentRTindex,
 										parentrel, parentrc, childrel,
 										has_child, appinfos,
+										all_part_cols,
 										partitioned_child_rels);
 
 		/* If this child is itself partitioned, recurse */
@@ -1604,6 +1613,7 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 										  RelationGetPartitionDesc(childrel),
 										  lockmode,
 										  has_child, appinfos,
+										  all_part_cols,
 										  partitioned_child_rels);
 
 		/* Close child relation, but keep locks */
@@ -1625,6 +1635,7 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 								Index parentRTindex, Relation parentrel,
 								PlanRowMark *parentrc, Relation childrel,
 								bool *has_child, List **appinfos,
+								Bitmapset **all_part_cols,
 								List **partitioned_child_rels)
 {
 	Query	   *parse = root->parse;
@@ -1695,8 +1706,11 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 		}
 	}
 	else
+	{
 		*partitioned_child_rels = lappend_int(*partitioned_child_rels,
 											  childRTindex);
+		pull_child_partition_columns(all_part_cols, childrel, parentrel);
+	}
 
 	/*
 	 * Build a PlanRowMark if parent is marked FOR UPDATE/SHARE.
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 26567cb..326c858 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3162,6 +3162,8 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partitioned_rels' is an integer list of RT indexes of non-leaf tables in
  *		the partition tree, if this is an UPDATE/DELETE to a partitioned table.
  *		Otherwise NIL.
+ * 'part_cols_updated' is true if any partitioning columns are being updated,
+ *		either in the named relation or in a descendant partitioned table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
  * 'subpaths' is a list of Path(s) producing source data (one per rel)
  * 'subroots' is a list of PlannerInfo structs (one per rel)
@@ -3175,6 +3177,7 @@ ModifyTablePath *
 create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
@@ -3242,6 +3245,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->canSetTag = canSetTag;
 	pathnode->nominalRelation = nominalRelation;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
+	pathnode->part_cols_updated = part_cols_updated;
 	pathnode->resultRelations = resultRelations;
 	pathnode->subpaths = subpaths;
 	pathnode->subroots = subroots;
diff --git a/src/backend/rewrite/rewriteManip.c b/src/backend/rewrite/rewriteManip.c
index ba706b2..ab72b36 100644
--- a/src/backend/rewrite/rewriteManip.c
+++ b/src/backend/rewrite/rewriteManip.c
@@ -1224,6 +1224,7 @@ typedef struct
 	/* Target type when converting whole-row vars */
 	Oid			to_rowtype;
 	bool	   *found_whole_row;	/* output flag */
+	bool		coerced_var;	/* var is under ConvertRowTypeExpr */
 } map_variable_attnos_context;
 
 static Node *
@@ -1267,22 +1268,29 @@ map_variable_attnos_mutator(Node *node,
 					/* Don't convert unless necessary. */
 					if (context->to_rowtype != var->vartype)
 					{
-						ConvertRowtypeExpr *r;
-
 						/* Var itself is converted to the requested type. */
 						newvar->vartype = context->to_rowtype;
 
 						/*
-						 * And a conversion node on top to convert back to the
-						 * original type.
+						 * If this var is already under a ConvertRowtypeExpr,
+						 * we don't have to add another one.
 						 */
-						r = makeNode(ConvertRowtypeExpr);
-						r->arg = (Expr *) newvar;
-						r->resulttype = var->vartype;
-						r->convertformat = COERCE_IMPLICIT_CAST;
-						r->location = -1;
-
-						return (Node *) r;
+						if (!context->coerced_var)
+						{
+							ConvertRowtypeExpr *r;
+
+							/*
+							 * And a conversion node on top to convert back to
+							 * the original type.
+							 */
+							r = makeNode(ConvertRowtypeExpr);
+							r->arg = (Expr *) newvar;
+							r->resulttype = var->vartype;
+							r->convertformat = COERCE_IMPLICIT_CAST;
+							r->location = -1;
+
+							return (Node *) r;
+						}
 					}
 				}
 			}
@@ -1290,6 +1298,28 @@ map_variable_attnos_mutator(Node *node,
 		}
 		/* otherwise fall through to copy the var normally */
 	}
+	else if (IsA(node, ConvertRowtypeExpr))
+	{
+		ConvertRowtypeExpr *r = (ConvertRowtypeExpr *) node;
+
+		/*
+		 * If this is coercing a var (which is typical), convert only the var
+		 * itself, rather than adding another ConvertRowtypeExpr on top of it.
+		 */
+		if (IsA(r->arg, Var))
+		{
+			ConvertRowtypeExpr *newnode;
+
+			newnode = (ConvertRowtypeExpr *) palloc(sizeof(ConvertRowtypeExpr));
+			*newnode = *r;
+			context->coerced_var = true;
+			newnode->arg = (Expr *) map_variable_attnos_mutator((Node *) r->arg, context);
+			context->coerced_var = false;
+
+			return (Node *) newnode;
+		}
+		/* Else fall through to the expression tree mutator */
+	}
 	else if (IsA(node, Query))
 	{
 		/* Recurse into RTE subquery or not-yet-planned sublink subquery */
@@ -1321,6 +1351,7 @@ map_variable_attnos(Node *node,
 	context.map_length = map_length;
 	context.to_rowtype = to_rowtype;
 	context.found_whole_row = found_whole_row;
+	context.coerced_var = false;
 
 	*found_whole_row = false;
 
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 2283c67..2e29276 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -68,6 +68,12 @@ typedef struct PartitionDispatchData
 	int		   *indexes;
 } PartitionDispatchData;
 
+typedef struct PartitionWalker
+{
+	List	   *rels_list;
+	ListCell   *cur_cell;
+} PartitionWalker;
+
 typedef struct PartitionDispatchData *PartitionDispatch;
 
 extern void RelationBuildPartitionDesc(Relation relation);
@@ -80,12 +86,16 @@ extern void check_new_partition_bound(char *relname, Relation parent,
 extern Oid	get_partition_parent(Oid relid);
 extern List *get_qual_from_partbound(Relation rel, Relation parent,
 						PartitionBoundSpec *spec);
-extern List *map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+extern List *map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
 
+extern void partition_walker_init(PartitionWalker *walker, Relation rel);
+extern Relation partition_walker_next(PartitionWalker *walker,
+									  Relation *parent);
+
 /* For tuple routing */
 extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
 								 int *num_parted, List **leaf_part_oids);
@@ -99,4 +109,8 @@ extern int get_partition_for_tuple(PartitionDispatch *pd,
 						EState *estate,
 						PartitionDispatchData **failed_at,
 						TupleTableSlot **failed_slot);
+extern void pull_child_partition_columns(Bitmapset **bitmapset,
+							 Relation rel,
+							 Relation parent);
+
 #endif							/* PARTITION_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index f48a603..67c2c9f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,7 +187,10 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
@@ -207,10 +210,12 @@ extern void EvalPlanQualSetTuple(EPQState *epqstate, Index rti,
 					 HeapTuple tuple);
 extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
 extern void ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionDispatch **pd,
-							   ResultRelInfo **partitions,
+							   ResultRelInfo ***partitions,
 							   TupleConversionMap ***tup_conv_maps,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions);
@@ -218,6 +223,8 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot, EState *estate);
 
 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
 extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 90a60ab..3034b01 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -511,6 +511,11 @@ typedef struct EState
 	struct dsa_area *es_query_dsa;
 } EState;
 
+/* For a given result relation, get its columns being inserted/updated. */
+#define GetInsertedColumns(relinfo, estate) \
+	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->insertedCols)
+#define GetUpdatedColumns(relinfo, estate) \
+	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)
 
 /*
  * ExecRowMark -
@@ -978,14 +983,31 @@ typedef struct ModifyTableState
 	int			mt_num_dispatch;	/* Number of entries in the above array */
 	int			mt_num_partitions;	/* Number of members in the following
 									 * arrays */
-	ResultRelInfo *mt_partitions;	/* Per partition result relation */
-	TupleConversionMap **mt_partition_tupconv_maps;
-	/* Per partition tuple conversion map */
+	ResultRelInfo **mt_partitions;	/* Per partition result relation pointers */
+
+	/*
+	 * Per partition conversion map to convert tuples from root to leaf
+	 * partition
+	 */
+	TupleConversionMap **mt_perleaf_parentchild_maps;
+
+	/*
+	 * Per partition conversion map to convert tuples from leaf partition to
+	 * root
+	 */
+	TupleConversionMap **mt_perleaf_childparent_maps;
+
+	/*
+	 * Per subplan conversion map to convert tuples from leaf partition to
+	 * root partitioned table
+	 */
+	TupleConversionMap **mt_persubplan_childparent_maps;
+
 	TupleTableSlot *mt_partition_tuple_slot;
+	TupleTableSlot *mt_rootpartition_tuple_slot;
+
 	struct TransitionCaptureState *mt_transition_capture;
 	/* controls transition table population */
-	TupleConversionMap **mt_transition_tupconv_maps;
-	/* Per plan/partition tuple conversion */
 } ModifyTableState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index a382331..6981f58 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -219,6 +219,7 @@ typedef struct ModifyTable
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	int			resultRelIndex; /* index of first resultRel in plan's list */
 	int			rootResultRelIndex; /* index of the partitioned table root */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index a39e59d..e3ff127 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1579,6 +1579,7 @@ typedef struct ModifyTablePath
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	List	   *subpaths;		/* Path(s) producing source data */
 	List	   *subroots;		/* per-target-table PlannerInfos */
@@ -2021,6 +2022,10 @@ typedef struct AppendRelInfo
  * The child_rels list must contain at least one element, because the parent
  * partitioned table is itself counted as a child.
  *
+ * all_part_cols contains all attribute numbers from the parent that are
+ * used as partitioning columns by the parent or by some descendant that is
+ * itself partitioned.
+ *
  * These structs are kept in the PlannerInfo node's pcinfo_list.
  */
 typedef struct PartitionedChildRelInfo
@@ -2029,6 +2034,7 @@ typedef struct PartitionedChildRelInfo
 
 	Index		parent_relid;
 	List	   *child_rels;
+	Bitmapset  *all_part_cols;
 } PartitionedChildRelInfo;
 
 /*
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e372f88..b38f2f1 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -238,6 +238,7 @@ extern ModifyTablePath *create_modifytable_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index 2a4cf71..c6c15c5 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -57,6 +57,7 @@ extern Expr *preprocess_phv_expression(PlannerInfo *root, Expr *expr);
 
 extern bool plan_cluster_use_sort(Oid tableOid, Oid indexOid);
 
-extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti);
+extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti,
+										Bitmapset **all_part_cols_p);
 
 #endif							/* PLANNER_H */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 9366f04..6c0036b 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,25 +198,425 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+---------------------------
+-- UPDATE with row movement
+---------------------------
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c numeric
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+-- Create partitions intentionally in descending bound order, so as to test
+-- that the sub plans are getting ordered in ascending bound order rather than ordered by the oid values.
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (c numeric, a text, b int);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+create table part_c_1_100 (b int, c numeric, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, NULL), (''a'', 10, 200), (''b'', 12, 96), (''b'', 13, 97), (''b'', 15, 105), (''b'', 17, 105)'
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 105
+ part_c_100_200 | b | 17 | 105
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+(6 rows)
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+             QUERY PLAN              
+-------------------------------------
+ Update on range_parted
+   Update on part_a_1_a_10
+   Update on part_a_10_a_20
+   Update on part_b_1_b_10
+   Update on part_c_1_100
+   Update on part_c_100_200
+   ->  Seq Scan on part_a_1_a_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_a_10_a_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_1_b_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_1_100
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_100_200
+         Filter: (c > '97'::numeric)
+(16 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96;
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (12, 116, b).
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+select a, b, c from part_c_1_100 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select a, b, c from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 12 | 116
+ b | 13 | 117
+ b | 15 | 125
+ b | 17 | 125
+(4 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (117, b, 7).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_1_a_10  | a |  1 |    
+ part_a_1_a_10  | a |  4 | 200
+ part_b_1_b_10  | b |  7 | 117
+ part_b_1_b_10  | b |  9 | 125
+ part_c_100_200 | b | 11 | 125
+ part_c_100_200 | b | 12 | 116
+(6 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120).
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_1_a_10  | a |  1 |    
+ part_b_1_b_10  | b |  7 | 117
+ part_b_1_b_10  | b |  9 | 125
+ part_c_100_200 | b | 11 | 125
+ part_c_100_200 | b | 12 | 116
+ part_c_100_200 | b | 15 | 199
+(6 rows)
+
 -- cleanup
-drop table range_parted;
+drop view upview;
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+ range_parted | a | b  | c  
+--------------+---+----+----
+ (b,15,95)    | b | 15 | 95
+ (b,17,95)    | b | 17 | 95
+(2 rows)
+
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+ part_c_1_100   | b | 15 |  95
+ part_c_1_100   | b | 17 |  95
+(6 rows)
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 105
+ part_c_100_200 | b | 17 | 105
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+(6 rows)
+
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96), (b,13,97), (b,15,105), (b,17,105), new table = (b,12,110), (b,13,98), (b,15,106), (b,17,106)
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 12 | 110
+ part_c_100_200 | b | 15 | 106
+ part_c_100_200 | b | 17 | 106
+ part_c_1_100   | b | 13 |  98
+(6 rows)
+
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96), (b,13,97), (b,15,105), (b,17,105), new table = (b,12,146), (b,13,147), (b,15,155), (b,17,155)
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 12 | 146
+ part_c_100_200 | b | 13 | 147
+ part_c_100_200 | b | 15 | 155
+ part_c_100_200 | b | 17 | 155
+(6 rows)
+
+drop trigger trans_updatetrig ON range_parted;
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_c100_200 before update or insert on part_c_100_200
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 110
+ part_c_100_200 | b | 17 | 106
+ part_c_100_200 | b | 19 | 106
+ part_c_1_100   | b | 15 |  98
+(6 rows)
+
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 146
+ part_c_100_200 | b | 16 | 147
+ part_c_100_200 | b | 17 | 155
+ part_c_100_200 | b | 19 | 155
+(6 rows)
+
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_c100_200 ON part_c_100_200;
+drop function func_parted_mod_b();
+-- statement triggers with update row movement
+---------------------------------------------------
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 105
+ part_c_100_200 | b | 17 | 105
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+(6 rows)
+
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_100_200
+create trigger c100_delete_trig
+  after delete on part_c_100_200 for each statement execute procedure trigfunc();
+create trigger c100_update_trig
+  after update on part_c_100_200 for each statement execute procedure trigfunc();
+create trigger c100_insert_trig
+  after insert on part_c_100_200 for each statement execute procedure trigfunc();
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+NOTICE:  trigger = parent_update_trig fired on table range_parted during UPDATE
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 150
+ part_a_1_a_10  | a |  1 |    
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+ part_c_1_100   | b | 15 |  55
+ part_c_1_100   | b | 17 |  55
+(6 rows)
+
+drop table mintab, range_parted CASCADE;
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+ERROR:  new row for relation "sub_part2" violates partition constraint
+DETAIL:  Failing row contains (2, 10, 2).
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Triggers can cause UPDATE row movement if it modified partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+drop trigger parted_mod_b ON sub_part1 ;
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+drop trigger trig_skip_delete ON sub_part1 ;
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 |  1 | 70
+ list_part1 | 2 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop table non_parted;
+drop function func_parted_mod_b ( ) ;
+drop table list_parted;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 6637119..da5130d 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,23 +107,253 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+
+---------------------------
+-- UPDATE with row movement
+---------------------------
+
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c numeric
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+
+-- Create partitions intentionally in descending bound order, so as to test
+-- that the sub plans are getting ordered in ascending bound order rather than ordered by the oid values.
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (c numeric, a text, b int);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+create table part_c_1_100 (b int, c numeric, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, NULL), (''a'', 10, 200), (''b'', 12, 96), (''b'', 13, 97), (''b'', 15, 105), (''b'', 17, 105)'
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96;
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+select a, b, c from part_c_1_100 order by 1, 2, 3;
+select a, b, c from part_c_100_200 order by 1, 2, 3;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement, check option passes
+update upview set a = 'b', b = 15 where b = 4;
+
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
 
 -- cleanup
-drop table range_parted;
+drop view upview;
+
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted), *;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+drop trigger trans_updatetrig ON range_parted;
+
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_c100_200 before update or insert on part_c_100_200
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_c100_200 ON part_c_100_200;
+drop function func_parted_mod_b();
+
+
+-- statement triggers with update row movement
+---------------------------------------------------
+
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_100_200
+create trigger c100_delete_trig
+  after delete on part_c_100_200 for each statement execute procedure trigfunc();
+create trigger c100_update_trig
+  after update on part_c_100_200 for each statement execute procedure trigfunc();
+create trigger c100_insert_trig
+  after insert on part_c_100_200 for each statement execute procedure trigfunc();
+
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+drop table mintab, range_parted CASCADE;
+
+
+
+--------------
+-- UPDATE of partition key or non-partition columns, with partitions
+-- having different column ordering, and with triggers.
+--------------
+
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+
+-- Test partition constraint violation when an intermediate ancestor is used
+-- and the constraint is inherited from the root of the partition tree.
+update sub_parted set a = 2 where c = 10;
+
+-- UPDATE that does not modify the partition key of the partitions chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+
+
+-----------
+-- Triggers can cause UPDATE row movement if they modify the partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger parted_mod_b ON sub_part1;
+
+-- If a BR DELETE trigger prevents the DELETE from happening, we should also
+-- skip the INSERT when that DELETE is part of UPDATE => DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger trig_skip_delete ON sub_part1;
+
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+drop table non_parted;
+
+drop function func_parted_mod_b();
+drop table list_parted;
#147Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Khandekar (#146)
Re: UPDATE of partition key

On Mon, Sep 4, 2017 at 10:52 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 4 September 2017 at 07:43, Amit Kapila <amit.kapila16@gmail.com> wrote:
Oops sorry. Now attached.

I have done some basic testing and initial review of the patch. I
have some comments/doubts. I will continue the review.

+ if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
+ ExecARUpdateTriggers(estate, resultRelInfo, InvalidOid,

For passing an invalid ItemPointer we are using InvalidOid, which
seems a bit odd to me. Are we using a similar convention some other
place? I think it would be better to just pass 0?

------

- if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
- (event == TRIGGER_EVENT_UPDATE && update_old_table))
+ if (oldtup != NULL &&
+ ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+ (event == TRIGGER_EVENT_UPDATE && update_old_table)))
  {
  Tuplestorestate *old_tuplestore;

- Assert(oldtup != NULL);

Only in the TRIGGER_EVENT_UPDATE case is it possible for oldtup to be
NULL, so we have added an extra check for oldtup and removed the
Assert; in the TRIGGER_EVENT_DELETE case we never expect it to be NULL.

Would it be better to put the Assert outside the condition check
(Assert(oldtup != NULL || event == TRIGGER_EVENT_UPDATE))?
The same applies to newtup.

I think we should also explain in the comments why oldtup or newtup
can be NULL in the TRIGGER_EVENT_UPDATE case.

-------

+    triggers affect the row being moved. As far as <literal>AFTER ROW</>
+    triggers are concerned, <literal>AFTER</> <command>DELETE</command> and
+    <literal>AFTER</> <command>INSERT</command> triggers are applied; but
+    <literal>AFTER</> <command>UPDATE</command> triggers are not applied
+    because the <command>UPDATE</command> has been converted to a
+    <command>DELETE</command> and <command>INSERT</command>.

The comment above says that the AFTER ROW UPDATE trigger is not
fired, but the code below calls ExecARUpdateTriggers:

+ if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
+ ExecARUpdateTriggers(estate, resultRelInfo, InvalidOid,
+ NULL,
+ tuple,
+ NULL,
+ mtstate->mt_transition_capture);

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#148Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Dilip Kumar (#147)
1 attachment(s)
Re: UPDATE of partition key

On 6 September 2017 at 21:47, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Sep 4, 2017 at 10:52 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 4 September 2017 at 07:43, Amit Kapila <amit.kapila16@gmail.com> wrote:
Oops sorry. Now attached.

I have done some basic testing and initial review of the patch. I

Thanks for taking this up for review. Attached is the updated patch
v17, that covers the below points.

+ if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
+ ExecARUpdateTriggers(estate, resultRelInfo, InvalidOid,

For passing an invalid ItemPointer we are using InvalidOid, which
seems a bit odd to me. Are we using a similar convention some other
place? I think it would be better to just pass 0?

Yes, that's right. Replaced InvalidOid with NULL, since ItemPointer is a pointer.

------

- if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
- (event == TRIGGER_EVENT_UPDATE && update_old_table))
+ if (oldtup != NULL &&
+ ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+ (event == TRIGGER_EVENT_UPDATE && update_old_table)))
{
Tuplestorestate *old_tuplestore;

- Assert(oldtup != NULL);

Only in the TRIGGER_EVENT_UPDATE case is it possible for oldtup to be
NULL, so we have added an extra check for oldtup and removed the
Assert; in the TRIGGER_EVENT_DELETE case we never expect it to be NULL.

Would it be better to put the Assert outside the condition check
(Assert(oldtup != NULL || event == TRIGGER_EVENT_UPDATE))?
The same applies to newtup.

I think we should also explain in the comments why oldtup or newtup
can be NULL in the TRIGGER_EVENT_UPDATE case.

Done all the above. Added two separate asserts, one for DELETE and the
other for INSERT.

-------

+    triggers affect the row being moved. As far as <literal>AFTER ROW</>
+    triggers are concerned, <literal>AFTER</> <command>DELETE</command> and
+    <literal>AFTER</> <command>INSERT</command> triggers are applied; but
+    <literal>AFTER</> <command>UPDATE</command> triggers are not applied
+    because the <command>UPDATE</command> has been converted to a
+    <command>DELETE</command> and <command>INSERT</command>.

The comment above says that the AFTER ROW UPDATE trigger is not
fired, but the code below calls ExecARUpdateTriggers:

+ if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
+ ExecARUpdateTriggers(estate, resultRelInfo, InvalidOid,
+ NULL,
+ tuple,
+ NULL,
+ mtstate->mt_transition_capture);

Actually, since transition tables came in, the functions like
ExecARUpdateTriggers() or ExecARInsertTriggers() have this additional
purpose of capturing transition table rows, so that the images of the
tables are visible when statement triggers are fired that refer to
these transition tables. So in the above code, these functions only
capture rows; they do not add any event for firing ROW triggers.
AfterTriggerSaveEvent() returns without adding any event if it's
called only for transition capture. So even if UPDATE row triggers are
defined, they won't get fired in case of row movement, although the
updated rows would be captured if transition tables are referenced in
these triggers or in the statement triggers.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

update-partition-key_v17.patchapplication/octet-stream; name=update-partition-key_v17.patchDownload
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index b05a9c2..5a436a1 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -2993,6 +2993,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition, one whose partition constraint the modified row satisfies.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3285,9 +3290,20 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</> causes a row to move from one partition to
+       another, there is a chance that another concurrent <command>UPDATE</> or
+       <command>DELETE</> misses this row. Suppose, during the row movement,
+       the row is still visible to the concurrent session, and it is about to
+       do an <command>UPDATE</> or <command>DELETE</> operation on the same
+       row. This DML operation can silently miss this row if the row now gets
+       deleted from the partition by the first session as part of its
+       <command>UPDATE</> row movement. In such a case, the concurrent
+       <command>UPDATE</>/<command>DELETE</>, being unaware of the row
+       movement, interprets that the row has just been deleted so there is
+       nothing to be done for this row. Whereas, in the usual case where the
+       table is not partitioned, or where there is no row movement, the second
+       session would have identified the newly updated row and carried out the
+       <command>UPDATE</>/<command>DELETE</> on this new row version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 8a1619f..28cfc1a 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</> and
+   <command>INSERT</> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</> or <command>DELETE</> of the same row may
+   miss it. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index 950245d..72300a0 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -160,6 +160,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it will be performed as a
+    <command>DELETE</command> from the original partition followed by
+    <command>INSERT</command> into the new partition. In this case, all
+    row-level <literal>BEFORE</> <command>UPDATE</command> triggers and all
+    row-level <literal>BEFORE</> <command>DELETE</command> triggers are fired
+    on the original partition. Then all row-level <literal>BEFORE</>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</>
+    triggers are concerned, <literal>AFTER</> <command>DELETE</command> and
+    <literal>AFTER</> <command>INSERT</command> triggers are applied; but
+    <literal>AFTER</> <command>UPDATE</command> triggers are not applied
+    because the <command>UPDATE</command> has been converted to a
+    <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 5016263..a1004a9 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -878,7 +878,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
 
 /*
  * map_partition_varattnos - maps varattno of any Vars in expr from the
- * parent attno to partition attno.
+ * attnos of the 'from_rel' partition to the attnos of the 'to_rel' partition.
+ * Each rel can be either a leaf partition or a partitioned table.
  *
  * We must allow for cases where physical attnos of a partition can be
  * different from the parent's.
@@ -891,8 +892,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
  * are working on Lists, so it's less messy to do the casts internally.
  */
 List *
-map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row)
 {
 	AttrNumber *part_attnos;
@@ -901,14 +902,14 @@ map_partition_varattnos(List *expr, int target_varno,
 	if (expr == NIL)
 		return NIL;
 
-	part_attnos = convert_tuples_by_name_map(RelationGetDescr(partrel),
-											 RelationGetDescr(parent),
+	part_attnos = convert_tuples_by_name_map(RelationGetDescr(to_rel),
+											 RelationGetDescr(from_rel),
 											 gettext_noop("could not convert row type"));
 	expr = (List *) map_variable_attnos((Node *) expr,
-										target_varno, 0,
+										fromrel_varno, 0,
 										part_attnos,
-										RelationGetDescr(parent)->natts,
-										RelationGetForm(partrel)->reltype,
+										RelationGetDescr(from_rel)->natts,
+										RelationGetForm(to_rel)->reltype,
 										&my_found_whole_row);
 	if (found_whole_row)
 		*found_whole_row = my_found_whole_row;
@@ -2050,6 +2051,77 @@ error_exit:
 }
 
 /*
+ * For each column of rel which is in the partition key or which appears
+ * in an expression which is in the partition key, translate the attribute
+ * number of that column according to the given parent, and add the resulting
+ * column number to the bitmapset, offset as we frequently do by
+ * FirstLowInvalidHeapAttributeNumber.
+ */
+void
+pull_child_partition_columns(Bitmapset **bitmapset,
+							 Relation rel,
+							 Relation parent)
+{
+	PartitionKey key = RelationGetPartitionKey(rel);
+	int16		partnatts = get_partition_natts(key);
+	List	   *partexprs = get_partition_exprs(key);
+	ListCell   *lc;
+	Bitmapset  *child_keycols = NULL;
+	int			i;
+	AttrNumber *map;
+	int			child_keycol = -1;
+
+	/*
+	 * First, compute the complete set of partition columns for this rel. For
+	 * compatibility with the API exposed by pull_varattnos, we offset the
+	 * column numbers by FirstLowInvalidHeapAttributeNumber.
+	 */
+	for (i = 0; i < partnatts; i++)
+	{
+		AttrNumber	partattno = get_partition_col_attnum(key, i);
+
+		if (partattno != 0)
+			child_keycols =
+				bms_add_member(child_keycols,
+							   partattno - FirstLowInvalidHeapAttributeNumber);
+	}
+	foreach(lc, partexprs)
+	{
+		Node	   *expr = (Node *) lfirst(lc);
+
+		pull_varattnos(expr, 1, &child_keycols);
+	}
+
+	/*
+	 * Next, work out how to convert from the attribute numbers for the child
+	 * to the attribute numbers for the parent.
+	 */
+	map =
+		convert_tuples_by_name_map(RelationGetDescr(parent),
+								   RelationGetDescr(rel),
+								   gettext_noop("could not convert row type"));
+
+	/*
+	 * For each child key column we have identified, translate to the
+	 * corresponding parent key column.  Entry 0 in the map array corresponds
+	 * to attribute number 1, which corresponds to a bitmapset entry for 1 -
+	 * FirstLowInvalidHeapAttributeNumber.
+	 */
+	while ((child_keycol = bms_next_member(child_keycols, child_keycol)) >= 0)
+	{
+		int			kc = child_keycol + FirstLowInvalidHeapAttributeNumber;
+
+		Assert(kc > 0 && kc <= RelationGetNumberOfAttributes(rel));
+		*bitmapset =
+			bms_add_member(*bitmapset,
+						   map[kc - 1] - FirstLowInvalidHeapAttributeNumber);
+	}
+
+	/* Release memory. */
+	pfree(map);
+}
+
+/*
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfa3f05..4ac5bd6 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -168,7 +168,7 @@ typedef struct CopyStateData
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;	/* Number of entries in the above array */
 	int			num_partitions; /* Number of members in the following arrays */
-	ResultRelInfo *partitions;	/* Per partition result relation */
+	ResultRelInfo **partitions;	/* Per partition result relation pointers */
 	TupleConversionMap **partition_tupconv_maps;
 	TupleTableSlot *partition_tuple_slot;
 	TransitionCaptureState *transition_capture;
@@ -2446,13 +2446,15 @@ CopyFrom(CopyState cstate)
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo *partitions;
+		ResultRelInfo **partitions;
 		TupleConversionMap **partition_tupconv_maps;
 		TupleTableSlot *partition_tuple_slot;
 		int			num_parted,
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(cstate->rel,
+									   NULL,
+									   0,
 									   1,
 									   estate,
 									   &partition_dispatch_info,
@@ -2482,7 +2484,7 @@ CopyFrom(CopyState cstate)
 			for (i = 0; i < cstate->num_partitions; ++i)
 			{
 				cstate->transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(cstate->partitions[i].ri_RelationDesc),
+					convert_tuples_by_name(RelationGetDescr(cstate->partitions[i]->ri_RelationDesc),
 										   RelationGetDescr(cstate->rel),
 										   gettext_noop("could not convert row type"));
 			}
@@ -2616,7 +2618,7 @@ CopyFrom(CopyState cstate)
 			 * to the selected partition.
 			 */
 			saved_resultRelInfo = resultRelInfo;
-			resultRelInfo = cstate->partitions + leaf_part_index;
+			resultRelInfo = cstate->partitions[leaf_part_index];
 
 			/* We do not yet have a way to insert into a foreign partition */
 			if (resultRelInfo->ri_FdwRoutine)
@@ -2726,7 +2728,7 @@ CopyFrom(CopyState cstate)
 
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr || check_partition_constr)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
@@ -2846,7 +2848,7 @@ CopyFrom(CopyState cstate)
 		}
 		for (i = 0; i < cstate->num_partitions; i++)
 		{
-			ResultRelInfo *resultRelInfo = cstate->partitions + i;
+			ResultRelInfo *resultRelInfo = cstate->partitions[i];
 
 			ExecCloseIndices(resultRelInfo);
 			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index da0850b..6904c4e 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -67,15 +67,6 @@ int			SessionReplicationRole = SESSION_REPLICATION_ROLE_ORIGIN;
 /* How many levels deep into trigger execution are we? */
 static int	MyTriggerDepth = 0;
 
-/*
- * Note that similar macros also exist in executor/execMain.c.  There does not
- * appear to be any good header to put them into, given the structures that
- * they use, so we let them be duplicated.  Be sure to update all if one needs
- * to be changed, however.
- */
-#define GetUpdatedColumns(relinfo, estate) \
-	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)
-
 /* Local function prototypes */
 static void ConvertTriggerToFK(CreateTrigStmt *stmt, Oid funcoid);
 static void SetTriggerFlags(TriggerDesc *trigdesc, Trigger *trigger);
@@ -2903,8 +2894,13 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 	{
 		HeapTuple	trigtuple;
 
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
+		/*
+		 * Note: if the UPDATE is converted into a DELETE+INSERT as part of
+		 * update-partition-key operation, then this function is also called
+		 * separately for DELETE and INSERT to capture transition table rows.
+		 * In such case, either old tuple or new tuple can be NULL.
+		 */
+		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
 			trigtuple = GetTupleForTrigger(estate,
 										   NULL,
 										   relinfo,
@@ -5211,7 +5207,12 @@ AfterTriggerPendingOnRel(Oid relid)
  *	triggers actually need to be queued.  It is also called after each row,
  *	even if there are no triggers for that event, if there are any AFTER
  *	STATEMENT triggers for the statement which use transition tables, so that
- *	the transition tuplestores can be built.
+ *	the transition tuplestores can be built. Furthermore, if the transition
+ *  capture is happening for UPDATEd rows being moved to another partition due
+ *  to a partition-key change, then this function is called once when the row is
+ *  deleted (to capture the OLD row), and once when the row is inserted into
+ *  another partition (to capture the NEW row). This is done separately because
+ *  INSERT happen on different tables.
  *
  *	Transition tuplestores are built now, rather than when events are pulled
  *	off of the queue because AFTER ROW triggers are allowed to select from the
@@ -5260,12 +5261,27 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		bool		update_new_table = transition_capture->tcs_update_new_table;
 		bool		insert_new_table = transition_capture->tcs_insert_new_table;;
 
-		if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_old_table))
+		/*
+		 * For capturing transition tuples for UPDATE events fired during
+		 * partition row movement, either oldtup or newtup can be NULL,
+		 * depending on whether the event is for the row being deleted from the
+		 * old partition or for the row being inserted into the new one. But
+		 * in any case, oldtup should always be non-NULL for DELETE events, and
+		 * newtup should be non-NULL for INSERT events, because for transition
+		 * capture with partition row movement, INSERT and DELETE events don't
+		 * fire; only UPDATE event is fired.
+		 */
+		Assert(!(event == TRIGGER_EVENT_DELETE && delete_old_table &&
+				 oldtup == NULL));
+		Assert(!(event == TRIGGER_EVENT_INSERT && insert_new_table &&
+				 newtup == NULL));
+
+		if (oldtup != NULL &&
+			((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_old_table)))
 		{
 			Tuplestorestate *old_tuplestore;
 
-			Assert(oldtup != NULL);
 			old_tuplestore = transition_capture->tcs_old_tuplestore;
 
 			if (map != NULL)
@@ -5278,12 +5294,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			else
 				tuplestore_puttuple(old_tuplestore, oldtup);
 		}
-		if ((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_new_table))
+		if (newtup != NULL &&
+			((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
+			(event == TRIGGER_EVENT_UPDATE && update_new_table)))
 		{
 			Tuplestorestate *new_tuplestore;
 
-			Assert(newtup != NULL);
 			if (event == TRIGGER_EVENT_INSERT)
 				new_tuplestore = transition_capture->tcs_insert_tuplestore;
 			else
@@ -5306,7 +5322,8 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		if (trigdesc == NULL ||
 			(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
 			(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+			(event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup == NULL)))
 			return;
 	}
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 2946a0e..7cbc4cb 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -65,6 +65,18 @@
 #include "utils/snapmgr.h"
 #include "utils/tqual.h"
 
+/*
+ * Entry of a temporary hash table. During UPDATE tuple routing, we want to
+ * know which of the leaf partitions are present in the UPDATE per-subplan
+ * resultRelInfo array (ModifyTableState->resultRelInfo[]). This hash table
+ * is searchable by the oids of the subplan result rels.
+ */
+typedef struct ResultRelOidsEntry
+{
+	Oid			rel_oid;
+	ResultRelInfo *resultRelInfo;
+} ResultRelOidsEntry;
+
 
 /* Hooks for plugins to get control in ExecutorStart/Run/Finish/End */
 ExecutorStart_hook_type ExecutorStart_hook = NULL;
@@ -104,19 +116,6 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
 									 int maxfieldlen);
 static void EvalPlanQualStart(EPQState *epqstate, EState *parentestate,
 				  Plan *planTree);
-static void ExecPartitionCheck(ResultRelInfo *resultRelInfo,
-				   TupleTableSlot *slot, EState *estate);
-
-/*
- * Note that GetUpdatedColumns() also exists in commands/trigger.c.  There does
- * not appear to be any good header to put it into, given the structures that
- * it uses, so we let them be duplicated.  Be sure to update both if one needs
- * to be changed, however.
- */
-#define GetInsertedColumns(relinfo, estate) \
-	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->insertedCols)
-#define GetUpdatedColumns(relinfo, estate) \
-	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)
 
 /* end of local decls */
 
@@ -1843,15 +1842,10 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
 /*
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  */
-static void
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
-	Relation	rel = resultRelInfo->ri_RelationDesc;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	Bitmapset  *modifiedCols;
-	Bitmapset  *insertedCols;
-	Bitmapset  *updatedCols;
 	ExprContext *econtext;
 
 	/*
@@ -1879,52 +1873,66 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
 	 */
-	if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+	return ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext);
+}
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
+
+	/* See the comments in ExecConstraints. */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
+									 gettext_noop("could not convert row type"));
+		if (map != NULL)
 		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
-										 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecSetSlotDescriptor(slot, tupdesc);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+			tuple = do_convert_tuple(tuple, map);
+			ExecSetSlotDescriptor(slot, tupdesc);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("new row for relation \"%s\" violates partition constraint",
-						RelationGetRelationName(orig_rel)),
-				 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+			 errmsg("new row for relation \"%s\" violates partition constraint",
+					RelationGetRelationName(orig_rel)),
+			 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if requested,
+ * checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1932,7 +1940,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2048,8 +2057,9 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 }
 
 
@@ -3235,34 +3245,40 @@ EvalPlanQualEnd(EPQState *epqstate)
  * ExecSetupPartitionTupleRouting - set up information needed during
  * tuple routing for partitioned tables
  *
+ * 'update_rri' has the UPDATE per-subplan result rels.
+ * 'num_update_rri' : number of UPDATE per-subplan result rels. For INSERT,
+ *      this is 0.
+ *
  * Output arguments:
  * 'pd' receives an array of PartitionDispatch objects with one entry for
  *		every partitioned table in the partition tree
- * 'partitions' receives an array of ResultRelInfo objects with one entry for
+ * 'partitions' receives an array of ResultRelInfo* objects with one entry for
  *		every leaf partition in the partition tree
- * 'tup_conv_maps' receives an array of TupleConversionMap objects with one
- *		entry for every leaf partition (required to convert input tuple based
- *		on the root table's rowtype to a leaf partition's rowtype after tuple
- *		routing is done)
+ * 'perleaf_parentchild_maps' receives an array of TupleConversionMap objects
+ *		with one entry for every leaf partition (required to convert input tuple
+ *		based on the root table's rowtype to a leaf partition's rowtype after
+ *		tuple routing is done)
  * 'partition_tuple_slot' receives a standalone TupleTableSlot to be used
  *		to manipulate any given leaf partition's rowtype after that partition
  *		is chosen by tuple-routing.
  * 'num_parted' receives the number of partitioned tables in the partition
  *		tree (= the number of entries in the 'pd' output array)
  * 'num_partitions' receives the number of leaf partitions in the partition
- *		tree (= the number of entries in the 'partitions' and 'tup_conv_maps'
- *		output arrays
+ *		tree (= the number of entries in the 'partitions' and
+ *		'perleaf_parentchild_maps' output arrays
  *
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  */
 void
 ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionDispatch **pd,
-							   ResultRelInfo **partitions,
-							   TupleConversionMap ***tup_conv_maps,
+							   ResultRelInfo ***partitions,
+							   TupleConversionMap ***perleaf_parentchild_maps,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions)
 {
@@ -3270,7 +3286,10 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
+	HTAB	   *result_rel_oids = NULL;
+	HASHCTL		ctl;
+	ResultRelOidsEntry *hash_entry;
+	ResultRelInfo *leaf_part_arr;
 
 	/*
 	 * Get the information about the partition tree after locking all the
@@ -3279,10 +3298,50 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
 	*pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
 	*num_partitions = list_length(leaf_parts);
-	*partitions = (ResultRelInfo *) palloc(*num_partitions *
-										   sizeof(ResultRelInfo));
-	*tup_conv_maps = (TupleConversionMap **) palloc0(*num_partitions *
-													 sizeof(TupleConversionMap *));
+	*partitions = (ResultRelInfo **) palloc(*num_partitions *
+											sizeof(ResultRelInfo *));
+	*perleaf_parentchild_maps = (TupleConversionMap **) palloc0(*num_partitions *
+																sizeof(TupleConversionMap *));
+
+	/*
+	 * For updates, if the leaf partition is already present in the per-subplan
+	 * result rels, we re-use it rather than initialize a new result rel. To
+	 * find out whether a given leaf partition already has a resultRel, we
+	 * build a hash table over the subplan result rels, keyed by oid.
+	 */
+	if (num_update_rri != 0)
+	{
+		ResultRelInfo	   *resultRelInfo;
+
+		memset(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(Oid);
+		ctl.entrysize = sizeof(ResultRelOidsEntry);
+		ctl.hcxt = CurrentMemoryContext;
+		result_rel_oids = hash_create("result_rel_oids temporary hash",
+								32, /* start small and extend */
+								&ctl,
+								HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+		resultRelInfo = update_rri;
+		for (i = 0; i < num_update_rri; i++, resultRelInfo++)
+		{
+			Oid reloid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+
+			hash_entry = hash_search(result_rel_oids, &reloid,
+									 HASH_ENTER, NULL);
+			hash_entry->resultRelInfo = resultRelInfo;
+		}
+	}
+	else
+	{
+		/*
+		 * For inserts, we need to create all new result rels, so avoid
+		 * repeated pallocs by allocating memory for all the result rels in
+		 * bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(*num_partitions *
+												  sizeof(ResultRelInfo));
+	}
 
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
@@ -3292,23 +3351,72 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	 */
 	*partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = *partitions;
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (num_update_rri != 0)
+		{
+			/*
+			 * If this leaf partition is already present in the per-subplan
+			 * resultRelInfos, re-use that resultRelInfo along with its
+			 * already-opened relation; otherwise create a new result rel.
+			 */
+			hash_entry = hash_search(result_rel_oids, &leaf_oid,
+									 HASH_FIND, NULL);
+			if (hash_entry != NULL)
+			{
+				leaf_part_rri = hash_entry->resultRelInfo;
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required when converting tuple as per root
+				 * partition tuple descriptor. When generating the update
+				 * plans, this was not set.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = leaf_part_arr + i;
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in *partitions are eventually
-		 * closed by the caller.
+		 * If we didn't open the partition rel, it means we haven't
+		 * initialized the result rel either.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above including the leaf
+			 * partitions. Note that each of the newly opened relations in
+			 * *partitions are eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri,
+							  partrel,
+							  resultRTindex,
+							  rel,
+							  estate->es_instrument);
+		}
+
+		estate->es_leaf_result_relations =
+			lappend(estate->es_leaf_result_relations, leaf_part_rri);
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
-		 * Verify result relation is a valid target for the current operation.
+		 * Verify the result relation is a valid target for an insert
+		 * operation. Even for updates, this is done for tuple routing, so
+		 * we still need to check validity for inserts.
 		 */
 		CheckValidResultRel(partrel, CMD_INSERT);
 
@@ -3316,17 +3424,8 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		 * Save a tuple conversion map to convert a tuple routed to this
 		 * partition from the parent's type to the partition's.
 		 */
-		(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
-													 gettext_noop("could not convert row type"));
-
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  resultRTindex,
-						  rel,
-						  estate->es_instrument);
-
-		estate->es_leaf_result_relations =
-			lappend(estate->es_leaf_result_relations, leaf_part_rri);
+		(*perleaf_parentchild_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
+																gettext_noop("could not convert row type"));
 
 		/*
 		 * Open partition indices (remember we do not support ON CONFLICT in
@@ -3337,9 +3436,12 @@ ExecSetupPartitionTupleRouting(Relation rel,
 			leaf_part_rri->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(leaf_part_rri, false);
 
-		leaf_part_rri++;
+		(*partitions)[i] = leaf_part_rri;
 		i++;
 	}
+
+	if (result_rel_oids != NULL)
+		hash_destroy(result_rel_oids);
 }
 
 /*
@@ -3365,8 +3467,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 	result = get_partition_for_tuple(pd, slot, estate,
 									 &failed_at, &failed_slot);
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index fbb8108..47afe09 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -401,7 +401,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -466,7 +466,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index e12721a..0ce4355 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -45,6 +45,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -53,7 +54,6 @@
 #include "utils/rel.h"
 #include "utils/tqual.h"
 
-
 static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 ResultRelInfo *resultRelInfo,
 					 ItemPointer conflictTid,
@@ -240,6 +240,36 @@ ExecCheckTIDVisible(EState *estate,
 	ReleaseBuffer(buffer);
 }
 
+/*
+ * ConvertPartitionTupleSlot -- convenience function for converting a tuple
+ * and storing it into the tuple slot provided through 'new_slot', which
+ * typically should be one of the dedicated partition tuple slots. Passes the
+ * partition tuple slot back through output param p_old_slot. If no mapping is
+ * present, leaves p_old_slot unchanged.
+ *
+ * Returns the converted tuple.
+ */
+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate, TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot, TupleTableSlot **p_old_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor to match the converted tuple.
+	 */
+	*p_old_slot = new_slot;
+	Assert(new_slot != NULL);
+	ExecSetSlotDescriptor(new_slot, map->outdesc);
+	ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
 /* ----------------------------------------------------------------
  *		ExecInsert
  *
@@ -281,17 +311,50 @@ ExecInsert(ModifyTableState *mtstate,
 	if (mtstate->mt_partition_dispatch_info)
 	{
 		int			leaf_part_index;
-		TupleConversionMap *map;
+		ResultRelInfo *rootResultRelInfo;
+
+		/*
+		 * If the original operation is UPDATE, the root partition rel needs
+		 * to be fetched from mtstate->rootResultRelInfo.
+		 */
+		rootResultRelInfo = (mtstate->rootResultRelInfo ?
+							 mtstate->rootResultRelInfo : resultRelInfo);
+
+		/*
+		 * If the resultRelInfo is not the root partition (which happens for
+		 * UPDATE), we should convert the tuple into root partition's tuple
+		 * descriptor, since ExecFindPartition() starts the search from root.
+		 * The tuple conversion map list is in the order of
+		 * mtstate->resultRelInfo[], so to retrieve the one for this
+		 * resultRel, we need to know the position of the resultRel in
+		 * mtstate->resultRelInfo[]. Note: we assume that if the resultRelInfo
+		 * does not belong to the subplans, then it already matches the root
+		 * tuple descriptor, although there is no known scenario where that
+		 * could actually happen.
+		 */
+		if (rootResultRelInfo != resultRelInfo &&
+			mtstate->mt_persubplan_childparent_maps != NULL &&
+			resultRelInfo >= mtstate->resultRelInfo &&
+			resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1)
+		{
+			int			map_index = resultRelInfo - mtstate->resultRelInfo;
+
+			tuple = ConvertPartitionTupleSlot(mtstate,
+											  mtstate->mt_persubplan_childparent_maps[map_index],
+											  tuple,
+											  mtstate->mt_rootpartition_tuple_slot,
+											  &slot);
+		}
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
 		 * ExecFindPartition() does not return and errors out instead.
 		 * Otherwise, the returned value is to be used as an index into arrays
-		 * mt_partitions[] and mt_partition_tupconv_maps[] that will get us
+		 * mt_partitions[] and mt_perleaf_parentchild_maps[] that will get us
 		 * the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
-		leaf_part_index = ExecFindPartition(resultRelInfo,
+		leaf_part_index = ExecFindPartition(rootResultRelInfo,
 											mtstate->mt_partition_dispatch_info,
 											slot,
 											estate);
@@ -303,7 +366,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * the selected partition.
 		 */
 		saved_resultRelInfo = resultRelInfo;
-		resultRelInfo = mtstate->mt_partitions + leaf_part_index;
+		resultRelInfo = mtstate->mt_partitions[leaf_part_index];
 
 		/* We do not yet have a way to insert into a foreign partition */
 		if (resultRelInfo->ri_FdwRoutine)
@@ -331,7 +394,7 @@ ExecInsert(ModifyTableState *mtstate,
 				 */
 				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
 				mtstate->mt_transition_capture->tcs_map =
-					mtstate->mt_transition_tupconv_maps[leaf_part_index];
+					mtstate->mt_perleaf_childparent_maps[leaf_part_index];
 			}
 			else
 			{
@@ -348,23 +411,11 @@ ExecInsert(ModifyTableState *mtstate,
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = mtstate->mt_partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = mtstate->mt_partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(mtstate,
+										  mtstate->mt_perleaf_parentchild_maps[leaf_part_index],
+										  tuple,
+										  mtstate->mt_partition_tuple_slot,
+										  &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -482,7 +533,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 		/* Check the constraints of the tuple */
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -622,6 +673,19 @@ ExecInsert(ModifyTableState *mtstate,
 	ExecARInsertTriggers(estate, resultRelInfo, tuple, recheckIndexes,
 						 mtstate->mt_transition_capture);
 
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition NEW TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
+		ExecARUpdateTriggers(estate, resultRelInfo, NULL,
+							 NULL,
+							 tuple,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
 	list_free(recheckIndexes);
 
 	/*
@@ -674,6 +738,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool *delete_skipped,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -682,6 +748,9 @@ ExecDelete(ModifyTableState *mtstate,
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
 
+	if (delete_skipped)
+		*delete_skipped = true;
+
 	/*
 	 * get information on the (current) result relation
 	 */
@@ -845,12 +914,29 @@ ldelete:;
 	if (canSetTag)
 		(estate->es_processed)++;
 
+	/* The delete has actually happened, so inform the caller */
+	if (delete_skipped)
+		*delete_skipped = false;
+
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
 						 mtstate->mt_transition_capture);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition OLD TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
+		ExecARUpdateTriggers(estate, resultRelInfo, tupleid,
+							 oldtuple,
+							 NULL,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -943,6 +1029,8 @@ ExecUpdate(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	TupleConversionMap *saved_tcs_map = NULL;
+
 
 	/*
 	 * abort the operation if not running transactions
@@ -1039,12 +1127,82 @@ lreplace:;
 								 resultRelInfo, slot, estate);
 
 		/*
+		 * If a partition check fails, try to move the row into the right
+		 * partition.
+		 */
+		if (resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate))
+		{
+			bool		delete_skipped;
+			TupleTableSlot *ret_slot;
+
+			/*
+			 * When an UPDATE is run directly on a leaf partition, we will not
+			 * have partition tuple routing set up. In that case, fail with a
+			 * partition constraint violation error.
+			 */
+			if (mtstate->mt_partition_dispatch_info == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for DELETE. We want to return rows
+			 * from INSERT.
+			 */
+			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
+					   &delete_skipped, false, false);
+
+			/*
+			 * If for some reason the DELETE didn't happen (e.g. a trigger
+			 * prevented it, the row was already deleted by this command, or
+			 * it was concurrently deleted by another transaction), then we
+			 * should skip the INSERT as well; otherwise there would
+			 * effectively be one extra new row inserted.
+			 *
+			 * For a normal UPDATE, the case where the tuple has been the
+			 * subject of a concurrent UPDATE or DELETE would be handled by
+			 * the EvalPlanQual machinery, but for an UPDATE that we've
+			 * translated into a DELETE from this partition and an INSERT into
+			 * some other partition, that's not available, because CTID chains
+			 * can't span relation boundaries.  We mimic the semantics to a
+			 * limited extent by skipping the INSERT if the DELETE fails to
+			 * find a tuple. This ensures that two concurrent attempts to
+			 * UPDATE the same tuple at the same time can't turn one tuple
+			 * into two, and that an UPDATE of a just-deleted tuple can't
+			 * resurrect it.
+			 */
+			if (delete_skipped)
+				return NULL;
+
+			if (mtstate->mt_transition_capture)
+				saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
+
+			ret_slot = ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+
+			if (mtstate->mt_transition_capture)
+			{
+				/*
+				 * Revert to the transition capture map created for UPDATE;
+				 * otherwise the next UPDATE will incorrectly use the one
+				 * created for INSERT.
+				 */
+				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+				mtstate->mt_transition_capture->tcs_map = saved_tcs_map;
+			}
+			return ret_slot;
+		}
+
+		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We have already checked partition constraints above, so skip them
+		 * below.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
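The row-movement control flow added to ExecUpdate above can be summarized in a few lines: perform the DELETE first, and if it was skipped (a trigger suppressed it, or the row was concurrently deleted), skip the INSERT too, so that two concurrent UPDATEs cannot turn one row into two and an UPDATE of a just-deleted row cannot resurrect it. A hypothetical Python sketch of just that flow (callbacks stand in for ExecDelete/ExecInsert):

```python
def move_row(delete_fn, insert_fn, row):
    """Model of the DELETE-then-INSERT row movement in ExecUpdate.

    delete_fn returns whether the delete actually happened (the inverse of
    the patch's delete_skipped flag); insert_fn returns the inserted row.
    """
    deleted = delete_fn(row)     # ExecDelete with RETURNING suppressed
    if not deleted:
        # delete_skipped: mimic EvalPlanQual semantics to a limited
        # extent by not inserting either.
        return None
    return insert_fn(row)        # ExecInsert into the routed partition


moved = move_row(lambda r: True, lambda r: ("new_part", r), "t1")
skipped = move_row(lambda r: False, lambda r: ("new_part", r), "t2")
```

The first call models a successful movement; the second models a BR DELETE trigger (or concurrent delete) suppressing the DELETE, after which nothing is inserted.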
@@ -1463,6 +1621,45 @@ fireASTriggers(ModifyTableState *node)
 }
 
 /*
+ * Set up per subplan tuple conversion map from child partition to root
+ * partitioned table. The map is needed for collecting transition tuples for
+ * AFTER triggers, and for UPDATE row movement.
+ */
+static void
+ExecSetupPerSubplanChildParentMap(ModifyTableState *mtstate)
+{
+	TupleConversionMap **tup_conv_maps;
+	TupleDesc	outdesc;
+	ResultRelInfo *resultRelInfo;
+	ResultRelInfo *rootRelInfo;
+	int			nplans = mtstate->mt_nplans;
+	int			i;
+
+	Assert(mtstate->operation != CMD_INSERT);
+
+	if (mtstate->mt_persubplan_childparent_maps != NULL)
+		return;
+
+	rootRelInfo = getASTriggerResultRelInfo(mtstate);
+
+	mtstate->mt_persubplan_childparent_maps =
+		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) * nplans);
+
+	/* Get tuple descriptor of the root partition. */
+	outdesc = RelationGetDescr(rootRelInfo->ri_RelationDesc);
+
+	resultRelInfo = mtstate->resultRelInfo;
+	tup_conv_maps = mtstate->mt_persubplan_childparent_maps;
+	for (i = 0; i < nplans; i++)
+	{
+		TupleDesc	indesc = RelationGetDescr(resultRelInfo[i].ri_RelationDesc);
+
+		tup_conv_maps[i] = convert_tuples_by_name(indesc, outdesc,
+												  gettext_noop("could not convert row type"));
+	}
+}
+
+/*
  * Set up the state needed for collecting transition tuples for AFTER
  * triggers.
  */
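The per-leaf/per-subplan map sharing set up in ExecSetupTransitionCaptureState relies on the subplan result rels appearing in the same order as the leaf partitions: a single merge-style walk pairs each leaf with its subplan entry, if any, sharing the existing child-to-root map; the remaining leaves presumably each get a freshly built map. A hypothetical Python model of that walk (names are illustrative):

```python
def share_maps(leaf_oids, subplan_oids, subplan_maps, build_map):
    """Model of building mt_perleaf_childparent_maps for UPDATE.

    subplan_oids/subplan_maps are parallel lists in subplan order, which is
    assumed to be a subsequence of leaf_oids (partition-bound order).
    """
    perleaf = []
    j = 0   # position in the per-subplan arrays, advanced on each match
    for oid in leaf_oids:
        if j < len(subplan_oids) and oid == subplan_oids[j]:
            perleaf.append(subplan_maps[j])   # share the per-subplan map
            j += 1
        else:
            perleaf.append(build_map(oid))    # build a new map for this leaf
    return perleaf


maps = share_maps([10, 11, 12], [11], ["m11"], lambda oid: "new%d" % oid)
```

Here only the leaf with OID 11 has an UPDATE subplan, so it shares the per-subplan map while the other two leaves get new maps built for them.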
@@ -1470,63 +1667,115 @@ static void
 ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 {
 	ResultRelInfo *targetRelInfo = getASTriggerResultRelInfo(mtstate);
+	ResultRelInfo **resultRelInfos;
+	int			numResultRelInfos;
+	int			update_rri_index = -1;
+	ResultRelInfo *update_rri = mtstate->resultRelInfo;
+	Oid			cur_reloid = InvalidOid;
 	int			i;
 
 	/* Check for transition tables on the directly targeted relation. */
 	mtstate->mt_transition_capture =
 		MakeTransitionCaptureState(targetRelInfo->ri_TrigDesc);
 
+	if (mtstate->mt_transition_capture == NULL)
+		return;
+
 	/*
-	 * If we found that we need to collect transition tuples then we may also
+	 * Now that we know that we need to collect transition tuples, we may also
 	 * need tuple conversion maps for any children that have TupleDescs that
 	 * aren't compatible with the tuplestores.
 	 */
-	if (mtstate->mt_transition_capture != NULL)
+
+	/* Make sure per-subplan mapping is there. */
+	if (mtstate->operation != CMD_INSERT)
+		ExecSetupPerSubplanChildParentMap(mtstate);
+
+	/*
+	 * Install the conversion map for the first plan for UPDATE and DELETE
+	 * operations.  It will be advanced each time we switch to the next plan.
+	 * (INSERT operations set it every time.)
+	 */
+	if (mtstate->mt_persubplan_childparent_maps)
+	{
+		mtstate->mt_transition_capture->tcs_map =
+			mtstate->mt_persubplan_childparent_maps[0];
+	}
+
+	/* If no tuple routing, return without setting up per-leaf-partition map */
+	if (mtstate->mt_partition_dispatch_info == NULL)
+		return;
+
+	numResultRelInfos = mtstate->mt_num_partitions;
+	resultRelInfos = mtstate->mt_partitions;
+
+	/*
+	 * Build array of conversion maps from each child's TupleDesc to the one
+	 * used in the tuplestore.  The map pointers may be NULL when no
+	 * conversion is necessary, which is hopefully a common case for
+	 * partitions.
+	 */
+	mtstate->mt_perleaf_childparent_maps = (TupleConversionMap **)
+		palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+
+	/* For Inserts, just create all new map elements. */
+	if (mtstate->operation == CMD_INSERT)
 	{
-		ResultRelInfo *resultRelInfos;
-		int			numResultRelInfos;
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_perleaf_childparent_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
+									   RelationGetDescr(targetRelInfo->ri_RelationDesc),
+									   gettext_noop("could not convert row type"));
+		}
+		return;
+	}
 
-		/* Find the set of partitions so that we can find their TupleDescs. */
-		if (mtstate->mt_partition_dispatch_info != NULL)
+	/*
+	 * For UPDATEs, we can reuse the per-subplan maps for those leaf
+	 * partitions that are also subplan result rels.
+	 */
+	update_rri_index = 0;
+	update_rri = mtstate->resultRelInfo;
+	if (mtstate->mt_nplans > 0)
+		cur_reloid = RelationGetRelid(update_rri[0].ri_RelationDesc);
+
+	for (i = 0; i < numResultRelInfos; ++i)
+	{
+		ResultRelInfo *resultRelInfo = mtstate->mt_partitions[i];
+
+		/* Is this leaf partition present among the update result rels? */
+		if (cur_reloid == RelationGetRelid(resultRelInfo->ri_RelationDesc))
 		{
+			Assert(update_rri_index < mtstate->mt_nplans);
+
+			mtstate->mt_perleaf_childparent_maps[i] =
+				mtstate->mt_persubplan_childparent_maps[update_rri_index];
+			update_rri_index++;
+
 			/*
-			 * For INSERT via partitioned table, so we need TupleDescs based
-			 * on the partition routing table.
+			 * If this was the last UPDATE resultrel, indicate that by
+			 * invalidating cur_reloid.
 			 */
-			resultRelInfos = mtstate->mt_partitions;
-			numResultRelInfos = mtstate->mt_num_partitions;
+			if (update_rri_index == mtstate->mt_nplans)
+				cur_reloid = InvalidOid;
+			else
+				cur_reloid = RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc);
 		}
 		else
 		{
-			/* Otherwise we need the ResultRelInfo for each subplan. */
-			resultRelInfos = mtstate->resultRelInfo;
-			numResultRelInfos = mtstate->mt_nplans;
-		}
-
-		/*
-		 * Build array of conversion maps from each child's TupleDesc to the
-		 * one used in the tuplestore.  The map pointers may be NULL when no
-		 * conversion is necessary, which is hopefully a common case for
-		 * partitions.
-		 */
-		mtstate->mt_transition_tupconv_maps = (TupleConversionMap **)
-			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
-		for (i = 0; i < numResultRelInfos; ++i)
-		{
-			mtstate->mt_transition_tupconv_maps[i] =
-				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
+			mtstate->mt_perleaf_childparent_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfo->ri_RelationDesc),
 									   RelationGetDescr(targetRelInfo->ri_RelationDesc),
 									   gettext_noop("could not convert row type"));
 		}
-
-		/*
-		 * Install the conversion map for the first plan for UPDATE and DELETE
-		 * operations.  It will be advanced each time we switch to the next
-		 * plan.  (INSERT operations set it every time.)
-		 */
-		mtstate->mt_transition_capture->tcs_map =
-			mtstate->mt_transition_tupconv_maps[0];
 	}
+
+	/*
+	 * We should have found all the per-subplan reloids in the leaf
+	 * partitions.
+	 */
+	Assert(update_rri_index == mtstate->mt_nplans);
 }
 
 /* ----------------------------------------------------------------
@@ -1632,9 +1881,9 @@ ExecModifyTable(PlanState *pstate)
 				if (node->mt_transition_capture != NULL)
 				{
 					/* Prepare to convert transition tuples from this child. */
-					Assert(node->mt_transition_tupconv_maps != NULL);
+					Assert(node->mt_persubplan_childparent_maps != NULL);
 					node->mt_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						node->mt_persubplan_childparent_maps[node->mt_whichplan];
 				}
 				continue;
 			}
@@ -1750,7 +1999,8 @@ ExecModifyTable(PlanState *pstate)
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1795,9 +2045,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ResultRelInfo *resultRelInfo;
 	TupleDesc	tupDesc;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	bool		update_tuple_routing_needed = node->part_cols_updated;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -1870,6 +2123,15 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(resultRelInfo, mtstate->mt_onconflict != ONCONFLICT_NONE);
 
+		/*
+		 * If this is an UPDATE and a BEFORE UPDATE trigger is present, we may
+		 * need to do update tuple routing.
+		 */
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+			operation == CMD_UPDATE)
+			update_tuple_routing_needed = true;
+
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
 		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
@@ -1907,33 +2169,63 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/* Decide whether we need to perform update tuple routing. */
+	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		update_tuple_routing_needed = false;
+
+	/*
+	 * Build state for tuple routing if it's an INSERT or if it's an UPDATE of
+	 * partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || update_tuple_routing_needed))
 	{
 		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo *partitions;
-		TupleConversionMap **partition_tupconv_maps;
+		ResultRelInfo **partitions;
+		TupleConversionMap **perleaf_parentchild_maps;
 		TupleTableSlot *partition_tuple_slot;
 		int			num_parted,
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(rel,
+									   (operation == CMD_UPDATE ?
+										mtstate->resultRelInfo : NULL),
+									   (operation == CMD_UPDATE ? nplans : 0),
 									   node->nominalRelation,
 									   estate,
 									   &partition_dispatch_info,
 									   &partitions,
-									   &partition_tupconv_maps,
+									   &perleaf_parentchild_maps,
 									   &partition_tuple_slot,
 									   &num_parted, &num_partitions);
 		mtstate->mt_partition_dispatch_info = partition_dispatch_info;
 		mtstate->mt_num_dispatch = num_parted;
 		mtstate->mt_partitions = partitions;
 		mtstate->mt_num_partitions = num_partitions;
-		mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
+		mtstate->mt_perleaf_parentchild_maps = perleaf_parentchild_maps;
 		mtstate->mt_partition_tuple_slot = partition_tuple_slot;
+		mtstate->mt_rootpartition_tuple_slot = MakeTupleTableSlot();
+
+		/*
+		 * Below are required as reference objects for mapping partition
+		 * attno's in expressions such as WCO and RETURNING.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	}
 
+	/*
+	 * Construct a mapping from each resultRelInfo's attnos to the root
+	 * table's attnos. This is required during update row movement, when the
+	 * tuple descriptor of a source partition does not match the root
+	 * partition's descriptor. In such a case we need to convert tuples to
+	 * the root partition's tuple descriptor, because the search for the
+	 * destination partition starts from the root. Skip this setup if it's
+	 * not a partition key update.
+	 */
+	if (update_tuple_routing_needed)
+		ExecSetupPerSubplanChildParentMap(mtstate);
+
 	/* Build state for collecting transition tuples */
 	ExecSetupTransitionCaptureState(mtstate, estate);
 
@@ -1967,50 +2259,54 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE row
+	 * movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *firstWco;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition. For UPDATE, in contrast, there are as many WCO lists as
+		 * there are plans. In either case, use the WCO expression of the first
+		 * resultRelInfo as a reference to calculate attno's for the WCO
+		 * expression of each of the partitions. We make a copy of the WCO
+		 * qual for each partition. Note that, if there are SubPlans in there,
+		 * they all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
-			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
-		resultRelInfo = mtstate->mt_partitions;
+		Assert(update_tuple_routing_needed ||
+			   (operation == CMD_INSERT &&
+				list_length(node->withCheckOptionLists) == 1 &&
+				mtstate->mt_nplans == 1));
+
+		firstWco = linitial(node->withCheckOptionLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-			List	   *mapped_wcoList;
+			Relation	partrel;
+			List	   *mappedWco;
 			List	   *wcoExprs = NIL;
 			ListCell   *ll;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel, NULL);
-			foreach(ll, mapped_wcoList)
+			resultRelInfo = mtstate->mt_partitions[i];
+
+			partrel = resultRelInfo->ri_RelationDesc;
+			mappedWco = map_partition_varattnos(firstWco,
+												firstVarno,
+												partrel, firstResultRel,
+												NULL);
+			foreach(ll, mappedWco)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
 
-			resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
+			resultRelInfo->ri_WithCheckOptions = mappedWco;
 			resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
-			resultRelInfo++;
 		}
 	}
 
@@ -2021,7 +2317,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -2058,20 +2354,26 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		resultRelInfo = mtstate->mt_partitions;
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
+			Relation	partrel;
 			List	   *rlist;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel, NULL);
+			resultRelInfo = mtstate->mt_partitions[i];
+			partrel = resultRelInfo->ri_RelationDesc;
+
+			/*
+			 * Use the returning expression of the first resultRelInfo as a
+			 * reference to calculate attno's for the returning expression of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList,
+											firstVarno,
+											partrel, firstResultRel, NULL);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
@@ -2317,6 +2619,7 @@ void
 ExecEndModifyTable(ModifyTableState *node)
 {
 	int			i;
+	CmdType		operation = node->operation;
 
 	/* Free transition tables */
 	if (node->mt_transition_capture != NULL)
@@ -2353,13 +2656,25 @@ ExecEndModifyTable(ModifyTableState *node)
 	}
 	for (i = 0; i < node->mt_num_partitions; i++)
 	{
-		ResultRelInfo *resultRelInfo = node->mt_partitions + i;
+		ResultRelInfo *resultRelInfo = node->mt_partitions[i];
+
+		/*
+		 * If this result rel is one of the subplan result rels, let
+		 * ExecEndPlan() close it. For INSERTs, this does not apply, because
+		 * all leaf partition result rels are newly allocated anyway.
+		 */
+		if (operation == CMD_UPDATE &&
+			resultRelInfo >= node->resultRelInfo &&
+			resultRelInfo < node->resultRelInfo + node->mt_nplans)
+			continue;
 
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
 
-	/* Release the standalone partition tuple descriptor, if any */
+	/* Release the standalone partition tuple slots, if any */
+	if (node->mt_rootpartition_tuple_slot)
+		ExecDropSingleTupleTableSlot(node->mt_rootpartition_tuple_slot);
 	if (node->mt_partition_tuple_slot)
 		ExecDropSingleTupleTableSlot(node->mt_partition_tuple_slot);
 
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 9bae264..3cdbd97 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -204,6 +204,7 @@ _copyModifyTable(const ModifyTable *from)
 	COPY_SCALAR_FIELD(canSetTag);
 	COPY_SCALAR_FIELD(nominalRelation);
 	COPY_NODE_FIELD(partitioned_rels);
+	COPY_SCALAR_FIELD(part_cols_updated);
 	COPY_NODE_FIELD(resultRelations);
 	COPY_SCALAR_FIELD(resultRelIndex);
 	COPY_SCALAR_FIELD(rootResultRelIndex);
@@ -2260,6 +2261,7 @@ _copyPartitionedChildRelInfo(const PartitionedChildRelInfo *from)
 
 	COPY_SCALAR_FIELD(parent_relid);
 	COPY_NODE_FIELD(child_rels);
+	COPY_BITMAPSET_FIELD(all_part_cols);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 11731da..a410e46 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -909,6 +909,7 @@ _equalPartitionedChildRelInfo(const PartitionedChildRelInfo *a, const Partitione
 {
 	COMPARE_SCALAR_FIELD(parent_relid);
 	COMPARE_NODE_FIELD(child_rels);
+	COMPARE_BITMAPSET_FIELD(all_part_cols);
 
 	return true;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 9ee3e23..f642bf2 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -367,6 +367,7 @@ _outModifyTable(StringInfo str, const ModifyTable *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_INT_FIELD(resultRelIndex);
 	WRITE_INT_FIELD(rootResultRelIndex);
@@ -2096,6 +2097,7 @@ _outModifyTablePath(StringInfo str, const ModifyTablePath *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_NODE_FIELD(subroots);
@@ -2518,6 +2520,7 @@ _outPartitionedChildRelInfo(StringInfo str, const PartitionedChildRelInfo *node)
 
 	WRITE_UINT_FIELD(parent_relid);
 	WRITE_NODE_FIELD(child_rels);
+	WRITE_BITMAPSET_FIELD(all_part_cols);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 67b9e19..89dd3cf 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1562,6 +1562,7 @@ _readModifyTable(void)
 	READ_BOOL_FIELD(canSetTag);
 	READ_UINT_FIELD(nominalRelation);
 	READ_NODE_FIELD(partitioned_rels);
+	READ_BOOL_FIELD(part_cols_updated);
 	READ_NODE_FIELD(resultRelations);
 	READ_INT_FIELD(resultRelIndex);
 	READ_INT_FIELD(rootResultRelIndex);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 2d7e1d8..8c08d50 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1291,7 +1291,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	rte = planner_rt_fetch(rel->relid, root);
 	if (rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		partitioned_rels = get_partitioned_child_rels(root, rel->relid);
+		partitioned_rels = get_partitioned_child_rels(root, rel->relid, NULL);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
 	}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 2821662..85e3126 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -277,6 +277,7 @@ static ProjectSet *make_project_set(List *tlist, Plan *subplan);
 static ModifyTable *make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
@@ -2361,6 +2362,7 @@ create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 							best_path->canSetTag,
 							best_path->nominalRelation,
 							best_path->partitioned_rels,
+							best_path->part_cols_updated,
 							best_path->resultRelations,
 							subplans,
 							best_path->withCheckOptionLists,
@@ -6405,6 +6407,7 @@ static ModifyTable *
 make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam)
@@ -6431,6 +6434,7 @@ make_modifytable(PlannerInfo *root,
 	node->canSetTag = canSetTag;
 	node->nominalRelation = nominalRelation;
 	node->partitioned_rels = partitioned_rels;
+	node->part_cols_updated = part_cols_updated;
 	node->resultRelations = resultRelations;
 	node->resultRelIndex = -1;	/* will be set correctly in setrefs.c */
 	node->rootResultRelIndex = -1;	/* will be set correctly in setrefs.c */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 6b79b3a..68e0302 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1056,6 +1056,7 @@ inheritance_planner(PlannerInfo *root)
 	Index		rti;
 	RangeTblEntry *parent_rte;
 	List	   *partitioned_rels = NIL;
+	bool		part_cols_updated = false;
 
 	Assert(parse->commandType != CMD_INSERT);
 
@@ -1370,9 +1371,15 @@ inheritance_planner(PlannerInfo *root)
 
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		partitioned_rels = get_partitioned_child_rels(root, parentRTindex);
+		Bitmapset  *all_part_cols = NULL;
+
+		partitioned_rels = get_partitioned_child_rels(root, parentRTindex,
+													  &all_part_cols);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
+
+		if (bms_overlap(all_part_cols, parent_rte->updatedCols))
+			part_cols_updated = true;
 	}
 
 	/* Result path must go into outer query's FINAL upperrel */
@@ -1429,6 +1436,7 @@ inheritance_planner(PlannerInfo *root)
 									 parse->canSetTag,
 									 nominalRelation,
 									 partitioned_rels,
+									 part_cols_updated,
 									 resultRelations,
 									 subpaths,
 									 subroots,
@@ -2046,6 +2054,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 										parse->canSetTag,
 										parse->resultRelation,
 										NIL,
+										false,
 										list_make1_int(parse->resultRelation),
 										list_make1(path),
 										list_make1(root),
@@ -6076,10 +6085,15 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
  *		Returns a list of the RT indexes of the partitioned child relations
  *		with rti as the root parent RT index.
  *
+ * If all_part_cols_p is non-NULL, *all_part_cols_p is set to a bitmapset
+ * of all partitioning columns used by the partitioned table or any of its
+ * descendants.
+ *
  * Note: Only call this function on RTEs known to be partitioned tables.
  */
 List *
-get_partitioned_child_rels(PlannerInfo *root, Index rti)
+get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   Bitmapset **all_part_cols_p)
 {
 	List	   *result = NIL;
 	ListCell   *l;
@@ -6091,6 +6105,8 @@ get_partitioned_child_rels(PlannerInfo *root, Index rti)
 		if (pc->parent_relid == rti)
 		{
 			result = pc->child_rels;
+			if (all_part_cols_p)
+				*all_part_cols_p = pc->all_part_cols;
 			break;
 		}
 	}
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index ccf2145..fc7c597 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -107,12 +107,14 @@ static void expand_partitioned_rtentry(PlannerInfo *root,
 						   PlanRowMark *parentrc, PartitionDesc partdesc,
 						   LOCKMODE lockmode,
 						   bool *has_child, List **appinfos,
+						   Bitmapset **all_part_cols,
 						   List **partitioned_child_rels);
 static void expand_single_inheritance_child(PlannerInfo *root,
 								RangeTblEntry *parentrte,
 								Index parentRTindex, Relation parentrel,
 								PlanRowMark *parentrc, Relation childrel,
 								bool *has_child, List **appinfos,
+								Bitmapset **all_part_cols,
 								List **partitioned_child_rels);
 static void make_inh_translation_list(Relation oldrelation,
 						  Relation newrelation,
@@ -1397,6 +1399,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	bool		has_child;
 	PartitionedChildRelInfo *pcinfo;
 	List	   *partitioned_child_rels = NIL;
+	Bitmapset  *all_part_cols = NULL;
 
 	/* Does RT entry allow inheritance? */
 	if (!rte->inh)
@@ -1479,11 +1482,13 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 		expand_single_inheritance_child(root, rte, rti, oldrelation, oldrc,
 										oldrelation,
 										&has_child, &appinfos,
+										&all_part_cols,
 										&partitioned_child_rels);
 		expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
 									  RelationGetPartitionDesc(oldrelation),
 									  lockmode,
 									  &has_child, &appinfos,
+									  &all_part_cols,
 									  &partitioned_child_rels);
 	}
 	else
@@ -1519,6 +1524,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 			expand_single_inheritance_child(root, rte, rti, oldrelation, oldrc,
 											newrelation,
 											&has_child, &appinfos,
+											&all_part_cols,
 											&partitioned_child_rels);
 
 			/* Close child relations, but keep locks */
@@ -1558,6 +1564,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 		Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
 		pcinfo->parent_relid = rti;
 		pcinfo->child_rels = partitioned_child_rels;
+		pcinfo->all_part_cols = all_part_cols;
 		root->pcinfo_list = lappend(root->pcinfo_list, pcinfo);
 	}
 
@@ -1571,6 +1578,7 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 						   PlanRowMark *parentrc, PartitionDesc partdesc,
 						   LOCKMODE lockmode,
 						   bool *has_child, List **appinfos,
+						   Bitmapset **all_part_cols,
 						   List **partitioned_child_rels)
 {
 	int			i;
@@ -1595,6 +1603,7 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 		expand_single_inheritance_child(root, parentrte, parentRTindex,
 										parentrel, parentrc, childrel,
 										has_child, appinfos,
+										all_part_cols,
 										partitioned_child_rels);
 
 		/* If this child is itself partitioned, recurse */
@@ -1604,6 +1613,7 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 										  RelationGetPartitionDesc(childrel),
 										  lockmode,
 										  has_child, appinfos,
+										  all_part_cols,
 										  partitioned_child_rels);
 
 		/* Close child relation, but keep locks */
@@ -1625,6 +1635,7 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 								Index parentRTindex, Relation parentrel,
 								PlanRowMark *parentrc, Relation childrel,
 								bool *has_child, List **appinfos,
+								Bitmapset **all_part_cols,
 								List **partitioned_child_rels)
 {
 	Query	   *parse = root->parse;
@@ -1695,8 +1706,11 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 		}
 	}
 	else
+	{
 		*partitioned_child_rels = lappend_int(*partitioned_child_rels,
 											  childRTindex);
+		pull_child_partition_columns(all_part_cols, childrel, parentrel);
+	}
 
 	/*
 	 * Build a PlanRowMark if parent is marked FOR UPDATE/SHARE.
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 26567cb..326c858 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3162,6 +3162,8 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partitioned_rels' is an integer list of RT indexes of non-leaf tables in
  *		the partition tree, if this is an UPDATE/DELETE to a partitioned table.
  *		Otherwise NIL.
+ * 'part_cols_updated' is true if any partitioning columns are being updated,
+ *		either by the named relation or by a descendant partitioned table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
  * 'subpaths' is a list of Path(s) producing source data (one per rel)
  * 'subroots' is a list of PlannerInfo structs (one per rel)
@@ -3175,6 +3177,7 @@ ModifyTablePath *
 create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
@@ -3242,6 +3245,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->canSetTag = canSetTag;
 	pathnode->nominalRelation = nominalRelation;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
+	pathnode->part_cols_updated = part_cols_updated;
 	pathnode->resultRelations = resultRelations;
 	pathnode->subpaths = subpaths;
 	pathnode->subroots = subroots;
diff --git a/src/backend/rewrite/rewriteManip.c b/src/backend/rewrite/rewriteManip.c
index ba706b2..ab72b36 100644
--- a/src/backend/rewrite/rewriteManip.c
+++ b/src/backend/rewrite/rewriteManip.c
@@ -1224,6 +1224,7 @@ typedef struct
 	/* Target type when converting whole-row vars */
 	Oid			to_rowtype;
 	bool	   *found_whole_row;	/* output flag */
+	bool		coerced_var;	/* var is under ConvertRowTypeExpr */
 } map_variable_attnos_context;
 
 static Node *
@@ -1267,22 +1268,29 @@ map_variable_attnos_mutator(Node *node,
 					/* Don't convert unless necessary. */
 					if (context->to_rowtype != var->vartype)
 					{
-						ConvertRowtypeExpr *r;
-
 						/* Var itself is converted to the requested type. */
 						newvar->vartype = context->to_rowtype;
 
 						/*
-						 * And a conversion node on top to convert back to the
-						 * original type.
+						 * If this var is already under a ConvertRowtypeExpr,
+						 * we don't have to add another one.
 						 */
-						r = makeNode(ConvertRowtypeExpr);
-						r->arg = (Expr *) newvar;
-						r->resulttype = var->vartype;
-						r->convertformat = COERCE_IMPLICIT_CAST;
-						r->location = -1;
-
-						return (Node *) r;
+						if (!context->coerced_var)
+						{
+							ConvertRowtypeExpr *r;
+
+							/*
+							 * And a conversion node on top to convert back to
+							 * the original type.
+							 */
+							r = makeNode(ConvertRowtypeExpr);
+							r->arg = (Expr *) newvar;
+							r->resulttype = var->vartype;
+							r->convertformat = COERCE_IMPLICIT_CAST;
+							r->location = -1;
+
+							return (Node *) r;
+						}
 					}
 				}
 			}
@@ -1290,6 +1298,28 @@ map_variable_attnos_mutator(Node *node,
 		}
 		/* otherwise fall through to copy the var normally */
 	}
+	else if (IsA(node, ConvertRowtypeExpr))
+	{
+		ConvertRowtypeExpr *r = (ConvertRowtypeExpr *) node;
+
+		/*
+		 * If this expression coerces a var (the typical case), convert just
+		 * the var instead of adding another ConvertRowtypeExpr on top of it.
+		 */
+		if (IsA(r->arg, Var))
+		{
+			ConvertRowtypeExpr *newnode;
+
+			newnode = (ConvertRowtypeExpr *) palloc(sizeof(ConvertRowtypeExpr));
+			*newnode = *r;
+			context->coerced_var = true;
+			newnode->arg = (Expr *) map_variable_attnos_mutator((Node *) r->arg, context);
+			context->coerced_var = false;
+
+			return (Node *) newnode;
+		}
+		/* Else fall through to the expression tree mutator */
+	}
 	else if (IsA(node, Query))
 	{
 		/* Recurse into RTE subquery or not-yet-planned sublink subquery */
@@ -1321,6 +1351,7 @@ map_variable_attnos(Node *node,
 	context.map_length = map_length;
 	context.to_rowtype = to_rowtype;
 	context.found_whole_row = found_whole_row;
+	context.coerced_var = false;
 
 	*found_whole_row = false;
 
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 2283c67..2e29276 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -68,6 +68,12 @@ typedef struct PartitionDispatchData
 	int		   *indexes;
 } PartitionDispatchData;
 
+typedef struct PartitionWalker
+{
+	List	   *rels_list;
+	ListCell   *cur_cell;
+} PartitionWalker;
+
 typedef struct PartitionDispatchData *PartitionDispatch;
 
 extern void RelationBuildPartitionDesc(Relation relation);
@@ -80,12 +86,16 @@ extern void check_new_partition_bound(char *relname, Relation parent,
 extern Oid	get_partition_parent(Oid relid);
 extern List *get_qual_from_partbound(Relation rel, Relation parent,
 						PartitionBoundSpec *spec);
-extern List *map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+extern List *map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
 
+extern void partition_walker_init(PartitionWalker *walker, Relation rel);
+extern Relation partition_walker_next(PartitionWalker *walker,
+									  Relation *parent);
+
 /* For tuple routing */
 extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
 								 int *num_parted, List **leaf_part_oids);
@@ -99,4 +109,8 @@ extern int get_partition_for_tuple(PartitionDispatch *pd,
 						EState *estate,
 						PartitionDispatchData **failed_at,
 						TupleTableSlot **failed_slot);
+extern void pull_child_partition_columns(Bitmapset **bitmapset,
+							 Relation rel,
+							 Relation parent);
+
 #endif							/* PARTITION_H */
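For readers following along: the PartitionWalker added to partition.h keeps a growing rels_list plus a cursor into it, which yields a breadth-first walk of a partition tree — each partitioned relation visited appends its own children to the tail of the list. Here is a self-contained Python sketch of that traversal order (the dict-based tree and method names are invented for illustration; the patch's real API is C, built on List/ListCell):

```python
# Toy partition tree: parent name -> list of child partitions.
# Mirrors the example hierarchy from earlier in the thread (tab -> t1, t2 -> ...).
TREE = {
    "tab": ["t1", "t2"],
    "t1": ["t1_1", "t1_2"],
    "t2": ["t2_1", "t2_2"],
}

class PartitionWalker:
    """Breadth-first walker: rels_list grows as partitioned rels are
    visited; cur plays the role of the cur_cell cursor in the C struct."""

    def __init__(self, rel):
        self.rels_list = [rel]
        self.cur = 0

    def next(self):
        if self.cur >= len(self.rels_list):
            return None
        rel = self.rels_list[self.cur]
        self.cur += 1
        # Append this relation's own partitions, if any, to the tail.
        self.rels_list.extend(TREE.get(rel, []))
        return rel

walker = PartitionWalker("tab")
order = []
while (rel := walker.next()) is not None:
    order.append(rel)
print(order)
```

Note that starting the walk at an intermediate relation such as t2 enumerates only that subtree — which is exactly what confines row movement to the subtree the UPDATE was fired on (point 1 upthread).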
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index f48a603..67c2c9f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,7 +187,10 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
@@ -207,10 +210,12 @@ extern void EvalPlanQualSetTuple(EPQState *epqstate, Index rti,
 					 HeapTuple tuple);
 extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
 extern void ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionDispatch **pd,
-							   ResultRelInfo **partitions,
+							   ResultRelInfo ***partitions,
 							   TupleConversionMap ***tup_conv_maps,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions);
@@ -218,6 +223,8 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot, EState *estate);
 
 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
 extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
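On the executor.h changes: ExecSetupPartitionTupleRouting now also receives the UPDATE's existing per-subplan result relations (update_rri, num_update_rri), and the partitions array became an array of pointers (ResultRelInfo ***). One plausible reading is that a leaf already opened as an UPDATE result relation is reused for tuple routing rather than opened a second time. A Python sketch of that matching — dict-based, with invented field names, not the patch's C code:

```python
def setup_partition_tuple_routing(leaf_oids, update_rris):
    """Return one entry per leaf partition, reusing an existing UPDATE
    result-rel entry when the leaf already has one (sketch only)."""
    by_oid = {rri["oid"]: rri for rri in update_rris}
    partitions = []
    for oid in leaf_oids:
        existing = by_oid.get(oid)
        if existing is not None:
            partitions.append(existing)      # reuse; do not re-open the rel
        else:
            partitions.append({"oid": oid, "opened_for_routing": True})
    return partitions

update_rris = [{"oid": 1001}, {"oid": 1003}]  # leaves scanned by the subplans
parts = setup_partition_tuple_routing([1001, 1002, 1003], update_rris)
print([p["oid"] for p in parts])
```

An array of pointers makes the reuse cheap: the routing array can point straight at the subplan's ResultRelInfo instead of copying it.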
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 90a60ab..3034b01 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -511,6 +511,11 @@ typedef struct EState
 	struct dsa_area *es_query_dsa;
 } EState;
 
+/* For a given result relation, get its columns being inserted/updated. */
+#define GetInsertedColumns(relinfo, estate) \
+	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->insertedCols)
+#define GetUpdatedColumns(relinfo, estate) \
+	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)
 
 /*
  * ExecRowMark -
@@ -978,14 +983,31 @@ typedef struct ModifyTableState
 	int			mt_num_dispatch;	/* Number of entries in the above array */
 	int			mt_num_partitions;	/* Number of members in the following
 									 * arrays */
-	ResultRelInfo *mt_partitions;	/* Per partition result relation */
-	TupleConversionMap **mt_partition_tupconv_maps;
-	/* Per partition tuple conversion map */
+	ResultRelInfo **mt_partitions;	/* Per partition result relation pointers */
+
+	/*
+	 * Per partition conversion map to convert tuples from root to leaf
+	 * partition
+	 */
+	TupleConversionMap **mt_perleaf_parentchild_maps;
+
+	/*
+	 * Per partition conversion map to convert tuples from leaf partition to
+	 * root
+	 */
+	TupleConversionMap **mt_perleaf_childparent_maps;
+
+	/*
+	 * Per subplan conversion map to convert tuples from leaf partition to
+	 * root partitioned table
+	 */
+	TupleConversionMap **mt_persubplan_childparent_maps;
+
 	TupleTableSlot *mt_partition_tuple_slot;
+	TupleTableSlot *mt_rootpartition_tuple_slot;
+
 	struct TransitionCaptureState *mt_transition_capture;
 	/* controls transition table population */
-	TupleConversionMap **mt_transition_tupconv_maps;
-	/* Per plan/partition tuple conversion */
 } ModifyTableState;
 
 /* ----------------
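The three conversion-map arrays added to ModifyTableState exist because a leaf partition may order its columns differently from the root (the regression tests below deliberately attach part_c_100_200 with columns (c, a, b)). The conversion itself is just an attribute-number remap; a minimal Python sketch:

```python
def build_attr_map(from_cols, to_cols):
    """attr_map[i] = position in from_cols of the column named to_cols[i]."""
    return [from_cols.index(c) for c in to_cols]

def convert_tuple(tup, attr_map):
    """Reorder a tuple's values according to the attribute map."""
    return tuple(tup[i] for i in attr_map)

ROOT = ("a", "b", "c")          # range_parted
LEAF = ("c", "a", "b")          # part_c_100_200 in the tests below
leaf_to_root = build_attr_map(LEAF, ROOT)   # the childparent direction
root_to_leaf = build_attr_map(ROOT, LEAF)   # the parentchild direction
row = (105, "b", 15)            # as stored in the leaf: (c, a, b)
print(convert_tuple(row, leaf_to_root))
```

Row movement needs both directions: child-to-parent to bring the old tuple into the root's layout before re-routing, and parent-to-child to store it in the destination leaf's layout.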
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index a382331..6981f58 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -219,6 +219,7 @@ typedef struct ModifyTable
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	int			resultRelIndex; /* index of first resultRel in plan's list */
 	int			rootResultRelIndex; /* index of the partitioned table root */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index a39e59d..e3ff127 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1579,6 +1579,7 @@ typedef struct ModifyTablePath
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	List	   *subpaths;		/* Path(s) producing source data */
 	List	   *subroots;		/* per-target-table PlannerInfos */
@@ -2021,6 +2022,10 @@ typedef struct AppendRelInfo
  * The child_rels list must contain at least one element, because the parent
  * partitioned table is itself counted as a child.
  *
+ * all_part_cols contains all attribute numbers from the parent that are
+ * used as partitioning columns by the parent or some descendant which is
+ * itself partitioned.
+ *
  * These structs are kept in the PlannerInfo node's pcinfo_list.
  */
 typedef struct PartitionedChildRelInfo
@@ -2029,6 +2034,7 @@ typedef struct PartitionedChildRelInfo
 
 	Index		parent_relid;
 	List	   *child_rels;
+	Bitmapset  *all_part_cols;
 } PartitionedChildRelInfo;
 
 /*
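The new all_part_cols field collects the partition-key columns of every partitioned table in the hierarchy, so the planner can set part_cols_updated whenever any of them appears among the UPDATE's target columns. A recursive Python sketch of that collection, over a toy catalog with invented names:

```python
def pull_all_part_cols(rel, partkeys, children):
    """Union of partition-key columns of rel and of every partitioned
    descendant below it (sketch of the all_part_cols computation)."""
    cols = set(partkeys.get(rel, ()))
    for child in children.get(rel, ()):
        cols |= pull_all_part_cols(child, partkeys, children)
    return cols

# tab is partitioned by (a, b); its child t2 is itself partitioned by (c).
partkeys = {"tab": ("a", "b"), "t2": ("c",)}
children = {"tab": ["t1", "t2"]}

all_part_cols = pull_all_part_cols("tab", partkeys, children)
updated_cols = {"c"}            # e.g. UPDATE tab SET c = ...
part_cols_updated = bool(all_part_cols & updated_cols)
print(sorted(all_part_cols), part_cols_updated)
```

Updating c through the root must still be treated as a potential row-movement UPDATE even though c is not part of the root's own partition key — only of a descendant's.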
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e372f88..b38f2f1 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -238,6 +238,7 @@ extern ModifyTablePath *create_modifytable_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index 2a4cf71..c6c15c5 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -57,6 +57,7 @@ extern Expr *preprocess_phv_expression(PlannerInfo *root, Expr *expr);
 
 extern bool plan_cluster_use_sort(Oid tableOid, Oid indexOid);
 
-extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti);
+extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti,
+										Bitmapset **all_part_cols_p);
 
 #endif							/* PLANNER_H */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 9366f04..6c0036b 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,25 +198,425 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+---------------------------
+-- UPDATE with row movement
+---------------------------
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c numeric
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+-- Create partitions intentionally in descending bound order, so as to test
+-- that the sub plans are getting ordered in ascending bound order rather than ordered by the oid values.
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (c numeric, a text, b int);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+create table part_c_1_100 (b int, c numeric, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, NULL), (''a'', 10, 200), (''b'', 12, 96), (''b'', 13, 97), (''b'', 15, 105), (''b'', 17, 105)'
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 105
+ part_c_100_200 | b | 17 | 105
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+(6 rows)
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+             QUERY PLAN              
+-------------------------------------
+ Update on range_parted
+   Update on part_a_1_a_10
+   Update on part_a_10_a_20
+   Update on part_b_1_b_10
+   Update on part_c_1_100
+   Update on part_c_100_200
+   ->  Seq Scan on part_a_1_a_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_a_10_a_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_1_b_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_1_100
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_100_200
+         Filter: (c > '97'::numeric)
+(16 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96;
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (12, 116, b).
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+select a, b, c from part_c_1_100 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select a, b, c from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 12 | 116
+ b | 13 | 117
+ b | 15 | 125
+ b | 17 | 125
+(4 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (117, b, 7).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_1_a_10  | a |  1 |    
+ part_a_1_a_10  | a |  4 | 200
+ part_b_1_b_10  | b |  7 | 117
+ part_b_1_b_10  | b |  9 | 125
+ part_c_100_200 | b | 11 | 125
+ part_c_100_200 | b | 12 | 116
+(6 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120).
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_1_a_10  | a |  1 |    
+ part_b_1_b_10  | b |  7 | 117
+ part_b_1_b_10  | b |  9 | 125
+ part_c_100_200 | b | 11 | 125
+ part_c_100_200 | b | 12 | 116
+ part_c_100_200 | b | 15 | 199
+(6 rows)
+
 -- cleanup
-drop table range_parted;
+drop view upview;
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+ range_parted | a | b  | c  
+--------------+---+----+----
+ (b,15,95)    | b | 15 | 95
+ (b,17,95)    | b | 17 | 95
+(2 rows)
+
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+ part_c_1_100   | b | 15 |  95
+ part_c_1_100   | b | 17 |  95
+(6 rows)
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 105
+ part_c_100_200 | b | 17 | 105
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+(6 rows)
+
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96), (b,13,97), (b,15,105), (b,17,105), new table = (b,12,110), (b,13,98), (b,15,106), (b,17,106)
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 12 | 110
+ part_c_100_200 | b | 15 | 106
+ part_c_100_200 | b | 17 | 106
+ part_c_1_100   | b | 13 |  98
+(6 rows)
+
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96), (b,13,97), (b,15,105), (b,17,105), new table = (b,12,146), (b,13,147), (b,15,155), (b,17,155)
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 12 | 146
+ part_c_100_200 | b | 13 | 147
+ part_c_100_200 | b | 15 | 155
+ part_c_100_200 | b | 17 | 155
+(6 rows)
+
+drop trigger trans_updatetrig ON range_parted;
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_c100_200 before update or insert on part_c_100_200
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 110
+ part_c_100_200 | b | 17 | 106
+ part_c_100_200 | b | 19 | 106
+ part_c_1_100   | b | 15 |  98
+(6 rows)
+
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 146
+ part_c_100_200 | b | 16 | 147
+ part_c_100_200 | b | 17 | 155
+ part_c_100_200 | b | 19 | 155
+(6 rows)
+
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_c100_200 ON part_c_100_200;
+drop function func_parted_mod_b();
+-- statement triggers with update row movement
+---------------------------------------------------
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 105
+ part_c_100_200 | b | 17 | 105
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+(6 rows)
+
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_100_200
+create trigger c100_delete_trig
+  after delete on part_c_100_200 for each statement execute procedure trigfunc();
+create trigger c100_update_trig
+  after update on part_c_100_200 for each statement execute procedure trigfunc();
+create trigger c100_insert_trig
+  after insert on part_c_100_200 for each statement execute procedure trigfunc();
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+NOTICE:  trigger = parent_update_trig fired on table range_parted during UPDATE
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 150
+ part_a_1_a_10  | a |  1 |    
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+ part_c_1_100   | b | 15 |  55
+ part_c_1_100   | b | 17 |  55
+(6 rows)
+
+drop table mintab, range_parted CASCADE;
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+ERROR:  new row for relation "sub_part2" violates partition constraint
+DETAIL:  Failing row contains (2, 10, 2).
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Triggers can cause UPDATE row movement if it modified partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+drop trigger parted_mod_b ON sub_part1 ;
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+drop trigger trig_skip_delete ON sub_part1 ;
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 |  1 | 70
+ list_part1 | 2 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop table non_parted;
+drop function func_parted_mod_b ( ) ;
+drop table list_parted;
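The expected output above exercises the core row-movement behavior: when the updated tuple no longer satisfies its leaf's partition constraint, the row is deleted from that leaf and inserted, via tuple routing, into the leaf that accepts it — or an error is raised if none does. A Python sketch of that flow, with predicates standing in for the constraints of part_c_1_100 and part_c_100_200 (this mirrors the ExecDelete-then-ExecInsert sequence described upthread, not the patch's C code):

```python
PARTS = {
    "part_c_1_100":   lambda row: 1 <= row["c"] < 100,
    "part_c_100_200": lambda row: 100 <= row["c"] < 200,
}
DATA = {"part_c_1_100": [{"a": "b", "b": 12, "c": 96}],
        "part_c_100_200": []}

def find_partition(row):
    """Tuple routing: first leaf whose constraint accepts the row."""
    for name, pred in PARTS.items():
        if pred(row):
            return name
    raise ValueError("no partition accepts row")   # constraint violation

def exec_update(name, old, new):
    if PARTS[name](new):            # still fits: ordinary in-place UPDATE
        DATA[name].remove(old)
        DATA[name].append(new)
        return name
    DATA[name].remove(old)          # ExecDelete from the old leaf ...
    dest = find_partition(new)
    DATA[dest].append(new)          # ... then ExecInsert via tuple routing
    return dest

old = {"a": "b", "b": 12, "c": 96}
dest = exec_update("part_c_1_100", old, {"a": "b", "b": 12, "c": 116})
print(dest)
```

The "fail" cases in the tests correspond to find_partition raising: when the UPDATE is fired on a sub-tree (e.g. part_c_1_100 directly), only that sub-tree's leaves are candidates, so a row that would fit elsewhere still errors out.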
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 6637119..da5130d 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,23 +107,253 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+
+---------------------------
+-- UPDATE with row movement
+---------------------------
+
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c numeric
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+
+-- Create partitions intentionally in descending bound order, so as to test
+-- that the sub plans are getting ordered in ascending bound order rather than ordered by the oid values.
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (c numeric, a text, b int);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+create table part_c_1_100 (b int, c numeric, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, NULL), (''a'', 10, 200), (''b'', 12, 96), (''b'', 13, 97), (''b'', 15, 105), (''b'', 17, 105)'
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96;
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+select a, b, c from part_c_1_100 order by 1, 2, 3;
+select a, b, c from part_c_100_200 order by 1, 2, 3;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
 
 -- cleanup
-drop table range_parted;
+drop view upview;
+
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+drop trigger trans_updatetrig ON range_parted;
+
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_c100_200 before update or insert on part_c_100_200
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_c100_200 ON part_c_100_200;
+drop function func_parted_mod_b();
+
+
+-- statement triggers with update row movement
+---------------------------------------------------
+
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_100_200
+create trigger c100_delete_trig
+  after delete on part_c_100_200 for each statement execute procedure trigfunc();
+create trigger c100_update_trig
+  after update on part_c_100_200 for each statement execute procedure trigfunc();
+create trigger c100_insert_trig
+  after insert on part_c_100_200 for each statement execute procedure trigfunc();
+
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+drop table mintab, range_parted CASCADE;
+
+
+
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+
+-- Test partition constraint violation when an intermediate ancestor is used
+-- and the constraint is inherited from the topmost root.
+update sub_parted set a = 2 where c = 10;
+
+-- UPDATE that does not modify the partition key of the partitions chosen for update.
+select tableoid::regclass::text, * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text, * from list_parted where a = 2 order by 1;
+
+
+-----------
+-- Triggers can cause UPDATE row movement if they modify the partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+select tableoid::regclass::text, * from list_parted order by 1, 2, 3, 4;
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b = 1;
+select tableoid::regclass::text, * from list_parted order by 1, 2, 3, 4;
+
+drop trigger parted_mod_b ON sub_part1;
+
+-- If a BR DELETE trigger prevents the DELETE from happening, we should also
+-- skip the INSERT when that DELETE is part of an UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text, * from list_parted order by 1, 2, 3, 4;
+
+drop trigger trig_skip_delete ON sub_part1;
+
+-- UPDATE partition-key with FROM clause. If the join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text, * from list_parted order by 1, 2, 3, 4;
+drop table non_parted;
+
+drop function func_parted_mod_b();
+drop table list_parted;
#149Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#144)
Re: UPDATE of partition key

On 3 September 2017 at 17:10, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

After recent commit 30833ba154, the partitions are now expanded in
depth-first order. It didn't seem worthwhile rebasing my partition
walker changes onto the latest code, so in the attached patch I have
removed all the partition walker changes. But
RelationGetPartitionDispatchInfo() traverses in breadth-first order,
which differs from the order of the update result rels (because
inheritance expansion order is depth-first). So, in order to make the
tuple-routing-related leaf partitions follow the same order as the
update result rels, we would have to change
RelationGetPartitionDispatchInfo(), and I am not sure whether that will
be done as part of the thread "expanding inheritance in partition
bound order" [1]. For now, in the attached patch, I have reverted to
the hash table method for finding the leaf partitions in the update
result rels.

[1] /messages/by-id/CAJ3gD9eyudCNU6V-veMme+eyzfX_ey+gEzULMzOw26c3f9rzdg@mail.gmail.com

As mentioned by Amit Langote in the above mail thread, he is going to
change RelationGetPartitionDispatchInfo() so that it returns the leaf
partitions in depth-first order. Once that is done, I will remove the
hash table method for finding leaf partitions in the update result
rels, and instead use the earlier, more efficient method that takes
advantage of the fact that the update result rels and the leaf
partitions are in the same order.

Thanks
-Amit Khandekar

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#150Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#149)
1 attachment(s)
Re: UPDATE of partition key

Attached is the patch rebased on latest HEAD.

Thanks
-Amit Khandekar

Attachments:

update-partition-key_v17_rebased.patch (application/octet-stream)
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index b05a9c2..5a436a1 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -2993,6 +2993,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition, one whose partition constraint the modified row
+    satisfies.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3285,9 +3290,20 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</> causes a row to move from one partition to
+       another, there is a chance that a concurrent <command>UPDATE</> or
+       <command>DELETE</> misses this row. Suppose that, during the row
+       movement, the row is still visible to the concurrent session, which is
+       about to perform an <command>UPDATE</> or <command>DELETE</> on the
+       same row. That DML operation can silently miss the row if the first
+       session deletes the row from the partition as part of its
+       <command>UPDATE</> row movement. In that case, the concurrent
+       <command>UPDATE</>/<command>DELETE</>, being unaware of the row
+       movement, concludes that the row has just been deleted, so there is
+       nothing to be done for it. In contrast, in the usual case where the
+       table is not partitioned, or where there is no row movement, the second
+       session would have identified the newly updated row and carried out the
+       <command>UPDATE</>/<command>DELETE</> on this new row version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 8a1619f..28cfc1a 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree in which this
+   row satisfies the partition constraint, then the row is moved to that
+   partition. If there is no such partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</> and
+   <command>INSERT</> operation. However, a concurrent
+   <command>UPDATE</> or <command>DELETE</> may miss this row. For details see
+   the section <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index 950245d..72300a0 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -160,6 +160,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it will be performed as a
+    <command>DELETE</command> from the original partition followed by
+    <command>INSERT</command> into the new partition. In this case, all
+    row-level <literal>BEFORE</> <command>UPDATE</command> triggers and all
+    row-level <literal>BEFORE</> <command>DELETE</command> triggers are fired
+    on the original partition. Then all row-level <literal>BEFORE</>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</>
+    triggers are concerned, <literal>AFTER</> <command>DELETE</command> and
+    <literal>AFTER</> <command>INSERT</command> triggers are applied; but
+    <literal>AFTER</> <command>UPDATE</command> triggers are not applied
+    because the <command>UPDATE</command> has been converted to a
+    <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index c6bd02f..7539dde 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -878,7 +878,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
 
 /*
  * map_partition_varattnos - maps varattno of any Vars in expr from the
- * parent attno to partition attno.
+ * attnos of 'from_rel' to the attnos of 'to_rel'.
+ * Either rel can be a leaf partition or a partitioned table.
  *
  * We must allow for cases where physical attnos of a partition can be
  * different from the parent's.
@@ -891,8 +892,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
  * are working on Lists, so it's less messy to do the casts internally.
  */
 List *
-map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row)
 {
 	AttrNumber *part_attnos;
@@ -901,14 +902,14 @@ map_partition_varattnos(List *expr, int target_varno,
 	if (expr == NIL)
 		return NIL;
 
-	part_attnos = convert_tuples_by_name_map(RelationGetDescr(partrel),
-											 RelationGetDescr(parent),
+	part_attnos = convert_tuples_by_name_map(RelationGetDescr(to_rel),
+											 RelationGetDescr(from_rel),
 											 gettext_noop("could not convert row type"));
 	expr = (List *) map_variable_attnos((Node *) expr,
-										target_varno, 0,
+										fromrel_varno, 0,
 										part_attnos,
-										RelationGetDescr(parent)->natts,
-										RelationGetForm(partrel)->reltype,
+										RelationGetDescr(from_rel)->natts,
+										RelationGetForm(to_rel)->reltype,
 										&my_found_whole_row);
 	if (found_whole_row)
 		*found_whole_row = my_found_whole_row;
@@ -2054,6 +2055,77 @@ error_exit:
 }
 
 /*
+ * For each column of rel which is in the partition key or which appears
+ * in an expression which is in the partition key, translate the attribute
+ * number of that column according to the given parent, and add the resulting
+ * column number to the bitmapset, offset as we frequently do by
+ * FirstLowInvalidHeapAttributeNumber.
+ */
+void
+pull_child_partition_columns(Bitmapset **bitmapset,
+							 Relation rel,
+							 Relation parent)
+{
+	PartitionKey key = RelationGetPartitionKey(rel);
+	int16		partnatts = get_partition_natts(key);
+	List	   *partexprs = get_partition_exprs(key);
+	ListCell   *lc;
+	Bitmapset  *child_keycols = NULL;
+	int			i;
+	AttrNumber *map;
+	int			child_keycol = -1;
+
+	/*
+	 * First, compute the complete set of partition columns for this rel. For
+	 * compatibility with the API exposed by pull_varattnos, we offset the
+	 * column numbers by FirstLowInvalidHeapAttributeNumber.
+	 */
+	for (i = 0; i < partnatts; i++)
+	{
+		AttrNumber	partattno = get_partition_col_attnum(key, i);
+
+		if (partattno != 0)
+			child_keycols =
+				bms_add_member(child_keycols,
+							   partattno - FirstLowInvalidHeapAttributeNumber);
+	}
+	foreach(lc, partexprs)
+	{
+		Node	   *expr = (Node *) lfirst(lc);
+
+		pull_varattnos(expr, 1, &child_keycols);
+	}
+
+	/*
+	 * Next, work out how to convert from the attribute numbers for the child
+	 * to the attribute numbers for the parent.
+	 */
+	map =
+		convert_tuples_by_name_map(RelationGetDescr(parent),
+								   RelationGetDescr(rel),
+								   gettext_noop("could not convert row type"));
+
+	/*
+	 * For each child key column we have identified, translate to the
+	 * corresponding parent key column.  Entry 0 in the map array corresponds
+	 * to attribute number 1, which corresponds to a bitmapset entry for 1 -
+	 * FirstLowInvalidHeapAttributeNumber.
+	 */
+	while ((child_keycol = bms_next_member(child_keycols, child_keycol)) >= 0)
+	{
+		int			kc = child_keycol + FirstLowInvalidHeapAttributeNumber;
+
+		Assert(kc > 0 && kc <= RelationGetNumberOfAttributes(rel));
+		*bitmapset =
+			bms_add_member(*bitmapset,
+						   map[kc - 1] - FirstLowInvalidHeapAttributeNumber);
+	}
+
+	/* Release memory. */
+	pfree(map);
+}
+
+/*
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfa3f05..4ac5bd6 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -168,7 +168,7 @@ typedef struct CopyStateData
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;	/* Number of entries in the above array */
 	int			num_partitions; /* Number of members in the following arrays */
-	ResultRelInfo *partitions;	/* Per partition result relation */
+	ResultRelInfo **partitions;	/* Per partition result relation pointers */
 	TupleConversionMap **partition_tupconv_maps;
 	TupleTableSlot *partition_tuple_slot;
 	TransitionCaptureState *transition_capture;
@@ -2446,13 +2446,15 @@ CopyFrom(CopyState cstate)
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo *partitions;
+		ResultRelInfo **partitions;
 		TupleConversionMap **partition_tupconv_maps;
 		TupleTableSlot *partition_tuple_slot;
 		int			num_parted,
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(cstate->rel,
+									   NULL,
+									   0,
 									   1,
 									   estate,
 									   &partition_dispatch_info,
@@ -2482,7 +2484,7 @@ CopyFrom(CopyState cstate)
 			for (i = 0; i < cstate->num_partitions; ++i)
 			{
 				cstate->transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(cstate->partitions[i].ri_RelationDesc),
+					convert_tuples_by_name(RelationGetDescr(cstate->partitions[i]->ri_RelationDesc),
 										   RelationGetDescr(cstate->rel),
 										   gettext_noop("could not convert row type"));
 			}
@@ -2616,7 +2618,7 @@ CopyFrom(CopyState cstate)
 			 * to the selected partition.
 			 */
 			saved_resultRelInfo = resultRelInfo;
-			resultRelInfo = cstate->partitions + leaf_part_index;
+			resultRelInfo = cstate->partitions[leaf_part_index];
 
 			/* We do not yet have a way to insert into a foreign partition */
 			if (resultRelInfo->ri_FdwRoutine)
@@ -2726,7 +2728,7 @@ CopyFrom(CopyState cstate)
 
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr || check_partition_constr)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
@@ -2846,7 +2848,7 @@ CopyFrom(CopyState cstate)
 		}
 		for (i = 0; i < cstate->num_partitions; i++)
 		{
-			ResultRelInfo *resultRelInfo = cstate->partitions + i;
+			ResultRelInfo *resultRelInfo = cstate->partitions[i];
 
 			ExecCloseIndices(resultRelInfo);
 			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index da0850b..6904c4e 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -67,15 +67,6 @@ int			SessionReplicationRole = SESSION_REPLICATION_ROLE_ORIGIN;
 /* How many levels deep into trigger execution are we? */
 static int	MyTriggerDepth = 0;
 
-/*
- * Note that similar macros also exist in executor/execMain.c.  There does not
- * appear to be any good header to put them into, given the structures that
- * they use, so we let them be duplicated.  Be sure to update all if one needs
- * to be changed, however.
- */
-#define GetUpdatedColumns(relinfo, estate) \
-	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)
-
 /* Local function prototypes */
 static void ConvertTriggerToFK(CreateTrigStmt *stmt, Oid funcoid);
 static void SetTriggerFlags(TriggerDesc *trigdesc, Trigger *trigger);
@@ -2903,8 +2894,13 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 	{
 		HeapTuple	trigtuple;
 
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
+		/*
+		 * Note: if the UPDATE is converted into a DELETE+INSERT as part of
+		 * update-partition-key operation, then this function is also called
+		 * separately for DELETE and INSERT to capture transition table rows.
+		 * In that case, either the old tuple or the new tuple can be NULL.
+		 */
+		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
 			trigtuple = GetTupleForTrigger(estate,
 										   NULL,
 										   relinfo,
@@ -5211,7 +5207,12 @@ AfterTriggerPendingOnRel(Oid relid)
  *	triggers actually need to be queued.  It is also called after each row,
  *	even if there are no triggers for that event, if there are any AFTER
  *	STATEMENT triggers for the statement which use transition tables, so that
- *	the transition tuplestores can be built.
+ *	the transition tuplestores can be built. Furthermore, if the transition
+ *	capture is happening for UPDATEd rows being moved to another partition
+ *	due to a partition-key change, then this function is called once when the
+ *	row is deleted (to capture the OLD row), and once when the row is inserted
+ *	into the new partition (to capture the NEW row). This is done separately
+ *	because the DELETE and INSERT happen on different tables.
  *
  *	Transition tuplestores are built now, rather than when events are pulled
  *	off of the queue because AFTER ROW triggers are allowed to select from the
@@ -5260,12 +5261,27 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		bool		update_new_table = transition_capture->tcs_update_new_table;
 		bool		insert_new_table = transition_capture->tcs_insert_new_table;;
 
-		if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_old_table))
+		/*
+		 * For capturing transition tuples for UPDATE events fired during
+		 * partition row movement, either oldtup or newtup can be NULL,
+		 * depending on whether the event is for a row being deleted from the
+		 * old partition or for a row being inserted into the new one. But
+		 * in any case, oldtup should always be non-NULL for DELETE events, and
+		 * newtup should be non-NULL for INSERT events, because for transition
+		 * capture with partition row movement, INSERT and DELETE events don't
+		 * fire; only UPDATE event is fired.
+		 */
+		Assert(!(event == TRIGGER_EVENT_DELETE && delete_old_table &&
+				 oldtup == NULL));
+		Assert(!(event == TRIGGER_EVENT_INSERT && insert_new_table &&
+				 newtup == NULL));
+
+		if (oldtup != NULL &&
+			((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_old_table)))
 		{
 			Tuplestorestate *old_tuplestore;
 
-			Assert(oldtup != NULL);
 			old_tuplestore = transition_capture->tcs_old_tuplestore;
 
 			if (map != NULL)
@@ -5278,12 +5294,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			else
 				tuplestore_puttuple(old_tuplestore, oldtup);
 		}
-		if ((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_new_table))
+		if (newtup != NULL &&
+			((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
+			(event == TRIGGER_EVENT_UPDATE && update_new_table)))
 		{
 			Tuplestorestate *new_tuplestore;
 
-			Assert(newtup != NULL);
 			if (event == TRIGGER_EVENT_INSERT)
 				new_tuplestore = transition_capture->tcs_insert_tuplestore;
 			else
@@ -5306,7 +5322,8 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		if (trigdesc == NULL ||
 			(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
 			(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+			(event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup == NULL)))
 			return;
 	}
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 4b594d4..1508f72 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -65,6 +65,18 @@
 #include "utils/snapmgr.h"
 #include "utils/tqual.h"
 
+/*
+ * Entry of a temporary hash table. During UPDATE tuple routing, we want to
+ * know which of the leaf partitions are present in the UPDATE per-subplan
+ * resultRelInfo array (ModifyTableState->resultRelInfo[]). This hash table
+ * is searchable by the oids of the subplan result rels.
+ */
+typedef struct ResultRelOidsEntry
+{
+	Oid			rel_oid;
+	ResultRelInfo *resultRelInfo;
+} ResultRelOidsEntry;
+
 
 /* Hooks for plugins to get control in ExecutorStart/Run/Finish/End */
 ExecutorStart_hook_type ExecutorStart_hook = NULL;
@@ -104,19 +116,6 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
 									 int maxfieldlen);
 static void EvalPlanQualStart(EPQState *epqstate, EState *parentestate,
 				  Plan *planTree);
-static void ExecPartitionCheck(ResultRelInfo *resultRelInfo,
-				   TupleTableSlot *slot, EState *estate);
-
-/*
- * Note that GetUpdatedColumns() also exists in commands/trigger.c.  There does
- * not appear to be any good header to put it into, given the structures that
- * it uses, so we let them be duplicated.  Be sure to update both if one needs
- * to be changed, however.
- */
-#define GetInsertedColumns(relinfo, estate) \
-	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->insertedCols)
-#define GetUpdatedColumns(relinfo, estate) \
-	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)
 
 /* end of local decls */
 
@@ -1850,15 +1849,10 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
 /*
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  */
-static void
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
-	Relation	rel = resultRelInfo->ri_RelationDesc;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	Bitmapset  *modifiedCols;
-	Bitmapset  *insertedCols;
-	Bitmapset  *updatedCols;
 	ExprContext *econtext;
 
 	/*
@@ -1886,52 +1880,66 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
 	 */
-	if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+	return ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext);
+}
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
+
+	/* See the comments in ExecConstraints. */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
+									 gettext_noop("could not convert row type"));
+		if (map != NULL)
 		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
-										 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecSetSlotDescriptor(slot, tupdesc);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+			tuple = do_convert_tuple(tuple, map);
+			ExecSetSlotDescriptor(slot, tupdesc);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("new row for relation \"%s\" violates partition constraint",
-						RelationGetRelationName(orig_rel)),
-				 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+			 errmsg("new row for relation \"%s\" violates partition constraint",
+					RelationGetRelationName(orig_rel)),
+			 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if requested,
+ * checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1939,7 +1947,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2055,8 +2064,9 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 }
 
 
@@ -3242,34 +3252,40 @@ EvalPlanQualEnd(EPQState *epqstate)
  * ExecSetupPartitionTupleRouting - set up information needed during
  * tuple routing for partitioned tables
  *
+ * 'update_rri' has the UPDATE per-subplan result rels.
+ * 'num_update_rri' : number of UPDATE per-subplan result rels. For INSERT,
+ *      this is 0.
+ *
  * Output arguments:
  * 'pd' receives an array of PartitionDispatch objects with one entry for
  *		every partitioned table in the partition tree
- * 'partitions' receives an array of ResultRelInfo objects with one entry for
+ * 'partitions' receives an array of ResultRelInfo* objects with one entry for
  *		every leaf partition in the partition tree
- * 'tup_conv_maps' receives an array of TupleConversionMap objects with one
- *		entry for every leaf partition (required to convert input tuple based
- *		on the root table's rowtype to a leaf partition's rowtype after tuple
- *		routing is done)
+ * 'perleaf_parentchild_maps' receives an array of TupleConversionMap objects
+ *		with one entry for every leaf partition (required to convert input tuple
+ *		based on the root table's rowtype to a leaf partition's rowtype after
+ *		tuple routing is done)
  * 'partition_tuple_slot' receives a standalone TupleTableSlot to be used
  *		to manipulate any given leaf partition's rowtype after that partition
  *		is chosen by tuple-routing.
  * 'num_parted' receives the number of partitioned tables in the partition
  *		tree (= the number of entries in the 'pd' output array)
  * 'num_partitions' receives the number of leaf partitions in the partition
- *		tree (= the number of entries in the 'partitions' and 'tup_conv_maps'
- *		output arrays
+ *		tree (= the number of entries in the 'partitions' and
+ *		'perleaf_parentchild_maps' output arrays)
  *
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  */
 void
 ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionDispatch **pd,
-							   ResultRelInfo **partitions,
-							   TupleConversionMap ***tup_conv_maps,
+							   ResultRelInfo ***partitions,
+							   TupleConversionMap ***perleaf_parentchild_maps,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions)
 {
@@ -3277,7 +3293,10 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
+	HTAB	   *result_rel_oids = NULL;
+	HASHCTL		ctl;
+	ResultRelOidsEntry *hash_entry;
+	ResultRelInfo *leaf_part_arr;
 
 	/*
 	 * Get the information about the partition tree after locking all the
@@ -3286,10 +3305,50 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
 	*pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
 	*num_partitions = list_length(leaf_parts);
-	*partitions = (ResultRelInfo *) palloc(*num_partitions *
-										   sizeof(ResultRelInfo));
-	*tup_conv_maps = (TupleConversionMap **) palloc0(*num_partitions *
-													 sizeof(TupleConversionMap *));
+	*partitions = (ResultRelInfo **) palloc(*num_partitions *
+											sizeof(ResultRelInfo *));
+	*perleaf_parentchild_maps = (TupleConversionMap **) palloc0(*num_partitions *
+																sizeof(TupleConversionMap *));
+
+	/*
+	 * For UPDATEs, if a leaf partition is already present in the per-subplan
+	 * result rels, we re-use it rather than initialize a new result rel. To
+	 * find whether a given leaf partition already has a result rel, we build
+	 * a hash table for looking up the per-subplan result rels by oid.
+	 */
+	if (num_update_rri != 0)
+	{
+		ResultRelInfo	   *resultRelInfo;
+
+		memset(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(Oid);
+		ctl.entrysize = sizeof(ResultRelOidsEntry);
+		ctl.hcxt = CurrentMemoryContext;
+		result_rel_oids = hash_create("result_rel_oids temporary hash",
+								32, /* start small and extend */
+								&ctl,
+								HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+		resultRelInfo = update_rri;
+		for (i = 0; i < num_update_rri; i++, resultRelInfo++)
+		{
+			Oid reloid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+
+			hash_entry = hash_search(result_rel_oids, &reloid,
+									 HASH_ENTER, NULL);
+			hash_entry->resultRelInfo = resultRelInfo;
+		}
+	}
+	else
+	{
+		/*
+		 * For inserts, we need to create all new result rels, so avoid
+		 * repeated pallocs by allocating memory for all the result rels in
+		 * bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(*num_partitions *
+												  sizeof(ResultRelInfo));
+	}
 
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
@@ -3299,36 +3358,76 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	 */
 	*partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = *partitions;
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (num_update_rri != 0)
+		{
+			/*
+			 * If this leaf partition is already present in the per-subplan
+			 * resultRelInfos, re-use that resultRelInfo along with its
+			 * already-opened relation; otherwise create a new result rel.
+			 */
+			hash_entry = hash_search(result_rel_oids, &leaf_oid,
+									 HASH_FIND, NULL);
+			if (hash_entry != NULL)
+			{
+				leaf_part_rri = hash_entry->resultRelInfo;
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required when converting the tuple to the root
+				 * partition's tuple descriptor. It was not set when the
+				 * update plans were generated.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = leaf_part_arr + i;
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in *partitions are eventually
-		 * closed by the caller.
+		 * If we didn't open the partition rel above, then we haven't
+		 * initialized its result rel either.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above including the leaf
+			 * partitions. Note that each of the newly opened relations in
+			 * *partitions are eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri,
+							  partrel,
+							  resultRTindex,
+							  rel,
+							  estate->es_instrument);
+		}
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
 		 * Save a tuple conversion map to convert a tuple routed to this
 		 * partition from the parent's type to the partition's.
 		 */
-		(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
-													 gettext_noop("could not convert row type"));
-
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  resultRTindex,
-						  rel,
-						  estate->es_instrument);
+		(*perleaf_parentchild_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
+																gettext_noop("could not convert row type"));
 
 		/*
-		 * Verify result relation is a valid target for INSERT.
+		 * Verify the result relation is a valid target for INSERT. Even for
+		 * UPDATEs, we are doing this as part of tuple routing, so here too we
+		 * need to check validity for an INSERT.
 		 */
 		CheckValidResultRel(leaf_part_rri, CMD_INSERT);
 
@@ -3344,9 +3443,12 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		leaf_part_rri++;
+		(*partitions)[i] = leaf_part_rri;
 		i++;
 	}
+
+	if (result_rel_oids != NULL)
+		hash_destroy(result_rel_oids);
 }
 
 /*
@@ -3372,8 +3474,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 	result = get_partition_for_tuple(pd, slot, estate,
 									 &failed_at, &failed_slot);
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index fbb8108..47afe09 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -401,7 +401,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -466,7 +466,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index bd84778..ecf51db 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -45,6 +45,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -53,7 +54,6 @@
 #include "utils/rel.h"
 #include "utils/tqual.h"
 
-
 static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 ResultRelInfo *resultRelInfo,
 					 ItemPointer conflictTid,
@@ -240,6 +240,36 @@ ExecCheckTIDVisible(EState *estate,
 	ReleaseBuffer(buffer);
 }
 
+/*
+ * ConvertPartitionTupleSlot -- convenience function for converting a tuple
+ * and storing it into the tuple slot provided through 'new_slot', which
+ * typically should be one of the dedicated partition tuple slots. Passes the
+ * partition tuple slot back through the output param p_old_slot. If no
+ * mapping is present, leaves p_old_slot unchanged.
+ *
+ * Returns the converted tuple.
+ */
+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate, TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot, TupleTableSlot **p_old_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor, as per converted tuple.
+	 */
+	*p_old_slot = new_slot;
+	Assert(new_slot != NULL);
+	ExecSetSlotDescriptor(new_slot, map->outdesc);
+	ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
 /* ----------------------------------------------------------------
  *		ExecInsert
  *
@@ -281,17 +311,50 @@ ExecInsert(ModifyTableState *mtstate,
 	if (mtstate->mt_partition_dispatch_info)
 	{
 		int			leaf_part_index;
-		TupleConversionMap *map;
+		ResultRelInfo *rootResultRelInfo;
+
+		/*
+		 * If the original operation is UPDATE, the root partition rel needs
+		 * to be fetched from mtstate->rootResultRelInfo.
+		 */
+		rootResultRelInfo = (mtstate->rootResultRelInfo ?
+							 mtstate->rootResultRelInfo : resultRelInfo);
+
+		/*
+		 * If the resultRelInfo is not the root partition (which happens for
+		 * UPDATE), we should convert the tuple to the root partition's tuple
+		 * descriptor, since ExecFindPartition() starts the search from root.
+		 * The tuple conversion map list is in the order of
+		 * mtstate->resultRelInfo[], so to retrieve the one for this
+		 * resultRel, we need to know the position of the resultRel in
+		 * mtstate->resultRelInfo[]. Note: We assume that if the resultRelInfo
+		 * does not belong to subplans, then it already matches the root tuple
+		 * descriptor; although there is no such known scenario where this
+		 * could happen.
+		 */
+		if (rootResultRelInfo != resultRelInfo &&
+			mtstate->mt_persubplan_childparent_maps != NULL &&
+			resultRelInfo >= mtstate->resultRelInfo &&
+			resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1)
+		{
+			int			map_index = resultRelInfo - mtstate->resultRelInfo;
+
+			tuple = ConvertPartitionTupleSlot(mtstate,
+											  mtstate->mt_persubplan_childparent_maps[map_index],
+											  tuple,
+											  mtstate->mt_rootpartition_tuple_slot,
+											  &slot);
+		}
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
 		 * ExecFindPartition() does not return and errors out instead.
 		 * Otherwise, the returned value is to be used as an index into arrays
-		 * mt_partitions[] and mt_partition_tupconv_maps[] that will get us
+		 * mt_partitions[] and mt_perleaf_parentchild_maps[] that will get us
 		 * the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
-		leaf_part_index = ExecFindPartition(resultRelInfo,
+		leaf_part_index = ExecFindPartition(rootResultRelInfo,
 											mtstate->mt_partition_dispatch_info,
 											slot,
 											estate);
@@ -303,7 +366,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * the selected partition.
 		 */
 		saved_resultRelInfo = resultRelInfo;
-		resultRelInfo = mtstate->mt_partitions + leaf_part_index;
+		resultRelInfo = mtstate->mt_partitions[leaf_part_index];
 
 		/* We do not yet have a way to insert into a foreign partition */
 		if (resultRelInfo->ri_FdwRoutine)
@@ -331,7 +394,7 @@ ExecInsert(ModifyTableState *mtstate,
 				 */
 				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
 				mtstate->mt_transition_capture->tcs_map =
-					mtstate->mt_transition_tupconv_maps[leaf_part_index];
+					mtstate->mt_perleaf_childparent_maps[leaf_part_index];
 			}
 			else
 			{
@@ -348,23 +411,11 @@ ExecInsert(ModifyTableState *mtstate,
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = mtstate->mt_partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = mtstate->mt_partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(mtstate,
+										  mtstate->mt_perleaf_parentchild_maps[leaf_part_index],
+										  tuple,
+										  mtstate->mt_partition_tuple_slot,
+										  &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -482,7 +533,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 		/* Check the constraints of the tuple */
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -622,6 +673,19 @@ ExecInsert(ModifyTableState *mtstate,
 	ExecARInsertTriggers(estate, resultRelInfo, tuple, recheckIndexes,
 						 mtstate->mt_transition_capture);
 
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition NEW TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
+		ExecARUpdateTriggers(estate, resultRelInfo, NULL,
+							 NULL,
+							 tuple,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
 	list_free(recheckIndexes);
 
 	/*
@@ -674,6 +738,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool *delete_skipped,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -682,6 +748,9 @@ ExecDelete(ModifyTableState *mtstate,
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
 
+	if (delete_skipped)
+		*delete_skipped = true;
+
 	/*
 	 * get information on the (current) result relation
 	 */
@@ -845,12 +914,29 @@ ldelete:;
 	if (canSetTag)
 		(estate->es_processed)++;
 
+	/* The delete has actually happened, so inform that to the caller */
+	if (delete_skipped)
+		*delete_skipped = false;
+
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
 						 mtstate->mt_transition_capture);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition OLD TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
+		ExecARUpdateTriggers(estate, resultRelInfo, tupleid,
+							 oldtuple,
+							 NULL,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -943,6 +1029,8 @@ ExecUpdate(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	TupleConversionMap *saved_tcs_map = NULL;
+
 
 	/*
 	 * abort the operation if not running transactions
@@ -1039,12 +1127,82 @@ lreplace:;
 								 resultRelInfo, slot, estate);
 
 		/*
+		 * If a partition check fails, try to move the row into the right
+		 * partition.
+		 */
+		if (resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate))
+		{
+			bool		delete_skipped;
+			TupleTableSlot *ret_slot;
+
+			/*
+			 * When an UPDATE is run directly on a leaf partition, we would
+			 * not have partition tuple routing set up. In that case, fail
+			 * with a partition constraint violation error.
+			 */
+			if (mtstate->mt_partition_dispatch_info == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for DELETE. We want to return rows
+			 * from INSERT.
+			 */
+			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
+					   &delete_skipped, false, false);
+
+			/*
+			 * If for some reason the DELETE didn't happen (e.g. a trigger
+			 * prevented it, or it was already deleted by self, or it was
+			 * concurrently deleted by another transaction), then we should
+			 * skip the INSERT as well; otherwise there would effectively be
+			 * one new row inserted.
+			 *
+			 * For a normal UPDATE, the case where the tuple has been the
+			 * subject of a concurrent UPDATE or DELETE would be handled by
+			 * the EvalPlanQual machinery, but for an UPDATE that we've
+			 * translated into a DELETE from this partition and an INSERT into
+			 * some other partition, that's not available, because CTID chains
+			 * can't span relation boundaries.  We mimic the semantics to a
+			 * limited extent by skipping the INSERT if the DELETE fails to
+			 * find a tuple. This ensures that two concurrent attempts to
+			 * UPDATE the same tuple at the same time can't turn one tuple
+			 * into two, and that an UPDATE of a just-deleted tuple can't
+			 * resurrect it.
+			 */
+			if (delete_skipped)
+				return NULL;
+
+			if (mtstate->mt_transition_capture)
+				saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
+
+			ret_slot = ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+
+			if (mtstate->mt_transition_capture)
+			{
+				/*
+				 * Revert back to the transition capture map created for
+				 * UPDATE; otherwise the next UPDATE will incorrectly use the
+				 * one created for INSERT.
+				 */
+				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+				mtstate->mt_transition_capture->tcs_map = saved_tcs_map;
+			}
+			return ret_slot;
+		}
+
+		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We have already checked partition constraints above, so skip them
+		 * below.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1463,6 +1621,45 @@ fireASTriggers(ModifyTableState *node)
 }
 
 /*
+ * Set up per subplan tuple conversion map from child partition to root
+ * partitioned table. The map is needed for collecting transition tuples for
+ * AFTER triggers, and for UPDATE row movement.
+ */
+static void
+ExecSetupPerSubplanChildParentMap(ModifyTableState *mtstate)
+{
+	TupleConversionMap **tup_conv_maps;
+	TupleDesc	outdesc;
+	ResultRelInfo *resultRelInfo;
+	ResultRelInfo *rootRelInfo;
+	int			nplans = mtstate->mt_nplans;
+	int			i;
+
+	Assert(mtstate->operation != CMD_INSERT);
+
+	if (mtstate->mt_persubplan_childparent_maps != NULL)
+		return;
+
+	rootRelInfo = getASTriggerResultRelInfo(mtstate);
+
+	mtstate->mt_persubplan_childparent_maps =
+		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) * nplans);
+
+	/* Get tuple descriptor of the root partition. */
+	outdesc = RelationGetDescr(rootRelInfo->ri_RelationDesc);
+
+	resultRelInfo = mtstate->resultRelInfo;
+	tup_conv_maps = mtstate->mt_persubplan_childparent_maps;
+	for (i = 0; i < nplans; i++)
+	{
+		TupleDesc	indesc = RelationGetDescr(resultRelInfo[i].ri_RelationDesc);
+
+		tup_conv_maps[i] = convert_tuples_by_name(indesc, outdesc,
+												  gettext_noop("could not convert row type"));
+	}
+}
+
+/*
  * Set up the state needed for collecting transition tuples for AFTER
  * triggers.
  */
@@ -1470,63 +1667,115 @@ static void
 ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 {
 	ResultRelInfo *targetRelInfo = getASTriggerResultRelInfo(mtstate);
+	ResultRelInfo **resultRelInfos;
+	int			numResultRelInfos;
+	int			update_rri_index = -1;
+	ResultRelInfo *update_rri = mtstate->resultRelInfo;
+	Oid			cur_reloid = InvalidOid;
 	int			i;
 
 	/* Check for transition tables on the directly targeted relation. */
 	mtstate->mt_transition_capture =
 		MakeTransitionCaptureState(targetRelInfo->ri_TrigDesc);
 
+	if (mtstate->mt_transition_capture == NULL)
+		return;
+
 	/*
-	 * If we found that we need to collect transition tuples then we may also
+	 * Now that we know that we need to collect transition tuples, we may also
 	 * need tuple conversion maps for any children that have TupleDescs that
 	 * aren't compatible with the tuplestores.
 	 */
-	if (mtstate->mt_transition_capture != NULL)
+
+	/* Make sure per-subplan mapping is there. */
+	if (mtstate->operation != CMD_INSERT)
+		ExecSetupPerSubplanChildParentMap(mtstate);
+
+	/*
+	 * Install the conversion map for the first plan for UPDATE and DELETE
+	 * operations.  It will be advanced each time we switch to the next plan.
+	 * (INSERT operations set it every time.)
+	 */
+	if (mtstate->mt_persubplan_childparent_maps)
+	{
+		mtstate->mt_transition_capture->tcs_map =
+			mtstate->mt_persubplan_childparent_maps[0];
+	}
+
+	/* If no tuple routing, return without setting up per-leaf-partition map */
+	if (mtstate->mt_partition_dispatch_info == NULL)
+		return;
+
+	numResultRelInfos = mtstate->mt_num_partitions;
+	resultRelInfos = mtstate->mt_partitions;
+
+	/*
+	 * Build array of conversion maps from each child's TupleDesc to the one
+	 * used in the tuplestore.  The map pointers may be NULL when no
+	 * conversion is necessary, which is hopefully a common case for
+	 * partitions.
+	 */
+	mtstate->mt_perleaf_childparent_maps = (TupleConversionMap **)
+		palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+
+	/* For Inserts, just create all new map elements. */
+	if (mtstate->operation == CMD_INSERT)
 	{
-		ResultRelInfo *resultRelInfos;
-		int			numResultRelInfos;
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_perleaf_childparent_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
+									   RelationGetDescr(targetRelInfo->ri_RelationDesc),
+									   gettext_noop("could not convert row type"));
+		}
+		return;
+	}
 
-		/* Find the set of partitions so that we can find their TupleDescs. */
-		if (mtstate->mt_partition_dispatch_info != NULL)
+	/*
+	 * But for Updates, we can share the per-subplan maps with the per-leaf
+	 * maps.
+	 */
+	update_rri_index = 0;
+	update_rri = mtstate->resultRelInfo;
+	if (mtstate->mt_nplans > 0)
+		cur_reloid = RelationGetRelid(update_rri[0].ri_RelationDesc);
+
+	for (i = 0; i < numResultRelInfos; ++i)
+	{
+		ResultRelInfo *resultRelInfo = mtstate->mt_partitions[i];
+
+		/* Is this leaf partition present in the UPDATE result rels? */
+		if (cur_reloid == RelationGetRelid(resultRelInfo->ri_RelationDesc))
 		{
+			Assert(update_rri_index < mtstate->mt_nplans);
+
+			mtstate->mt_perleaf_childparent_maps[i] =
+				mtstate->mt_persubplan_childparent_maps[update_rri_index];
+			update_rri_index++;
+
 			/*
-			 * For INSERT via partitioned table, so we need TupleDescs based
-			 * on the partition routing table.
+			 * If this was the last UPDATE resultrel, indicate that by
+			 * invalidating the cur_reloid.
 			 */
-			resultRelInfos = mtstate->mt_partitions;
-			numResultRelInfos = mtstate->mt_num_partitions;
+			if (update_rri_index == mtstate->mt_nplans)
+				cur_reloid = InvalidOid;
+			else
+				cur_reloid = RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc);
 		}
 		else
 		{
-			/* Otherwise we need the ResultRelInfo for each subplan. */
-			resultRelInfos = mtstate->resultRelInfo;
-			numResultRelInfos = mtstate->mt_nplans;
-		}
-
-		/*
-		 * Build array of conversion maps from each child's TupleDesc to the
-		 * one used in the tuplestore.  The map pointers may be NULL when no
-		 * conversion is necessary, which is hopefully a common case for
-		 * partitions.
-		 */
-		mtstate->mt_transition_tupconv_maps = (TupleConversionMap **)
-			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
-		for (i = 0; i < numResultRelInfos; ++i)
-		{
-			mtstate->mt_transition_tupconv_maps[i] =
-				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
+			mtstate->mt_perleaf_childparent_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfo->ri_RelationDesc),
 									   RelationGetDescr(targetRelInfo->ri_RelationDesc),
 									   gettext_noop("could not convert row type"));
 		}
-
-		/*
-		 * Install the conversion map for the first plan for UPDATE and DELETE
-		 * operations.  It will be advanced each time we switch to the next
-		 * plan.  (INSERT operations set it every time.)
-		 */
-		mtstate->mt_transition_capture->tcs_map =
-			mtstate->mt_transition_tupconv_maps[0];
 	}
+
+	/*
+	 * We should have found all the per-subplan reloids in the leaf
+	 * partitions.
+	 */
+	Assert(update_rri_index == mtstate->mt_nplans);
 }
 
 /* ----------------------------------------------------------------
@@ -1632,9 +1881,9 @@ ExecModifyTable(PlanState *pstate)
 				if (node->mt_transition_capture != NULL)
 				{
 					/* Prepare to convert transition tuples from this child. */
-					Assert(node->mt_transition_tupconv_maps != NULL);
+					Assert(node->mt_persubplan_childparent_maps != NULL);
 					node->mt_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						node->mt_persubplan_childparent_maps[node->mt_whichplan];
 				}
 				continue;
 			}
@@ -1750,7 +1999,8 @@ ExecModifyTable(PlanState *pstate)
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1795,9 +2045,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ResultRelInfo *resultRelInfo;
 	TupleDesc	tupDesc;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	bool		update_tuple_routing_needed = node->part_cols_updated;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -1870,6 +2123,15 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(resultRelInfo, mtstate->mt_onconflict != ONCONFLICT_NONE);
 
+		/*
+		 * If this is an UPDATE and a BEFORE UPDATE trigger is present, we may
+		 * need to do update tuple routing.
+		 */
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+			operation == CMD_UPDATE)
+			update_tuple_routing_needed = true;
+
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
 		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
@@ -1907,33 +2169,63 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/* Decide whether we need to perform update tuple routing. */
+	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		update_tuple_routing_needed = false;
+
+	/*
+	 * Build state for tuple routing if it's an INSERT or if it's an UPDATE of
+	 * partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || update_tuple_routing_needed))
 	{
 		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo *partitions;
-		TupleConversionMap **partition_tupconv_maps;
+		ResultRelInfo **partitions;
+		TupleConversionMap **perleaf_parentchild_maps;
 		TupleTableSlot *partition_tuple_slot;
 		int			num_parted,
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(rel,
+									   (operation == CMD_UPDATE ?
+										mtstate->resultRelInfo : NULL),
+									   (operation == CMD_UPDATE ? nplans : 0),
 									   node->nominalRelation,
 									   estate,
 									   &partition_dispatch_info,
 									   &partitions,
-									   &partition_tupconv_maps,
+									   &perleaf_parentchild_maps,
 									   &partition_tuple_slot,
 									   &num_parted, &num_partitions);
 		mtstate->mt_partition_dispatch_info = partition_dispatch_info;
 		mtstate->mt_num_dispatch = num_parted;
 		mtstate->mt_partitions = partitions;
 		mtstate->mt_num_partitions = num_partitions;
-		mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
+		mtstate->mt_perleaf_parentchild_maps = perleaf_parentchild_maps;
 		mtstate->mt_partition_tuple_slot = partition_tuple_slot;
+		mtstate->mt_rootpartition_tuple_slot = MakeTupleTableSlot();
+
+		/*
+		 * The following are required as reference objects for mapping
+		 * partition attnos in expressions such as WCO and RETURNING.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	}
 
+	/*
+	 * Construct a mapping from each of the per-subplan resultRelInfo attnos
+	 * to the root attnos. This is required when, during update row movement,
+	 * the tuple descriptor of a source partition does not match the root
+	 * partition's descriptor. In such a case we need to convert tuples to the
+	 * root partition's tuple descriptor, because the search for the
+	 * destination partition starts from the root. Skip this setup if it's not
+	 * a partition-key update.
+	 */
+	if (update_tuple_routing_needed)
+		ExecSetupPerSubplanChildParentMap(mtstate);
+
 	/* Build state for collecting transition tuples */
 	ExecSetupTransitionCaptureState(mtstate, estate);
 
@@ -1967,50 +2259,54 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE row
+	 * movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *firstWco;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition.  For UPDATE, however, there are as many WCO lists as
+		 * there are plans.  In either case, use the WCO expression of the
+		 * first resultRelInfo as a reference to calculate attnos for the WCO
+		 * expression of each of the partitions. We make a copy of the WCO
+		 * qual for each partition. Note that, if there are SubPlans in there,
+		 * they all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
-			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
-		resultRelInfo = mtstate->mt_partitions;
+		Assert(update_tuple_routing_needed ||
+			   (operation == CMD_INSERT &&
+				list_length(node->withCheckOptionLists) == 1 &&
+				mtstate->mt_nplans == 1));
+
+		firstWco = linitial(node->withCheckOptionLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-			List	   *mapped_wcoList;
+			Relation	partrel;
+			List	   *mappedWco;
 			List	   *wcoExprs = NIL;
 			ListCell   *ll;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel, NULL);
-			foreach(ll, mapped_wcoList)
+			resultRelInfo = mtstate->mt_partitions[i];
+
+			partrel = resultRelInfo->ri_RelationDesc;
+			mappedWco = map_partition_varattnos(firstWco,
+												firstVarno,
+												partrel, firstResultRel,
+												NULL);
+			foreach(ll, mappedWco)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
 
-			resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
+			resultRelInfo->ri_WithCheckOptions = mappedWco;
 			resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
-			resultRelInfo++;
 		}
 	}
 
@@ -2021,7 +2317,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -2058,20 +2354,26 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		resultRelInfo = mtstate->mt_partitions;
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
+			Relation	partrel;
 			List	   *rlist;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel, NULL);
+			resultRelInfo = mtstate->mt_partitions[i];
+			partrel = resultRelInfo->ri_RelationDesc;
+
+			/*
+			 * Use the returning expression of the first resultRelInfo as a
+			 * reference to calculate attno's for the returning expression of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList,
+											firstVarno,
+											partrel, firstResultRel, NULL);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
@@ -2317,6 +2619,7 @@ void
 ExecEndModifyTable(ModifyTableState *node)
 {
 	int			i;
+	CmdType		operation = node->operation;
 
 	/* Free transition tables */
 	if (node->mt_transition_capture != NULL)
@@ -2353,13 +2656,25 @@ ExecEndModifyTable(ModifyTableState *node)
 	}
 	for (i = 0; i < node->mt_num_partitions; i++)
 	{
-		ResultRelInfo *resultRelInfo = node->mt_partitions + i;
+		ResultRelInfo *resultRelInfo = node->mt_partitions[i];
+
+		/*
+		 * If this result rel is one of the subplan result rels, let
+		 * ExecEndPlan() close it.  For INSERTs, this does not apply because
+		 * all leaf partition result rels are newly allocated anyway.
+		 */
+		if (operation == CMD_UPDATE &&
+			resultRelInfo >= node->resultRelInfo &&
+			resultRelInfo < node->resultRelInfo + node->mt_nplans)
+			continue;
 
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
 
-	/* Release the standalone partition tuple descriptor, if any */
+	/* Release the standalone partition tuple descriptors, if any */
+	if (node->mt_rootpartition_tuple_slot)
+		ExecDropSingleTupleTableSlot(node->mt_rootpartition_tuple_slot);
 	if (node->mt_partition_tuple_slot)
 		ExecDropSingleTupleTableSlot(node->mt_partition_tuple_slot);
 
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 9bae264..3cdbd97 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -204,6 +204,7 @@ _copyModifyTable(const ModifyTable *from)
 	COPY_SCALAR_FIELD(canSetTag);
 	COPY_SCALAR_FIELD(nominalRelation);
 	COPY_NODE_FIELD(partitioned_rels);
+	COPY_SCALAR_FIELD(part_cols_updated);
 	COPY_NODE_FIELD(resultRelations);
 	COPY_SCALAR_FIELD(resultRelIndex);
 	COPY_SCALAR_FIELD(rootResultRelIndex);
@@ -2260,6 +2261,7 @@ _copyPartitionedChildRelInfo(const PartitionedChildRelInfo *from)
 
 	COPY_SCALAR_FIELD(parent_relid);
 	COPY_NODE_FIELD(child_rels);
+	COPY_BITMAPSET_FIELD(all_part_cols);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 11731da..a410e46 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -909,6 +909,7 @@ _equalPartitionedChildRelInfo(const PartitionedChildRelInfo *a, const Partitione
 {
 	COMPARE_SCALAR_FIELD(parent_relid);
 	COMPARE_NODE_FIELD(child_rels);
+	COMPARE_BITMAPSET_FIELD(all_part_cols);
 
 	return true;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 9ee3e23..f642bf2 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -367,6 +367,7 @@ _outModifyTable(StringInfo str, const ModifyTable *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_INT_FIELD(resultRelIndex);
 	WRITE_INT_FIELD(rootResultRelIndex);
@@ -2096,6 +2097,7 @@ _outModifyTablePath(StringInfo str, const ModifyTablePath *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_NODE_FIELD(subroots);
@@ -2518,6 +2520,7 @@ _outPartitionedChildRelInfo(StringInfo str, const PartitionedChildRelInfo *node)
 
 	WRITE_UINT_FIELD(parent_relid);
 	WRITE_NODE_FIELD(child_rels);
+	WRITE_BITMAPSET_FIELD(all_part_cols);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 67b9e19..89dd3cf 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1562,6 +1562,7 @@ _readModifyTable(void)
 	READ_BOOL_FIELD(canSetTag);
 	READ_UINT_FIELD(nominalRelation);
 	READ_NODE_FIELD(partitioned_rels);
+	READ_BOOL_FIELD(part_cols_updated);
 	READ_NODE_FIELD(resultRelations);
 	READ_INT_FIELD(resultRelIndex);
 	READ_INT_FIELD(rootResultRelIndex);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 2d7e1d8..8c08d50 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1291,7 +1291,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	rte = planner_rt_fetch(rel->relid, root);
 	if (rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		partitioned_rels = get_partitioned_child_rels(root, rel->relid);
+		partitioned_rels = get_partitioned_child_rels(root, rel->relid, NULL);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
 	}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 2821662..85e3126 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -277,6 +277,7 @@ static ProjectSet *make_project_set(List *tlist, Plan *subplan);
 static ModifyTable *make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
@@ -2361,6 +2362,7 @@ create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 							best_path->canSetTag,
 							best_path->nominalRelation,
 							best_path->partitioned_rels,
+							best_path->part_cols_updated,
 							best_path->resultRelations,
 							subplans,
 							best_path->withCheckOptionLists,
@@ -6405,6 +6407,7 @@ static ModifyTable *
 make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam)
@@ -6431,6 +6434,7 @@ make_modifytable(PlannerInfo *root,
 	node->canSetTag = canSetTag;
 	node->nominalRelation = nominalRelation;
 	node->partitioned_rels = partitioned_rels;
+	node->part_cols_updated = part_cols_updated;
 	node->resultRelations = resultRelations;
 	node->resultRelIndex = -1;	/* will be set correctly in setrefs.c */
 	node->rootResultRelIndex = -1;	/* will be set correctly in setrefs.c */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 6b79b3a..68e0302 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1056,6 +1056,7 @@ inheritance_planner(PlannerInfo *root)
 	Index		rti;
 	RangeTblEntry *parent_rte;
 	List	   *partitioned_rels = NIL;
+	bool		part_cols_updated = false;
 
 	Assert(parse->commandType != CMD_INSERT);
 
@@ -1370,9 +1371,15 @@ inheritance_planner(PlannerInfo *root)
 
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		partitioned_rels = get_partitioned_child_rels(root, parentRTindex);
+		Bitmapset  *all_part_cols = NULL;
+
+		partitioned_rels = get_partitioned_child_rels(root, parentRTindex,
+													  &all_part_cols);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
+
+		if (bms_overlap(all_part_cols, parent_rte->updatedCols))
+			part_cols_updated = true;
 	}
 
 	/* Result path must go into outer query's FINAL upperrel */
@@ -1429,6 +1436,7 @@ inheritance_planner(PlannerInfo *root)
 									 parse->canSetTag,
 									 nominalRelation,
 									 partitioned_rels,
+									 part_cols_updated,
 									 resultRelations,
 									 subpaths,
 									 subroots,
@@ -2046,6 +2054,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 										parse->canSetTag,
 										parse->resultRelation,
 										NIL,
+										false,
 										list_make1_int(parse->resultRelation),
 										list_make1(path),
 										list_make1(root),
@@ -6076,10 +6085,15 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
  *		Returns a list of the RT indexes of the partitioned child relations
  *		with rti as the root parent RT index.
  *
+ * If all_part_cols_p is non-NULL, *all_part_cols_p is set to a bitmapset
+ * of all partitioning columns used by the partitioned table or any
+ * descendant.
+ *
  * Note: Only call this function on RTEs known to be partitioned tables.
  */
 List *
-get_partitioned_child_rels(PlannerInfo *root, Index rti)
+get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   Bitmapset **all_part_cols_p)
 {
 	List	   *result = NIL;
 	ListCell   *l;
@@ -6091,6 +6105,8 @@ get_partitioned_child_rels(PlannerInfo *root, Index rti)
 		if (pc->parent_relid == rti)
 		{
 			result = pc->child_rels;
+			if (all_part_cols_p)
+				*all_part_cols_p = pc->all_part_cols;
 			break;
 		}
 	}
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index ccf2145..fc7c597 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -107,12 +107,14 @@ static void expand_partitioned_rtentry(PlannerInfo *root,
 						   PlanRowMark *parentrc, PartitionDesc partdesc,
 						   LOCKMODE lockmode,
 						   bool *has_child, List **appinfos,
+						   Bitmapset **all_part_cols,
 						   List **partitioned_child_rels);
 static void expand_single_inheritance_child(PlannerInfo *root,
 								RangeTblEntry *parentrte,
 								Index parentRTindex, Relation parentrel,
 								PlanRowMark *parentrc, Relation childrel,
 								bool *has_child, List **appinfos,
+								Bitmapset **all_part_cols,
 								List **partitioned_child_rels);
 static void make_inh_translation_list(Relation oldrelation,
 						  Relation newrelation,
@@ -1397,6 +1399,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	bool		has_child;
 	PartitionedChildRelInfo *pcinfo;
 	List	   *partitioned_child_rels = NIL;
+	Bitmapset  *all_part_cols = NULL;
 
 	/* Does RT entry allow inheritance? */
 	if (!rte->inh)
@@ -1479,11 +1482,13 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 		expand_single_inheritance_child(root, rte, rti, oldrelation, oldrc,
 										oldrelation,
 										&has_child, &appinfos,
+										&all_part_cols,
 										&partitioned_child_rels);
 		expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
 									  RelationGetPartitionDesc(oldrelation),
 									  lockmode,
 									  &has_child, &appinfos,
+									  &all_part_cols,
 									  &partitioned_child_rels);
 	}
 	else
@@ -1519,6 +1524,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 			expand_single_inheritance_child(root, rte, rti, oldrelation, oldrc,
 											newrelation,
 											&has_child, &appinfos,
+											&all_part_cols,
 											&partitioned_child_rels);
 
 			/* Close child relations, but keep locks */
@@ -1558,6 +1564,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 		Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
 		pcinfo->parent_relid = rti;
 		pcinfo->child_rels = partitioned_child_rels;
+		pcinfo->all_part_cols = all_part_cols;
 		root->pcinfo_list = lappend(root->pcinfo_list, pcinfo);
 	}
 
@@ -1571,6 +1578,7 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 						   PlanRowMark *parentrc, PartitionDesc partdesc,
 						   LOCKMODE lockmode,
 						   bool *has_child, List **appinfos,
+						   Bitmapset **all_part_cols,
 						   List **partitioned_child_rels)
 {
 	int			i;
@@ -1595,6 +1603,7 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 		expand_single_inheritance_child(root, parentrte, parentRTindex,
 										parentrel, parentrc, childrel,
 										has_child, appinfos,
+										all_part_cols,
 										partitioned_child_rels);
 
 		/* If this child is itself partitioned, recurse */
@@ -1604,6 +1613,7 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 										  RelationGetPartitionDesc(childrel),
 										  lockmode,
 										  has_child, appinfos,
+										  all_part_cols,
 										  partitioned_child_rels);
 
 		/* Close child relation, but keep locks */
@@ -1625,6 +1635,7 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 								Index parentRTindex, Relation parentrel,
 								PlanRowMark *parentrc, Relation childrel,
 								bool *has_child, List **appinfos,
+								Bitmapset **all_part_cols,
 								List **partitioned_child_rels)
 {
 	Query	   *parse = root->parse;
@@ -1695,8 +1706,11 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 		}
 	}
 	else
+	{
 		*partitioned_child_rels = lappend_int(*partitioned_child_rels,
 											  childRTindex);
+		pull_child_partition_columns(all_part_cols, childrel, parentrel);
+	}
 
 	/*
 	 * Build a PlanRowMark if parent is marked FOR UPDATE/SHARE.
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 26567cb..326c858 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3162,6 +3162,8 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partitioned_rels' is an integer list of RT indexes of non-leaf tables in
  *		the partition tree, if this is an UPDATE/DELETE to a partitioned table.
  *		Otherwise NIL.
+ * 'part_cols_updated' is true if any partitioning columns are being updated,
+ *		either in the named relation or in a descendant partitioned table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
  * 'subpaths' is a list of Path(s) producing source data (one per rel)
  * 'subroots' is a list of PlannerInfo structs (one per rel)
@@ -3175,6 +3177,7 @@ ModifyTablePath *
 create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
@@ -3242,6 +3245,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->canSetTag = canSetTag;
 	pathnode->nominalRelation = nominalRelation;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
+	pathnode->part_cols_updated = part_cols_updated;
 	pathnode->resultRelations = resultRelations;
 	pathnode->subpaths = subpaths;
 	pathnode->subroots = subroots;
diff --git a/src/backend/rewrite/rewriteManip.c b/src/backend/rewrite/rewriteManip.c
index 5c17213..58e98c0 100644
--- a/src/backend/rewrite/rewriteManip.c
+++ b/src/backend/rewrite/rewriteManip.c
@@ -1224,6 +1224,7 @@ typedef struct
 	/* Target type when converting whole-row vars */
 	Oid			to_rowtype;
 	bool	   *found_whole_row;	/* output flag */
+	bool		coerced_var;	/* var is under a ConvertRowtypeExpr */
 } map_variable_attnos_context;
 
 static Node *
@@ -1267,22 +1268,29 @@ map_variable_attnos_mutator(Node *node,
 					/* Don't convert unless necessary. */
 					if (context->to_rowtype != var->vartype)
 					{
-						ConvertRowtypeExpr *r;
-
 						/* Var itself is converted to the requested type. */
 						newvar->vartype = context->to_rowtype;
 
 						/*
-						 * And a conversion node on top to convert back to the
-						 * original type.
+						 * If this var is already under a ConvertRowtypeExpr,
+						 * we don't have to add another one.
 						 */
-						r = makeNode(ConvertRowtypeExpr);
-						r->arg = (Expr *) newvar;
-						r->resulttype = var->vartype;
-						r->convertformat = COERCE_IMPLICIT_CAST;
-						r->location = -1;
-
-						return (Node *) r;
+						if (!context->coerced_var)
+						{
+							ConvertRowtypeExpr *r;
+
+							/*
+							 * Add a conversion node on top to convert back to
+							 * the original type.
+							 */
+							r = makeNode(ConvertRowtypeExpr);
+							r->arg = (Expr *) newvar;
+							r->resulttype = var->vartype;
+							r->convertformat = COERCE_IMPLICIT_CAST;
+							r->location = -1;
+
+							return (Node *) r;
+						}
 					}
 				}
 			}
@@ -1290,6 +1298,28 @@ map_variable_attnos_mutator(Node *node,
 		}
 		/* otherwise fall through to copy the var normally */
 	}
+	else if (IsA(node, ConvertRowtypeExpr))
+	{
+		ConvertRowtypeExpr *r = (ConvertRowtypeExpr *) node;
+
+		/*
+		 * If this is coercing a var (which is typical), convert only the var,
+		 * rather than adding another ConvertRowtypeExpr on top of it.
+		 */
+		if (IsA(r->arg, Var))
+		{
+			ConvertRowtypeExpr *newnode;
+
+			newnode = (ConvertRowtypeExpr *) palloc(sizeof(ConvertRowtypeExpr));
+			*newnode = *r;
+			context->coerced_var = true;
+			newnode->arg = (Expr *) map_variable_attnos_mutator((Node *) r->arg, context);
+			context->coerced_var = false;
+
+			return (Node *) newnode;
+		}
+		/* Else fall through to the expression tree mutator */
+	}
 	else if (IsA(node, Query))
 	{
 		/* Recurse into RTE subquery or not-yet-planned sublink subquery */
@@ -1321,6 +1351,7 @@ map_variable_attnos(Node *node,
 	context.map_length = map_length;
 	context.to_rowtype = to_rowtype;
 	context.found_whole_row = found_whole_row;
+	context.coerced_var = false;
 
 	*found_whole_row = false;
 
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 2283c67..2e29276 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -68,6 +68,12 @@ typedef struct PartitionDispatchData
 	int		   *indexes;
 } PartitionDispatchData;
 
+typedef struct PartitionWalker
+{
+	List	   *rels_list;
+	ListCell   *cur_cell;
+} PartitionWalker;
+
 typedef struct PartitionDispatchData *PartitionDispatch;
 
 extern void RelationBuildPartitionDesc(Relation relation);
@@ -80,12 +86,16 @@ extern void check_new_partition_bound(char *relname, Relation parent,
 extern Oid	get_partition_parent(Oid relid);
 extern List *get_qual_from_partbound(Relation rel, Relation parent,
 						PartitionBoundSpec *spec);
-extern List *map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+extern List *map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
 
+extern void partition_walker_init(PartitionWalker *walker, Relation rel);
+extern Relation partition_walker_next(PartitionWalker *walker,
+									  Relation *parent);
+
 /* For tuple routing */
 extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
 								 int *num_parted, List **leaf_part_oids);
@@ -99,4 +109,8 @@ extern int get_partition_for_tuple(PartitionDispatch *pd,
 						EState *estate,
 						PartitionDispatchData **failed_at,
 						TupleTableSlot **failed_slot);
+extern void pull_child_partition_columns(Bitmapset **bitmapset,
+							 Relation rel,
+							 Relation parent);
+
 #endif							/* PARTITION_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 7708818..8e2bf5f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,7 +187,10 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
@@ -207,10 +210,12 @@ extern void EvalPlanQualSetTuple(EPQState *epqstate, Index rti,
 					 HeapTuple tuple);
 extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
 extern void ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionDispatch **pd,
-							   ResultRelInfo **partitions,
+							   ResultRelInfo ***partitions,
 							   TupleConversionMap ***tup_conv_maps,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions);
@@ -218,6 +223,8 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot, EState *estate);
 
 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
 extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 90a60ab..3034b01 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -511,6 +511,11 @@ typedef struct EState
 	struct dsa_area *es_query_dsa;
 } EState;
 
+/* For a given result relation, get its columns being inserted/updated. */
+#define GetInsertedColumns(relinfo, estate) \
+	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->insertedCols)
+#define GetUpdatedColumns(relinfo, estate) \
+	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)
 
 /*
  * ExecRowMark -
@@ -978,14 +983,31 @@ typedef struct ModifyTableState
 	int			mt_num_dispatch;	/* Number of entries in the above array */
 	int			mt_num_partitions;	/* Number of members in the following
 									 * arrays */
-	ResultRelInfo *mt_partitions;	/* Per partition result relation */
-	TupleConversionMap **mt_partition_tupconv_maps;
-	/* Per partition tuple conversion map */
+	ResultRelInfo **mt_partitions;	/* Per partition result relation pointers */
+
+	/*
+	 * Per partition conversion map to convert tuples from root to leaf
+	 * partition
+	 */
+	TupleConversionMap **mt_perleaf_parentchild_maps;
+
+	/*
+	 * Per partition conversion map to convert tuples from leaf partition to
+	 * root partitioned table
+	 */
+	TupleConversionMap **mt_perleaf_childparent_maps;
+
+	/*
+	 * Per subplan conversion map to convert tuples from leaf partition to
+	 * root partitioned table
+	 */
+	TupleConversionMap **mt_persubplan_childparent_maps;
+
 	TupleTableSlot *mt_partition_tuple_slot;
+	TupleTableSlot *mt_rootpartition_tuple_slot;
+
 	struct TransitionCaptureState *mt_transition_capture;
 	/* controls transition table population */
-	TupleConversionMap **mt_transition_tupconv_maps;
-	/* Per plan/partition tuple conversion */
 } ModifyTableState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index a382331..6981f58 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -219,6 +219,7 @@ typedef struct ModifyTable
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	int			resultRelIndex; /* index of first resultRel in plan's list */
 	int			rootResultRelIndex; /* index of the partitioned table root */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index a39e59d..e3ff127 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1579,6 +1579,7 @@ typedef struct ModifyTablePath
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	List	   *subpaths;		/* Path(s) producing source data */
 	List	   *subroots;		/* per-target-table PlannerInfos */
@@ -2021,6 +2022,10 @@ typedef struct AppendRelInfo
  * The child_rels list must contain at least one element, because the parent
  * partitioned table is itself counted as a child.
  *
+ * all_part_cols contains all attribute numbers from the parent that are
+ * used as partitioning columns by the parent or by some descendant that is
+ * itself partitioned.
+ *
  * These structs are kept in the PlannerInfo node's pcinfo_list.
  */
 typedef struct PartitionedChildRelInfo
@@ -2029,6 +2034,7 @@ typedef struct PartitionedChildRelInfo
 
 	Index		parent_relid;
 	List	   *child_rels;
+	Bitmapset  *all_part_cols;
 } PartitionedChildRelInfo;
 
 /*
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e372f88..b38f2f1 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -238,6 +238,7 @@ extern ModifyTablePath *create_modifytable_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index 2a4cf71..c6c15c5 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -57,6 +57,7 @@ extern Expr *preprocess_phv_expression(PlannerInfo *root, Expr *expr);
 
 extern bool plan_cluster_use_sort(Oid tableOid, Oid indexOid);
 
-extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti);
+extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti,
+										Bitmapset **all_part_cols_p);
 
 #endif							/* PLANNER_H */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 9366f04..6c0036b 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,25 +198,425 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+---------------------------
+-- UPDATE with row movement
+---------------------------
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c numeric
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+-- Create partitions intentionally in descending bound order, to test that
+-- the subplans are ordered in ascending bound order rather than by OID.
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (c numeric, a text, b int);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+create table part_c_1_100 (b int, c numeric, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, NULL), (''a'', 10, 200), (''b'', 12, 96), (''b'', 13, 97), (''b'', 15, 105), (''b'', 17, 105)'
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 105
+ part_c_100_200 | b | 17 | 105
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+(6 rows)
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+             QUERY PLAN              
+-------------------------------------
+ Update on range_parted
+   Update on part_a_1_a_10
+   Update on part_a_10_a_20
+   Update on part_b_1_b_10
+   Update on part_c_1_100
+   Update on part_c_100_200
+   ->  Seq Scan on part_a_1_a_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_a_10_a_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_1_b_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_1_100
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_100_200
+         Filter: (c > '97'::numeric)
+(16 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96;
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (12, 116, b).
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+select a, b, c from part_c_1_100 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select a, b, c from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 12 | 116
+ b | 13 | 117
+ b | 15 | 125
+ b | 17 | 125
+(4 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (117, b, 7).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_1_a_10  | a |  1 |    
+ part_a_1_a_10  | a |  4 | 200
+ part_b_1_b_10  | b |  7 | 117
+ part_b_1_b_10  | b |  9 | 125
+ part_c_100_200 | b | 11 | 125
+ part_c_100_200 | b | 12 | 116
+(6 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120).
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_1_a_10  | a |  1 |    
+ part_b_1_b_10  | b |  7 | 117
+ part_b_1_b_10  | b |  9 | 125
+ part_c_100_200 | b | 11 | 125
+ part_c_100_200 | b | 12 | 116
+ part_c_100_200 | b | 15 | 199
+(6 rows)
+
 -- cleanup
-drop table range_parted;
+drop view upview;
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+ range_parted | a | b  | c  
+--------------+---+----+----
+ (b,15,95)    | b | 15 | 95
+ (b,17,95)    | b | 17 | 95
+(2 rows)
+
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+ part_c_1_100   | b | 15 |  95
+ part_c_1_100   | b | 17 |  95
+(6 rows)
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 105
+ part_c_100_200 | b | 17 | 105
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+(6 rows)
+
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96), (b,13,97), (b,15,105), (b,17,105), new table = (b,12,110), (b,13,98), (b,15,106), (b,17,106)
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 12 | 110
+ part_c_100_200 | b | 15 | 106
+ part_c_100_200 | b | 17 | 106
+ part_c_1_100   | b | 13 |  98
+(6 rows)
+
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96), (b,13,97), (b,15,105), (b,17,105), new table = (b,12,146), (b,13,147), (b,15,155), (b,17,155)
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 12 | 146
+ part_c_100_200 | b | 13 | 147
+ part_c_100_200 | b | 15 | 155
+ part_c_100_200 | b | 17 | 155
+(6 rows)
+
+drop trigger trans_updatetrig ON range_parted;
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_c100_200 before update or insert on part_c_100_200
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 110
+ part_c_100_200 | b | 17 | 106
+ part_c_100_200 | b | 19 | 106
+ part_c_1_100   | b | 15 |  98
+(6 rows)
+
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 146
+ part_c_100_200 | b | 16 | 147
+ part_c_100_200 | b | 17 | 155
+ part_c_100_200 | b | 19 | 155
+(6 rows)
+
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_c100_200 ON part_c_100_200;
+drop function func_parted_mod_b();
+-- statement triggers with update row movement
+---------------------------------------------------
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 105
+ part_c_100_200 | b | 17 | 105
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+(6 rows)
+
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_100_200
+create trigger c100_delete_trig
+  after delete on part_c_100_200 for each statement execute procedure trigfunc();
+create trigger c100_update_trig
+  after update on part_c_100_200 for each statement execute procedure trigfunc();
+create trigger c100_insert_trig
+  after insert on part_c_100_200 for each statement execute procedure trigfunc();
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+NOTICE:  trigger = parent_update_trig fired on table range_parted during UPDATE
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 150
+ part_a_1_a_10  | a |  1 |    
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+ part_c_1_100   | b | 15 |  55
+ part_c_1_100   | b | 17 |  55
+(6 rows)
+
+drop table mintab, range_parted CASCADE;
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+ERROR:  new row for relation "sub_part2" violates partition constraint
+DETAIL:  Failing row contains (2, 10, 2).
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Triggers can cause UPDATE row movement if it modified partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+drop trigger parted_mod_b ON sub_part1 ;
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+drop trigger trig_skip_delete ON sub_part1 ;
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 |  1 | 70
+ list_part1 | 2 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop table non_parted;
+drop function func_parted_mod_b ( ) ;
+drop table list_parted;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 6637119..da5130d 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,23 +107,253 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+
+---------------------------
+-- UPDATE with row movement
+---------------------------
+
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c numeric
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+
+-- Create partitions intentionally in descending bound order, so as to test
+-- that the sub plans are getting ordered in ascending bound order rather than ordered by the oid values.
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (c numeric, a text, b int);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+create table part_c_1_100 (b int, c numeric, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, NULL), (''a'', 10, 200), (''b'', 12, 96), (''b'', 13, 97), (''b'', 15, 105), (''b'', 17, 105)'
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96;
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+select a, b, c from part_c_1_100 order by 1, 2, 3;
+select a, b, c from part_c_100_200 order by 1, 2, 3;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
 
 -- cleanup
-drop table range_parted;
+drop view upview;
+
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+drop trigger trans_updatetrig ON range_parted;
+
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_c100_200 before update or insert on part_c_100_200
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_c100_200 ON part_c_100_200;
+drop function func_parted_mod_b();
+
+
+-- statement triggers with update row movement
+---------------------------------------------------
+
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_100_200
+create trigger c100_delete_trig
+  after delete on part_c_100_200 for each statement execute procedure trigfunc();
+create trigger c100_update_trig
+  after update on part_c_100_200 for each statement execute procedure trigfunc();
+create trigger c100_insert_trig
+  after insert on part_c_100_200 for each statement execute procedure trigfunc();
+
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+drop table mintab, range_parted CASCADE;
+
+
+
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+
+
+-----------
+-- Triggers can cause UPDATE row movement if it modified partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger parted_mod_b ON sub_part1 ;
+
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger trig_skip_delete ON sub_part1 ;
+
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+drop table non_parted;
+
+drop function func_parted_mod_b ( ) ;
+drop table list_parted;
#151Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#149)
Re: UPDATE of partition key

On Thu, Sep 7, 2017 at 6:17 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 3 September 2017 at 17:10, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

After recent commit 30833ba154, now the partitions are expanded in
depth-first order. It didn't seem worthwhile rebasing my partition
walker changes onto the latest code. So in the attached patch, I have
removed all the partition walker changes. But
RelationGetPartitionDispatchInfo() traverses in breadth-first order,
which is different from the update result rels order (because
inheritance expansion order is depth-first). So, in order to make the
tuple-routing-related leaf partitions in the same order as that of the
update result rels, we would have to make changes in
RelationGetPartitionDispatchInfo(), which I am not sure whether it is
going to be done as part of the thread "expanding inheritance in
partition bound order" [1]. For now, in the attached patch, I have
reverted back to the hash table method to find the leaf partitions in
the update result rels.

[1] /messages/by-id/CAJ3gD9eyudCNU6V-veMme+eyzfX_ey+gEzULMzOw26c3f9rzdg@mail.gmail.com

As mentioned by Amit Langote in the above mail thread, he is going to
do changes for making RelationGetPartitionDispatchInfo() return the
leaf partitions in depth-first order. Once that is done, I will then
remove the hash table method for finding leaf partitions in update
result rels, and instead use the earlier efficient method that takes
advantage of the fact that update result rels and leaf partitions are
in the same order.

Has he posted that patch yet? I don't think I saw it, but maybe I
missed something.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#152Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Robert Haas (#151)
Re: UPDATE of partition key

On 2017/09/08 18:57, Robert Haas wrote:

As mentioned by Amit Langote in the above mail thread, he is going to
do changes for making RelationGetPartitionDispatchInfo() return the
leaf partitions in depth-first order. Once that is done, I will then
remove the hash table method for finding leaf partitions in update
result rels, and instead use the earlier efficient method that takes
advantage of the fact that update result rels and leaf partitions are
in the same order.

Has he posted that patch yet? I don't think I saw it, but maybe I
missed something.

I will post on that thread in a moment.

Thanks,
Amit


#153amul sul
sulamul@gmail.com
In reply to: Amit Langote (#152)
1 attachment(s)
Re: UPDATE of partition key

On Thu, May 18, 2017 at 9:13 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, May 17, 2017 at 5:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, May 17, 2017 at 6:29 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think we can do this even without using an additional infomask bit.
As suggested by Greg up thread, we can set InvalidBlockId in ctid to
indicate such an update.

Hmm. How would that work?

We can pass a flag, say row_moved (or require_row_movement), to
heap_delete, which will in turn set InvalidBlockId in the ctid instead of
setting it to self. Then ExecUpdate needs to check for this and
return an error when heap_update is not successful (result !=
HeapTupleMayBeUpdated). Can you explain what difficulty you are
envisioning?

The attached WIP patch incorporates the above logic, although I have yet to
check all the code for places that might be using ip_blkid. I have a small
query here: do we need an error in the HeapTupleSelfUpdated case as well?

Note that the patch should be applied on top of Amit Khandekar's latest
patch (v17_rebased).

Regards,
Amul

Attachments:

0002-invalidate-ctid.ip_blkid-WIP.patch (application/octet-stream)
From 2c268beb0f51de3e84ddf5d6954ad1a1a4f0213f Mon Sep 17 00:00:00 2001
From: Amul Sul <sulamul@gmail.com>
Date: Fri, 8 Sep 2017 16:04:12 +0530
Subject: [PATCH 2/2] invalidate ctid.ip_blkid WIP

Set ctid.ip_blkid to InvalidBlockNumber while moving tuple to the
another partition.

Note: Apply this patch to the top of the
update-partition-key_v17_rebased.patch
---
 src/backend/access/heap/heapam.c       | 11 +++++++++--
 src/backend/commands/trigger.c         |  5 +++++
 src/backend/executor/execMain.c        |  4 ++++
 src/backend/executor/nodeLockRows.c    |  5 +++++
 src/backend/executor/nodeModifyTable.c | 21 +++++++++++++++++----
 src/include/access/heapam.h            |  2 +-
 6 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d20f038..66cb22b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3027,7 +3027,7 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
 HTSU_Result
 heap_delete(Relation relation, ItemPointer tid,
 			CommandId cid, Snapshot crosscheck, bool wait,
-			HeapUpdateFailureData *hufd)
+			HeapUpdateFailureData *hufd, bool row_moved)
 {
 	HTSU_Result result;
 	TransactionId xid = GetCurrentTransactionId();
@@ -3295,6 +3295,13 @@ l1:
 	/* Make sure there is no forward chain link in t_ctid */
 	tp.t_data->t_ctid = tp.t_self;
 
+	/*
+	 * Sets a block identifier to the InvalidBlockNumber to indicate such an
+	 * update being moved tuple to an another partition.
+	 */
+	if (row_moved)
+		BlockIdSet(&((tp.t_data->t_ctid).ip_blkid), InvalidBlockNumber);
+
 	MarkBufferDirty(buffer);
 
 	/*
@@ -3420,7 +3427,7 @@ simple_heap_delete(Relation relation, ItemPointer tid)
 	result = heap_delete(relation, tid,
 						 GetCurrentCommandId(true), InvalidSnapshot,
 						 true /* wait for commit */ ,
-						 &hufd);
+						 &hufd, false);
 	switch (result)
 	{
 		case HeapTupleSelfUpdated:
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 6904c4e..862206e 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -3111,6 +3111,11 @@ ltrmark:;
 					ereport(ERROR,
 							(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 							 errmsg("could not serialize access due to concurrent update")));
+				if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("tuple to be updated was already moved to an another partition due to concurrent update")));
+
 				if (!ItemPointerEquals(&hufd.ctid, &tuple.t_self))
 				{
 					/* it was updated, so look at the updated version */
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 1508f72..35d172e 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2704,6 +2704,10 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
 						ereport(ERROR,
 								(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 								 errmsg("could not serialize access due to concurrent update")));
+					if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+						ereport(ERROR,
+								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+								 errmsg("tuple to be updated was already moved to an another partition due to concurrent update")));
 
 					/* Should not encounter speculative tuple on recheck */
 					Assert(!HeapTupleHeaderIsSpeculative(tuple.t_data));
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 9389560..1b388e6 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -218,6 +218,11 @@ lnext:
 					ereport(ERROR,
 							(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 							 errmsg("could not serialize access due to concurrent update")));
+				if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("tuple to be locked was already moved to an another partition due to concurrent update")));
+
 				if (ItemPointerEquals(&hufd.ctid, &tuple.t_self))
 				{
 					/* Tuple was deleted, so don't return it */
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index ecf51db..4d01324 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -740,7 +740,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   EState *estate,
 		   bool *delete_skipped,
 		   bool process_returning,
-		   bool canSetTag)
+		   bool canSetTag,
+		   bool row_moved)
 {
 	ResultRelInfo *resultRelInfo;
 	Relation	resultRelationDesc;
@@ -830,7 +831,8 @@ ldelete:;
 							 estate->es_output_cid,
 							 estate->es_crosscheck_snapshot,
 							 true /* wait for commit */ ,
-							 &hufd);
+							 &hufd,
+							 row_moved);
 		switch (result)
 		{
 			case HeapTupleSelfUpdated:
@@ -876,6 +878,11 @@ ldelete:;
 					ereport(ERROR,
 							(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 							 errmsg("could not serialize access due to concurrent update")));
+				if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("tuple to be updated was already moved to an another partition due to concurrent update")));
+
 				if (!ItemPointerEquals(tupleid, &hufd.ctid))
 				{
 					TupleTableSlot *epqslot;
@@ -1151,7 +1158,7 @@ lreplace:;
 			 * from INSERT.
 			 */
 			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
-					   &delete_skipped, false, false);
+					   &delete_skipped, false, false, true);
 
 			/*
 			 * For some reason if DELETE didn't happen (for e.g. trigger
@@ -1262,6 +1269,11 @@ lreplace:;
 					ereport(ERROR,
 							(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 							 errmsg("could not serialize access due to concurrent update")));
+				if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("tuple to be updated was already moved to an another partition due to concurrent update")));
+
 				if (!ItemPointerEquals(tupleid, &hufd.ctid))
 				{
 					TupleTableSlot *epqslot;
@@ -1281,6 +1293,7 @@ lreplace:;
 						goto lreplace;
 					}
 				}
+
 				/* tuple already deleted; nothing to do */
 				return NULL;
 
@@ -2000,7 +2013,7 @@ ExecModifyTable(PlanState *pstate)
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
 								  &node->mt_epqstate, estate,
-								  NULL, true, node->canSetTag);
+								  NULL, true, node->canSetTag, false);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4e41024..76f56cf 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -156,7 +156,7 @@ extern void heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 				  CommandId cid, int options, BulkInsertState bistate);
 extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
 			CommandId cid, Snapshot crosscheck, bool wait,
-			HeapUpdateFailureData *hufd);
+			HeapUpdateFailureData *hufd, bool row_moved);
 extern void heap_finish_speculative(Relation relation, HeapTuple tuple);
 extern void heap_abort_speculative(Relation relation, HeapTuple tuple);
 extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
-- 
2.6.2

#154 Amit Kapila
amit.kapila16@gmail.com
In reply to: amul sul (#153)
Re: UPDATE of partition key

On Fri, Sep 8, 2017 at 4:51 PM, amul sul <sulamul@gmail.com> wrote:

On Thu, May 18, 2017 at 9:13 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Wed, May 17, 2017 at 5:17 PM, Robert Haas <robertmhaas@gmail.com>
wrote:

On Wed, May 17, 2017 at 6:29 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

I think we can do this even without using an additional infomask bit.
As suggested by Greg up thread, we can set InvalidBlockId in ctid to
indicate such an update.

Hmm. How would that work?

We can pass a flag, say row_moved (or require_row_movement), to
heap_delete, which will in turn set InvalidBlockId in the ctid instead
of setting it to self. Then ExecUpdate needs to check for the same
and return an error when heap_update is not successful (result !=
HeapTupleMayBeUpdated). Can you explain what difficulty you are
envisioning?

Attached is a WIP patch that incorporates the above logic, although I have
yet to check all the code for places which might be using ip_blkid. I have
a small query here: do we need an error in the HeapTupleSelfUpdated case as
well?

No, because that case is anyway a no-op (or an error, depending on
whether the tuple was updated/deleted by the same command or by a later
command). Basically, even if the row had not been moved to another
partition, we would not have allowed the command to proceed with the
update. This handling is there to make the command fail rather than be a
silent no-op, in cases where otherwise (had the tuple not been moved to
another partition) the command would have succeeded.
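For what it's worth, the cross-session case that the patch does turn into an error can be sketched as below. This is only an illustration: the table and column names are made up, and the error text is taken from the WIP patch.

```sql
-- Hypothetical setup: a range-partitioned table with two partitions.
CREATE TABLE tab (a int, b text) PARTITION BY RANGE (a);
CREATE TABLE tab_p1 PARTITION OF tab FOR VALUES FROM (1) TO (10);
CREATE TABLE tab_p2 PARTITION OF tab FOR VALUES FROM (10) TO (20);
INSERT INTO tab VALUES (5, 'foo');

-- Session 1: move the row from tab_p1 to tab_p2, but don't commit yet.
BEGIN;
UPDATE tab SET a = 15 WHERE a = 5;

-- Session 2: blocks on the row lock.  Once session 1 commits, the old
-- tuple's ctid carries InvalidBlockNumber, so instead of silently doing
-- nothing this is now expected to fail with the patch's error:
--   ERROR:  tuple to be updated was already moved to an another
--           partition due to concurrent update
UPDATE tab SET b = 'bar' WHERE a = 5;
```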

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#155 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Khandekar (#148)
Re: UPDATE of partition key

On Thu, Sep 7, 2017 at 11:41 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 6 September 2017 at 21:47, Dilip Kumar <dilipbalaut@gmail.com> wrote:

Actually, since transition tables came in, the functions like
ExecARUpdateTriggers() or ExecARInsertTriggers() have this additional
purpose of capturing transition table rows, so that the images of the
tables are visible when statement triggers are fired that refer to
these transition tables. So in the above code, these functions only
capture rows, they do not add any event for firing any ROW triggers.
AfterTriggerSaveEvent() returns without adding any event if it's
called only for transition capture. So even if UPDATE row triggers are
defined, they won't get fired in case of row movement, although the
updated rows would be captured if transition tables are referenced in
these triggers or in the statement triggers.

OK, then I have one more question:

With transition tables, we can only support statement-level triggers, and
for an UPDATE statement we are only going to execute the UPDATE
statement-level trigger. So is there any point in making transition table
entries for the DELETE/INSERT triggers, since those transition tables will
never be used? Or am I missing something?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


#156 Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Dilip Kumar (#155)
Re: UPDATE of partition key

On 11 September 2017 at 21:12, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Sep 7, 2017 at 11:41 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 6 September 2017 at 21:47, Dilip Kumar <dilipbalaut@gmail.com> wrote:

Actually, since transition tables came in, the functions like
ExecARUpdateTriggers() or ExecARInsertTriggers() have this additional
purpose of capturing transition table rows, so that the images of the
tables are visible when statement triggers are fired that refer to
these transition tables. So in the above code, these functions only
capture rows, they do not add any event for firing any ROW triggers.
AfterTriggerSaveEvent() returns without adding any event if it's
called only for transition capture. So even if UPDATE row triggers are
defined, they won't get fired in case of row movement, although the
updated rows would be captured if transition tables are referenced in
these triggers or in the statement triggers.

Ok then I have one more question,

With transition tables, we can only support statement-level triggers

Yes, we don't support row triggers with transition tables if the table
is a partition.

and for an UPDATE statement, we are only going to execute the UPDATE
statement-level trigger? So is there any point in making transition table
entries for the DELETE/INSERT triggers, as those transition tables will
never be used?

But the statement-level trigger function can refer to the OLD TABLE and
NEW TABLE, which will contain all the old rows and new rows
respectively. So the updated rows of the partitions (including the
moved ones) need to be captured: for the OLD TABLE, we need to capture
the deleted row, and for the NEW TABLE, we need to capture the inserted
row.

In the regression test update.sql, check how the statement trigger
trans_updatetrig prints all the updated rows, including the moved
ones.
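For reference, the kind of statement-level trigger being described looks roughly like the following sketch. This is not the actual regression-test code; everything here except the trigger name trans_updatetrig is made up, and a partitioned table "tab" is assumed.

```sql
-- The OLD/NEW transition tables seen by the statement-level trigger
-- must include rows that were moved across partitions by the UPDATE.
CREATE FUNCTION report_update() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
  RAISE NOTICE 'old rows: %', (SELECT count(*) FROM old_rows);
  RAISE NOTICE 'new rows: %', (SELECT count(*) FROM new_rows);
  RETURN NULL;
END $$;

CREATE TRIGGER trans_updatetrig
  AFTER UPDATE ON tab
  REFERENCING OLD TABLE AS old_rows NEW TABLE AS new_rows
  FOR EACH STATEMENT EXECUTE PROCEDURE report_update();
```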


--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


#157 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Khandekar (#156)
Re: UPDATE of partition key

On Tue, Sep 12, 2017 at 11:15 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

But the statement level trigger function can refer to OLD TABLE and
NEW TABLE, which will contain all the OLD rows and NEW rows
respectively. So the updated rows of the partitions (including the
moved ones) need to be captured. So for OLD TABLE, we need to capture
the deleted row, and for NEW TABLE, we need to capture the inserted
row.

Yes, I agree. So in ExecDelete, for the OLD TABLE we only need to call
ExecARUpdateTriggers, which will make an entry in the OLD TABLE only if a
transition table is present, and otherwise do nothing; I guess this part
already exists in your patch. We are also calling ExecARDeleteTriggers,
and I guess that is to fire the row-level DELETE trigger, which is also
fine. What I don't understand is that if there is no row-level DELETE
trigger and there is only a statement-level DELETE trigger with a
transition table, we still make an entry in the DELETE trigger's
transition table, and that entry will never be used.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


#158 Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Dilip Kumar (#157)
Re: UPDATE of partition key

On 12 September 2017 at 11:57, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Sep 12, 2017 at 11:15 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

But the statement level trigger function can refer to OLD TABLE and
NEW TABLE, which will contain all the OLD rows and NEW rows
respectively. So the updated rows of the partitions (including the
moved ones) need to be captured. So for OLD TABLE, we need to capture
the deleted row, and for NEW TABLE, we need to capture the inserted
row.

Yes, I agree. So in ExecDelete, for the OLD TABLE we only need to call
ExecARUpdateTriggers, which will make an entry in the OLD TABLE only if a
transition table is present, and otherwise do nothing; I guess this part
already exists in your patch. We are also calling ExecARDeleteTriggers,
and I guess that is to fire the row-level DELETE trigger, which is also
fine. What I don't understand is that if there is no row-level DELETE
trigger and there is only a statement-level DELETE trigger with a
transition table, we still make an entry in the DELETE trigger's
transition table, and that entry will never be used.

Hmm, ok, that might be happening, since we are calling
ExecARDeleteTriggers() with mtstate->mt_transition_capture non-NULL,
and so the deleted tuple gets captured even when there is no UPDATE
statement trigger defined, which looks redundant. Will check this.
Thanks.


--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


#159 Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#150)
1 attachment(s)
Re: UPDATE of partition key

On 8 September 2017 at 15:21, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Attached is the patch rebased on latest HEAD.

The patch got bit-rotted again. The rebased version v17_rebased_2.patch
also has some scenarios added in update.sql that cover UPDATE row
movement from the non-default to the default partition and vice versa.

Thanks
-Amit Khandekar

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

update-partition-key_v17_rebased_2.patch (application/octet-stream)
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index b05a9c2..5a436a1 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -2993,6 +2993,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition where this row satisfies its partition constraint.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3285,9 +3290,20 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</> causes a row to move from one partition to
+       another, there is a chance that another concurrent <command>UPDATE</> or
+       <command>DELETE</> misses this row. Suppose, during the row movement,
+       the row is still visible for the concurrent session, and it is about to
+       do an <command>UPDATE</> or <command>DELETE</> operation on the same
+       row. This DML operation can silently miss this row if the row now gets
+       deleted from the partition by the first session as part of its
+       <command>UPDATE</> row movement. In such case, the concurrent
+       <command>UPDATE</>/<command>DELETE</>, being unaware of the row
+       movement, interprets that the row has just been deleted so there is
+       nothing to be done for this row. Whereas, in the usual case where the
+       table is not partitioned, or where there is no row movement, the second
+       session would have identified the newly updated row and carried
+       <command>UPDATE</>/<command>DELETE</> on this new row version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 8a1619f..28cfc1a 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</> and
+   <command>INSERT</> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</> or <command>DELETE</> on the same row may miss
+   this row. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index 950245d..72300a0 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -160,6 +160,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it will be performed as a
+    <command>DELETE</command> from the original partition followed by
+    <command>INSERT</command> into the new partition. In this case, all
+    row-level <literal>BEFORE</> <command>UPDATE</command> triggers and all
+    row-level <literal>BEFORE</> <command>DELETE</command> triggers are fired
+    on the original partition. Then all row-level <literal>BEFORE</>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</>
+    triggers are concerned, <literal>AFTER</> <command>DELETE</command> and
+    <literal>AFTER</> <command>INSERT</command> triggers are applied; but
+    <literal>AFTER</> <command>UPDATE</command> triggers are not applied
+    because the <command>UPDATE</command> has been converted to a
+    <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 73eff17..a0d3583 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1103,7 +1103,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
 
 /*
  * map_partition_varattnos - maps varattno of any Vars in expr from the
- * parent attno to partition attno.
+ * attno's of 'from_rel' partition to the attno's of 'to_rel' partition.
+ * The rels can be both leaf partition or a partitioned table.
  *
  * We must allow for cases where physical attnos of a partition can be
  * different from the parent's.
@@ -1116,8 +1117,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
  * are working on Lists, so it's less messy to do the casts internally.
  */
 List *
-map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row)
 {
 	bool		my_found_whole_row = false;
@@ -1126,14 +1127,14 @@ map_partition_varattnos(List *expr, int target_varno,
 	{
 		AttrNumber *part_attnos;
 
-		part_attnos = convert_tuples_by_name_map(RelationGetDescr(partrel),
-												 RelationGetDescr(parent),
+		part_attnos = convert_tuples_by_name_map(RelationGetDescr(to_rel),
+												 RelationGetDescr(from_rel),
 												 gettext_noop("could not convert row type"));
 		expr = (List *) map_variable_attnos((Node *) expr,
-											target_varno, 0,
+											fromrel_varno, 0,
 											part_attnos,
-											RelationGetDescr(parent)->natts,
-											RelationGetForm(partrel)->reltype,
+											RelationGetDescr(from_rel)->natts,
+											RelationGetForm(to_rel)->reltype,
 											&my_found_whole_row);
 	}
 
@@ -2473,6 +2474,77 @@ error_exit:
 }
 
 /*
+ * For each column of rel which is in the partition key or which appears
+ * in an expression which is in the partition key, translate the attribute
+ * number of that column according to the given parent, and add the resulting
+ * column number to the bitmapset, offset as we frequently do by
+ * FirstLowInvalidHeapAttributeNumber.
+ */
+void
+pull_child_partition_columns(Bitmapset **bitmapset,
+							 Relation rel,
+							 Relation parent)
+{
+	PartitionKey key = RelationGetPartitionKey(rel);
+	int16		partnatts = get_partition_natts(key);
+	List	   *partexprs = get_partition_exprs(key);
+	ListCell   *lc;
+	Bitmapset  *child_keycols = NULL;
+	int			i;
+	AttrNumber *map;
+	int			child_keycol = -1;
+
+	/*
+	 * First, compute the complete set of partition columns for this rel. For
+	 * compatibility with the API exposed by pull_varattnos, we offset the
+	 * column numbers by FirstLowInvalidHeapAttributeNumber.
+	 */
+	for (i = 0; i < partnatts; i++)
+	{
+		AttrNumber	partattno = get_partition_col_attnum(key, i);
+
+		if (partattno != 0)
+			child_keycols =
+				bms_add_member(child_keycols,
+							   partattno - FirstLowInvalidHeapAttributeNumber);
+	}
+	foreach(lc, partexprs)
+	{
+		Node	   *expr = (Node *) lfirst(lc);
+
+		pull_varattnos(expr, 1, &child_keycols);
+	}
+
+	/*
+	 * Next, work out how to convert from the attribute numbers for the child
+	 * to the attribute numbers for the parent.
+	 */
+	map =
+		convert_tuples_by_name_map(RelationGetDescr(parent),
+								   RelationGetDescr(rel),
+								   gettext_noop("could not convert row type"));
+
+	/*
+	 * For each child key column we have identified, translate to the
+	 * corresponding parent key column.  Entry 0 in the map array corresponds
+	 * to attribute number 1, which corresponds to a bitmapset entry for 1 -
+	 * FirstLowInvalidHeapAttributeNumber.
+	 */
+	while ((child_keycol = bms_next_member(child_keycols, child_keycol)) >= 0)
+	{
+		int			kc = child_keycol + FirstLowInvalidHeapAttributeNumber;
+
+		Assert(kc > 0 && kc <= RelationGetNumberOfAttributes(rel));
+		*bitmapset =
+			bms_add_member(*bitmapset,
+						   map[kc - 1] - FirstLowInvalidHeapAttributeNumber);
+	}
+
+	/* Release memory. */
+	pfree(map);
+}
+
+/*
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfa3f05..4ac5bd6 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -168,7 +168,7 @@ typedef struct CopyStateData
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;	/* Number of entries in the above array */
 	int			num_partitions; /* Number of members in the following arrays */
-	ResultRelInfo *partitions;	/* Per partition result relation */
+	ResultRelInfo **partitions;	/* Per partition result relation pointers */
 	TupleConversionMap **partition_tupconv_maps;
 	TupleTableSlot *partition_tuple_slot;
 	TransitionCaptureState *transition_capture;
@@ -2446,13 +2446,15 @@ CopyFrom(CopyState cstate)
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo *partitions;
+		ResultRelInfo **partitions;
 		TupleConversionMap **partition_tupconv_maps;
 		TupleTableSlot *partition_tuple_slot;
 		int			num_parted,
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(cstate->rel,
+									   NULL,
+									   0,
 									   1,
 									   estate,
 									   &partition_dispatch_info,
@@ -2482,7 +2484,7 @@ CopyFrom(CopyState cstate)
 			for (i = 0; i < cstate->num_partitions; ++i)
 			{
 				cstate->transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(cstate->partitions[i].ri_RelationDesc),
+					convert_tuples_by_name(RelationGetDescr(cstate->partitions[i]->ri_RelationDesc),
 										   RelationGetDescr(cstate->rel),
 										   gettext_noop("could not convert row type"));
 			}
@@ -2616,7 +2618,7 @@ CopyFrom(CopyState cstate)
 			 * to the selected partition.
 			 */
 			saved_resultRelInfo = resultRelInfo;
-			resultRelInfo = cstate->partitions + leaf_part_index;
+			resultRelInfo = cstate->partitions[leaf_part_index];
 
 			/* We do not yet have a way to insert into a foreign partition */
 			if (resultRelInfo->ri_FdwRoutine)
@@ -2726,7 +2728,7 @@ CopyFrom(CopyState cstate)
 
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr || check_partition_constr)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
@@ -2846,7 +2848,7 @@ CopyFrom(CopyState cstate)
 		}
 		for (i = 0; i < cstate->num_partitions; i++)
 		{
-			ResultRelInfo *resultRelInfo = cstate->partitions + i;
+			ResultRelInfo *resultRelInfo = cstate->partitions[i];
 
 			ExecCloseIndices(resultRelInfo);
 			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 269c9e1..f9ea29f 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -67,15 +67,6 @@ int			SessionReplicationRole = SESSION_REPLICATION_ROLE_ORIGIN;
 /* How many levels deep into trigger execution are we? */
 static int	MyTriggerDepth = 0;
 
-/*
- * Note that similar macros also exist in executor/execMain.c.  There does not
- * appear to be any good header to put them into, given the structures that
- * they use, so we let them be duplicated.  Be sure to update all if one needs
- * to be changed, however.
- */
-#define GetUpdatedColumns(relinfo, estate) \
-	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)
-
 /* Local function prototypes */
 static void ConvertTriggerToFK(CreateTrigStmt *stmt, Oid funcoid);
 static void SetTriggerFlags(TriggerDesc *trigdesc, Trigger *trigger);
@@ -2903,8 +2894,13 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 	{
 		HeapTuple	trigtuple;
 
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
+		/*
+		 * Note: if the UPDATE is converted into a DELETE+INSERT as part of
+		 * update-partition-key operation, then this function is also called
+		 * separately for DELETE and INSERT to capture transition table rows.
+		 * In such case, either old tuple or new tuple can be NULL.
+		 */
+		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
 			trigtuple = GetTupleForTrigger(estate,
 										   NULL,
 										   relinfo,
@@ -5211,7 +5207,12 @@ AfterTriggerPendingOnRel(Oid relid)
  *	triggers actually need to be queued.  It is also called after each row,
  *	even if there are no triggers for that event, if there are any AFTER
  *	STATEMENT triggers for the statement which use transition tables, so that
- *	the transition tuplestores can be built.
+ *	the transition tuplestores can be built. Furthermore, if the transition
+ *  capture is happening for UPDATEd rows being moved to another partition due
+ *  partition-key change, then this function is called once when the row is
+ *  deleted (to capture OLD row), and once when the row is inserted to another
+ *  partition (to capture NEW row). This is done separately because DELETE and
+ *  INSERT happen on different tables.
  *
  *	Transition tuplestores are built now, rather than when events are pulled
  *	off of the queue because AFTER ROW triggers are allowed to select from the
@@ -5260,12 +5261,27 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		bool		update_new_table = transition_capture->tcs_update_new_table;
 		bool		insert_new_table = transition_capture->tcs_insert_new_table;;
 
-		if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_old_table))
+		/*
+		 * For capturing transition tuples for UPDATE events fired during
+		 * partition row movement, either oldtup or newtup can be NULL,
+		 * depending on whether the event is for a row being deleted from the
+		 * old partition or for a row being inserted into the new one. But
+		 * in any case, oldtup should always be non-NULL for DELETE events, and
+		 * newtup should be non-NULL for INSERT events, because for transition
+		 * capture with partition row movement, INSERT and DELETE events don't
+		 * fire; only UPDATE event is fired.
+		 */
+		Assert(!(event == TRIGGER_EVENT_DELETE && delete_old_table &&
+				 oldtup == NULL));
+		Assert(!(event == TRIGGER_EVENT_INSERT && insert_new_table &&
+				 newtup == NULL));
+
+		if (oldtup != NULL &&
+			((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_old_table)))
 		{
 			Tuplestorestate *old_tuplestore;
 
-			Assert(oldtup != NULL);
 			old_tuplestore = transition_capture->tcs_old_tuplestore;
 
 			if (map != NULL)
@@ -5278,12 +5294,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			else
 				tuplestore_puttuple(old_tuplestore, oldtup);
 		}
-		if ((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_new_table))
+		if (newtup != NULL &&
+			((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
+			(event == TRIGGER_EVENT_UPDATE && update_new_table)))
 		{
 			Tuplestorestate *new_tuplestore;
 
-			Assert(newtup != NULL);
 			if (event == TRIGGER_EVENT_INSERT)
 				new_tuplestore = transition_capture->tcs_insert_tuplestore;
 			else
@@ -5306,7 +5322,8 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		if (trigdesc == NULL ||
 			(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
 			(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+			(event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup == NULL)))
 			return;
 	}
 
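The transition-capture behavior the trigger.c hunks above describe can be illustrated outside of PostgreSQL. Below is a minimal Python sketch, not PostgreSQL code: the event names, dict "tuples", and list "tuplestores" are illustrative stand-ins for the real `AfterTriggerSaveEvent` machinery. It shows how an UPDATE converted into a DELETE+INSERT fills the OLD and NEW transition stores in two separate calls, each carrying only one side.

```python
# Hedged sketch of the capture rules added above: for an UPDATE implemented
# as DELETE+INSERT across partitions, the function is invoked twice, once
# with only the old tuple and once with only the new tuple.

DELETE, INSERT, UPDATE = "DELETE", "INSERT", "UPDATE"

def save_event(event, oldtup, newtup, old_store, new_store):
    # Mirrors the new Asserts: DELETE events must carry an old tuple,
    # INSERT events must carry a new tuple.
    assert not (event == DELETE and oldtup is None)
    assert not (event == INSERT and newtup is None)

    # Mirrors the new "oldtup != NULL && ..." / "newtup != NULL && ..."
    # conditions: capture only the side that is present.
    if oldtup is not None and event in (DELETE, UPDATE):
        old_store.append(oldtup)
    if newtup is not None and event in (INSERT, UPDATE):
        new_store.append(newtup)

old_rows, new_rows = [], []
# Row movement: one UPDATE, two calls on different partitions.
save_event(UPDATE, {"id": 1, "pkey": "a"}, None, old_rows, new_rows)
save_event(UPDATE, None, {"id": 1, "pkey": "b"}, old_rows, new_rows)
```

After both calls, the OLD store holds the pre-move row and the NEW store the post-move row, which is the invariant the patch relies on when the DELETE and INSERT happen on different tables.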
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 4b594d4..1508f72 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -65,6 +65,18 @@
 #include "utils/snapmgr.h"
 #include "utils/tqual.h"
 
+/*
+ * Entry of a temporary hash table. During UPDATE tuple routing, we want to
+ * know which of the leaf partitions are present in the UPDATE per-subplan
+ * resultRelInfo array (ModifyTableState->resultRelInfo[]). This hash table
+ * is searchable by the oids of the subplan result rels.
+ */
+typedef struct ResultRelOidsEntry
+{
+	Oid			rel_oid;
+	ResultRelInfo *resultRelInfo;
+} ResultRelOidsEntry;
+
 
 /* Hooks for plugins to get control in ExecutorStart/Run/Finish/End */
 ExecutorStart_hook_type ExecutorStart_hook = NULL;
@@ -104,19 +116,6 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
 									 int maxfieldlen);
 static void EvalPlanQualStart(EPQState *epqstate, EState *parentestate,
 				  Plan *planTree);
-static void ExecPartitionCheck(ResultRelInfo *resultRelInfo,
-				   TupleTableSlot *slot, EState *estate);
-
-/*
- * Note that GetUpdatedColumns() also exists in commands/trigger.c.  There does
- * not appear to be any good header to put it into, given the structures that
- * it uses, so we let them be duplicated.  Be sure to update both if one needs
- * to be changed, however.
- */
-#define GetInsertedColumns(relinfo, estate) \
-	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->insertedCols)
-#define GetUpdatedColumns(relinfo, estate) \
-	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)
 
 /* end of local decls */
 
@@ -1850,15 +1849,10 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
 /*
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  */
-static void
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
-	Relation	rel = resultRelInfo->ri_RelationDesc;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	Bitmapset  *modifiedCols;
-	Bitmapset  *insertedCols;
-	Bitmapset  *updatedCols;
 	ExprContext *econtext;
 
 	/*
@@ -1886,52 +1880,66 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
 	 */
-	if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+	return ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext);
+}
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
+
+	/* See the comments in ExecConstraints. */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
+									 gettext_noop("could not convert row type"));
+		if (map != NULL)
 		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
-										 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecSetSlotDescriptor(slot, tupdesc);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+			tuple = do_convert_tuple(tuple, map);
+			ExecSetSlotDescriptor(slot, tupdesc);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("new row for relation \"%s\" violates partition constraint",
-						RelationGetRelationName(orig_rel)),
-				 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+			 errmsg("new row for relation \"%s\" violates partition constraint",
+					RelationGetRelationName(orig_rel)),
+			 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if requested,
+ * checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1939,7 +1947,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2055,8 +2064,9 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 }
 
 
@@ -3242,34 +3252,40 @@ EvalPlanQualEnd(EPQState *epqstate)
  * ExecSetupPartitionTupleRouting - set up information needed during
  * tuple routing for partitioned tables
  *
+ * 'update_rri' contains the UPDATE per-subplan result rels.
+ * 'num_update_rri' is the number of UPDATE per-subplan result rels; for
+ *      INSERT, this is 0.
+ *
  * Output arguments:
  * 'pd' receives an array of PartitionDispatch objects with one entry for
  *		every partitioned table in the partition tree
- * 'partitions' receives an array of ResultRelInfo objects with one entry for
+ * 'partitions' receives an array of ResultRelInfo* objects with one entry for
  *		every leaf partition in the partition tree
- * 'tup_conv_maps' receives an array of TupleConversionMap objects with one
- *		entry for every leaf partition (required to convert input tuple based
- *		on the root table's rowtype to a leaf partition's rowtype after tuple
- *		routing is done)
+ * 'perleaf_parentchild_maps' receives an array of TupleConversionMap objects
+ *		with one entry for every leaf partition (required to convert input tuple
+ *		based on the root table's rowtype to a leaf partition's rowtype after
+ *		tuple routing is done)
  * 'partition_tuple_slot' receives a standalone TupleTableSlot to be used
  *		to manipulate any given leaf partition's rowtype after that partition
  *		is chosen by tuple-routing.
  * 'num_parted' receives the number of partitioned tables in the partition
  *		tree (= the number of entries in the 'pd' output array)
  * 'num_partitions' receives the number of leaf partitions in the partition
- *		tree (= the number of entries in the 'partitions' and 'tup_conv_maps'
- *		output arrays
+ *		tree (= the number of entries in the 'partitions' and
+ *		'perleaf_parentchild_maps' output arrays
  *
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  */
 void
 ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionDispatch **pd,
-							   ResultRelInfo **partitions,
-							   TupleConversionMap ***tup_conv_maps,
+							   ResultRelInfo ***partitions,
+							   TupleConversionMap ***perleaf_parentchild_maps,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions)
 {
@@ -3277,7 +3293,10 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
+	HTAB	   *result_rel_oids = NULL;
+	HASHCTL		ctl;
+	ResultRelOidsEntry *hash_entry;
+	ResultRelInfo *leaf_part_arr;
 
 	/*
 	 * Get the information about the partition tree after locking all the
@@ -3286,10 +3305,50 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
 	*pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
 	*num_partitions = list_length(leaf_parts);
-	*partitions = (ResultRelInfo *) palloc(*num_partitions *
-										   sizeof(ResultRelInfo));
-	*tup_conv_maps = (TupleConversionMap **) palloc0(*num_partitions *
-													 sizeof(TupleConversionMap *));
+	*partitions = (ResultRelInfo **) palloc(*num_partitions *
+											sizeof(ResultRelInfo *));
+	*perleaf_parentchild_maps = (TupleConversionMap **) palloc0(*num_partitions *
+																sizeof(TupleConversionMap *));
+
+	/*
+	 * For UPDATEs, if a leaf partition is already present in the per-subplan
+	 * result rels, we re-use it rather than initializing a new result rel. To
+	 * determine whether a given leaf partition already has one, we build a
+	 * hash table of the per-subplan result rels, keyed by oid.
+	 */
+	if (num_update_rri != 0)
+	{
+		ResultRelInfo	   *resultRelInfo;
+
+		memset(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(Oid);
+		ctl.entrysize = sizeof(ResultRelOidsEntry);
+		ctl.hcxt = CurrentMemoryContext;
+		result_rel_oids = hash_create("result_rel_oids temporary hash",
+								32, /* start small and extend */
+								&ctl,
+								HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+		resultRelInfo = update_rri;
+		for (i = 0; i < num_update_rri; i++, resultRelInfo++)
+		{
+			Oid reloid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+
+			hash_entry = hash_search(result_rel_oids, &reloid,
+									 HASH_ENTER, NULL);
+			hash_entry->resultRelInfo = resultRelInfo;
+		}
+	}
+	else
+	{
+		/*
+		 * For inserts, we need to create all new result rels, so avoid
+		 * repeated pallocs by allocating memory for all the result rels in
+		 * bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(*num_partitions *
+												  sizeof(ResultRelInfo));
+	}
 
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
@@ -3299,36 +3358,76 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	 */
 	*partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = *partitions;
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (num_update_rri != 0)
+		{
+			/*
+			 * If this leaf partition is already present in the per-subplan
+			 * resultRelInfos, re-use that resultRelInfo along with its
+			 * already-opened relation; otherwise create a new result rel.
+			 */
+			hash_entry = hash_search(result_rel_oids, &leaf_oid,
+									 HASH_FIND, NULL);
+			if (hash_entry != NULL)
+			{
+				leaf_part_rri = hash_entry->resultRelInfo;
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required for converting the tuple as per the root
+				 * partition's tuple descriptor; it was not set when the
+				 * update plans were generated.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = leaf_part_arr + i;
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in *partitions are eventually
-		 * closed by the caller.
+		 * If we didn't open the partition rel here, it means we haven't
+		 * initialized the result rel either.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above including the leaf
+			 * partitions. Note that each of the newly opened relations in
+			 * *partitions are eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri,
+							  partrel,
+							  resultRTindex,
+							  rel,
+							  estate->es_instrument);
+		}
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
 		 * Save a tuple conversion map to convert a tuple routed to this
 		 * partition from the parent's type to the partition's.
 		 */
-		(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
-													 gettext_noop("could not convert row type"));
-
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  resultRTindex,
-						  rel,
-						  estate->es_instrument);
+		(*perleaf_parentchild_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
+																gettext_noop("could not convert row type"));
 
 		/*
-		 * Verify result relation is a valid target for INSERT.
+		 * Verify result relation is a valid target for insert operation. Even
+		 * for updates, we are doing this for tuple-routing, so again, we need
+		 * to check the validity for insert operation.
 		 */
 		CheckValidResultRel(leaf_part_rri, CMD_INSERT);
 
@@ -3344,9 +3443,12 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		leaf_part_rri++;
+		(*partitions)[i] = leaf_part_rri;
 		i++;
 	}
+
+	if (result_rel_oids != NULL)
+		hash_destroy(result_rel_oids);
 }
 
 /*
@@ -3372,8 +3474,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 	result = get_partition_for_tuple(pd, slot, estate,
 									 &failed_at, &failed_slot);
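The reuse logic in the `ExecSetupPartitionTupleRouting` hunks above can be sketched concisely. This is a hedged Python illustration, not PostgreSQL code: the `ResultRel` class stands in for `ResultRelInfo`, and a plain dict stands in for the temporary dynahash table keyed by relation OID. It shows the core decision: reuse an already-open per-subplan result rel when the leaf partition matches, otherwise initialize a fresh one.

```python
# Hedged sketch: UPDATE subplans already carry result rels for some leaf
# partitions; tuple-routing setup reuses them (looked up by table OID)
# instead of building duplicates.

class ResultRel:
    def __init__(self, oid, reused=False):
        self.oid = oid
        self.reused = reused

def setup_routing(leaf_oids, update_rels):
    # Build the OID -> result-rel lookup, as the patch does with a
    # temporary hash table over the per-subplan result rels.
    by_oid = {r.oid: r for r in update_rels}
    partitions = []
    for oid in leaf_oids:
        rel = by_oid.get(oid)
        if rel is not None:
            rel.reused = True          # reuse the already-open result rel
        else:
            rel = ResultRel(oid)       # otherwise create a new one
        partitions.append(rel)
    return partitions

# Three leaf partitions; the UPDATE subplans only cover the second one.
parts = setup_routing([101, 102, 103], [ResultRel(102)])
```

This also explains why `*partitions` became an array of pointers in the patch: reused entries point into the existing subplan array, while new ones are allocated separately.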
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 5a75e02..6b8af46 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -401,7 +401,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -466,7 +466,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 49586a3..400612b 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -45,6 +45,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -53,7 +54,6 @@
 #include "utils/rel.h"
 #include "utils/tqual.h"
 
-
 static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 ResultRelInfo *resultRelInfo,
 					 ItemPointer conflictTid,
@@ -240,6 +240,36 @@ ExecCheckTIDVisible(EState *estate,
 	ReleaseBuffer(buffer);
 }
 
+/*
+ * ConvertPartitionTupleSlot -- convenience function for converting a tuple
+ * and storing it into the tuple slot provided through 'new_slot', which
+ * typically should be one of the dedicated partition tuple slots. Passes the
+ * partition tuple slot back through the output param 'p_old_slot'. If no
+ * mapping is present, keeps 'p_old_slot' unchanged.
+ *
+ * Returns the converted tuple.
+ */
+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate, TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot, TupleTableSlot **p_old_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor, as per converted tuple.
+	 */
+	*p_old_slot = new_slot;
+	Assert(new_slot != NULL);
+	ExecSetSlotDescriptor(new_slot, map->outdesc);
+	ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
 /* ----------------------------------------------------------------
  *		ExecInsert
  *
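The conversion that `ConvertPartitionTupleSlot` wraps can be pictured with a small example. This is a hedged Python sketch of what a tuple conversion map does (in the spirit of `convert_tuples_by_name()`/`do_convert_tuple()` in the hunks here); plain tuples of column names stand in for tuple descriptors, and there is no slot machinery.

```python
# Hedged sketch: a child partition may store the same columns in a
# different order than its parent, so routed tuples are converted by
# attribute name. A map of None means the descriptors already match.

def build_map(indesc, outdesc):
    """Return, for each output column, its index in the input descriptor,
    or None when no conversion is needed."""
    if indesc == outdesc:
        return None
    return [indesc.index(col) for col in outdesc]

def convert_tuple(tup, conv_map):
    if conv_map is None:
        return tup                     # no mapping: tuple passes through
    return tuple(tup[i] for i in conv_map)

# Parent declared columns (a, b); the leaf partition stores (b, a).
conv = build_map(("a", "b"), ("b", "a"))
routed = convert_tuple((1, 2), conv)
```

When the map is None the original slot is left untouched, which matches the function's documented behavior of keeping `p_old_slot` unchanged when no mapping is present.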
@@ -281,17 +311,50 @@ ExecInsert(ModifyTableState *mtstate,
 	if (mtstate->mt_partition_dispatch_info)
 	{
 		int			leaf_part_index;
-		TupleConversionMap *map;
+		ResultRelInfo *rootResultRelInfo;
+
+		/*
+		 * If the original operation is UPDATE, the root partition rel needs
+		 * to be fetched from mtstate->rootResultRelInfo.
+		 */
+		rootResultRelInfo = (mtstate->rootResultRelInfo ?
+							 mtstate->rootResultRelInfo : resultRelInfo);
+
+		/*
+		 * If the resultRelInfo is not the root partition (which happens for
+		 * UPDATE), we should convert the tuple into the root partition's
+		 * tuple descriptor, since ExecFindPartition() starts the search from
+		 * the root.  The tuple conversion map list is in the order of
+		 * mtstate->resultRelInfo[], so to retrieve the one for this
+		 * resultRel, we need to know the position of the resultRel in
+		 * mtstate->resultRelInfo[].  Note: we assume that if the
+		 * resultRelInfo does not belong to the subplans, then it already
+		 * matches the root tuple descriptor, although there is no known
+		 * scenario where this could happen.
+		 */
+		if (rootResultRelInfo != resultRelInfo &&
+			mtstate->mt_persubplan_childparent_maps != NULL &&
+			resultRelInfo >= mtstate->resultRelInfo &&
+			resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1)
+		{
+			int			map_index = resultRelInfo - mtstate->resultRelInfo;
+
+			tuple = ConvertPartitionTupleSlot(mtstate,
+											  mtstate->mt_persubplan_childparent_maps[map_index],
+											  tuple,
+											  mtstate->mt_rootpartition_tuple_slot,
+											  &slot);
+		}
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
 		 * ExecFindPartition() does not return and errors out instead.
 		 * Otherwise, the returned value is to be used as an index into arrays
-		 * mt_partitions[] and mt_partition_tupconv_maps[] that will get us
+		 * mt_partitions[] and mt_perleaf_parentchild_maps[] that will get us
 		 * the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
-		leaf_part_index = ExecFindPartition(resultRelInfo,
+		leaf_part_index = ExecFindPartition(rootResultRelInfo,
 											mtstate->mt_partition_dispatch_info,
 											slot,
 											estate);
@@ -303,7 +366,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * the selected partition.
 		 */
 		saved_resultRelInfo = resultRelInfo;
-		resultRelInfo = mtstate->mt_partitions + leaf_part_index;
+		resultRelInfo = mtstate->mt_partitions[leaf_part_index];
 
 		/* We do not yet have a way to insert into a foreign partition */
 		if (resultRelInfo->ri_FdwRoutine)
@@ -331,7 +394,7 @@ ExecInsert(ModifyTableState *mtstate,
 				 */
 				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
 				mtstate->mt_transition_capture->tcs_map =
-					mtstate->mt_transition_tupconv_maps[leaf_part_index];
+					mtstate->mt_perleaf_childparent_maps[leaf_part_index];
 			}
 			else
 			{
@@ -348,23 +411,11 @@ ExecInsert(ModifyTableState *mtstate,
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = mtstate->mt_partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = mtstate->mt_partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(mtstate,
+										  mtstate->mt_perleaf_parentchild_maps[leaf_part_index],
+										  tuple,
+										  mtstate->mt_partition_tuple_slot,
+										  &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -482,7 +533,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 		/* Check the constraints of the tuple */
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -622,6 +673,19 @@ ExecInsert(ModifyTableState *mtstate,
 	ExecARInsertTriggers(estate, resultRelInfo, tuple, recheckIndexes,
 						 mtstate->mt_transition_capture);
 
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition NEW TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
+		ExecARUpdateTriggers(estate, resultRelInfo, NULL,
+							 NULL,
+							 tuple,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
 	list_free(recheckIndexes);
 
 	/*
@@ -674,6 +738,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool *delete_skipped,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -682,6 +748,9 @@ ExecDelete(ModifyTableState *mtstate,
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
 
+	if (delete_skipped)
+		*delete_skipped = true;
+
 	/*
 	 * get information on the (current) result relation
 	 */
@@ -845,12 +914,29 @@ ldelete:;
 	if (canSetTag)
 		(estate->es_processed)++;
 
+	/* The delete has actually happened, so inform that to the caller */
+	if (delete_skipped)
+		*delete_skipped = false;
+
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
 						 mtstate->mt_transition_capture);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition OLD TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
+		ExecARUpdateTriggers(estate, resultRelInfo, tupleid,
+							 oldtuple,
+							 NULL,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -943,6 +1029,8 @@ ExecUpdate(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	TupleConversionMap *saved_tcs_map = NULL;
+
 
 	/*
 	 * abort the operation if not running transactions
@@ -1039,12 +1127,82 @@ lreplace:;
 								 resultRelInfo, slot, estate);
 
 		/*
+		 * If a partition check fails, try to move the row into the right
+		 * partition.
+		 */
+		if (resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate))
+		{
+			bool		delete_skipped;
+			TupleTableSlot *ret_slot;
+
+			/*
+			 * When an UPDATE is run directly on a leaf partition, we do not
+			 * have partition tuple routing set up. In that case, fail with
+			 * a partition constraint violation error.
+			 */
+			if (mtstate->mt_partition_dispatch_info == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for DELETE. We want to return rows
+			 * from INSERT.
+			 */
+			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
+					   &delete_skipped, false, false);
+
+			/*
+			 * If for some reason the DELETE didn't happen (e.g. a trigger
+			 * prevented it, or the row was already deleted by this command,
+			 * or it was concurrently deleted by another transaction), then
+			 * we should skip the INSERT as well; otherwise, we would
+			 * effectively insert one new row.
+			 *
+			 * For a normal UPDATE, the case where the tuple has been the
+			 * subject of a concurrent UPDATE or DELETE would be handled by
+			 * the EvalPlanQual machinery, but for an UPDATE that we've
+			 * translated into a DELETE from this partition and an INSERT into
+			 * some other partition, that's not available, because CTID chains
+			 * can't span relation boundaries.  We mimic the semantics to a
+			 * limited extent by skipping the INSERT if the DELETE fails to
+			 * find a tuple. This ensures that two concurrent attempts to
+			 * UPDATE the same tuple at the same time can't turn one tuple
+			 * into two, and that an UPDATE of a just-deleted tuple can't
+			 * resurrect it.
+			 */
+			if (delete_skipped)
+				return NULL;
+
+			if (mtstate->mt_transition_capture)
+				saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
+
+			ret_slot = ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+
+			if (mtstate->mt_transition_capture)
+			{
+				/*
+				 * Revert to the transition capture map created for UPDATE;
+				 * otherwise the next UPDATE will incorrectly use the one
+				 * created for INSERT.
+				 */
+				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+				mtstate->mt_transition_capture->tcs_map = saved_tcs_map;
+			}
+			return ret_slot;
+		}
+
+		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We have already checked partition constraints above, so skip them
+		 * below.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1463,6 +1621,45 @@ fireASTriggers(ModifyTableState *node)
 }
 
 /*
+ * Set up per subplan tuple conversion map from child partition to root
+ * partitioned table. The map is needed for collecting transition tuples for
+ * AFTER triggers, and for UPDATE row movement.
+ */
+static void
+ExecSetupPerSubplanChildParentMap(ModifyTableState *mtstate)
+{
+	TupleConversionMap **tup_conv_maps;
+	TupleDesc	outdesc;
+	ResultRelInfo *resultRelInfo;
+	ResultRelInfo *rootRelInfo;
+	int			nplans = mtstate->mt_nplans;
+	int			i;
+
+	Assert(mtstate->operation != CMD_INSERT);
+
+	if (mtstate->mt_persubplan_childparent_maps != NULL)
+		return;
+
+	rootRelInfo = getASTriggerResultRelInfo(mtstate);
+
+	mtstate->mt_persubplan_childparent_maps =
+		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) * nplans);
+
+	/* Get tuple descriptor of the root partition. */
+	outdesc = RelationGetDescr(rootRelInfo->ri_RelationDesc);
+
+	resultRelInfo = mtstate->resultRelInfo;
+	tup_conv_maps = mtstate->mt_persubplan_childparent_maps;
+	for (i = 0; i < nplans; i++)
+	{
+		TupleDesc	indesc = RelationGetDescr(resultRelInfo[i].ri_RelationDesc);
+
+		tup_conv_maps[i] = convert_tuples_by_name(indesc, outdesc,
+												  gettext_noop("could not convert row type"));
+	}
+}
+
+/*
  * Set up the state needed for collecting transition tuples for AFTER
  * triggers.
  */
@@ -1470,63 +1667,115 @@ static void
 ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 {
 	ResultRelInfo *targetRelInfo = getASTriggerResultRelInfo(mtstate);
+	ResultRelInfo **resultRelInfos;
+	int			numResultRelInfos;
+	int			update_rri_index = -1;
+	ResultRelInfo *update_rri = mtstate->resultRelInfo;
+	Oid			cur_reloid = InvalidOid;
 	int			i;
 
 	/* Check for transition tables on the directly targeted relation. */
 	mtstate->mt_transition_capture =
 		MakeTransitionCaptureState(targetRelInfo->ri_TrigDesc);
 
+	if (mtstate->mt_transition_capture == NULL)
+		return;
+
 	/*
-	 * If we found that we need to collect transition tuples then we may also
+	 * Now that we know that we need to collect transition tuples, we may also
 	 * need tuple conversion maps for any children that have TupleDescs that
 	 * aren't compatible with the tuplestores.
 	 */
-	if (mtstate->mt_transition_capture != NULL)
+
+	/* Make sure per-subplan mapping is there. */
+	if (mtstate->operation != CMD_INSERT)
+		ExecSetupPerSubplanChildParentMap(mtstate);
+
+	/*
+	 * Install the conversion map for the first plan for UPDATE and DELETE
+	 * operations.  It will be advanced each time we switch to the next plan.
+	 * (INSERT operations set it every time.)
+	 */
+	if (mtstate->mt_persubplan_childparent_maps)
+	{
+		mtstate->mt_transition_capture->tcs_map =
+			mtstate->mt_persubplan_childparent_maps[0];
+	}
+
+	/* If no tuple routing, return without setting up per-leaf-partition map */
+	if (mtstate->mt_partition_dispatch_info == NULL)
+		return;
+
+	numResultRelInfos = mtstate->mt_num_partitions;
+	resultRelInfos = mtstate->mt_partitions;
+
+	/*
+	 * Build array of conversion maps from each child's TupleDesc to the one
+	 * used in the tuplestore.  The map pointers may be NULL when no
+	 * conversion is necessary, which is hopefully a common case for
+	 * partitions.
+	 */
+	mtstate->mt_perleaf_childparent_maps = (TupleConversionMap **)
+		palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+
+	/* For Inserts, just create all new map elements. */
+	if (mtstate->operation == CMD_INSERT)
 	{
-		ResultRelInfo *resultRelInfos;
-		int			numResultRelInfos;
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_perleaf_childparent_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
+									   RelationGetDescr(targetRelInfo->ri_RelationDesc),
+									   gettext_noop("could not convert row type"));
+		}
+		return;
+	}
 
-		/* Find the set of partitions so that we can find their TupleDescs. */
-		if (mtstate->mt_partition_dispatch_info != NULL)
+	/*
+	 * For UPDATEs, however, we can share the per-subplan maps with the
+	 * per-leaf maps.
+	 */
+	update_rri_index = 0;
+	update_rri = mtstate->resultRelInfo;
+	if (mtstate->mt_nplans > 0)
+		cur_reloid = RelationGetRelid(update_rri[0].ri_RelationDesc);
+
+	for (i = 0; i < numResultRelInfos; ++i)
+	{
+		ResultRelInfo *resultRelInfo = mtstate->mt_partitions[i];
+
+		/* Is this leaf partition present in the UPDATE resultrels? */
+		if (cur_reloid == RelationGetRelid(resultRelInfo->ri_RelationDesc))
 		{
+			Assert(update_rri_index < mtstate->mt_nplans);
+
+			mtstate->mt_perleaf_childparent_maps[i] =
+				mtstate->mt_persubplan_childparent_maps[update_rri_index];
+			update_rri_index++;
+
 			/*
-			 * For INSERT via partitioned table, so we need TupleDescs based
-			 * on the partition routing table.
+			 * If this was the last UPDATE resultrel, indicate that by
+			 * invalidating the cur_reloid.
 			 */
-			resultRelInfos = mtstate->mt_partitions;
-			numResultRelInfos = mtstate->mt_num_partitions;
+			if (update_rri_index == mtstate->mt_nplans)
+				cur_reloid = InvalidOid;
+			else
+				cur_reloid = RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc);
 		}
 		else
 		{
-			/* Otherwise we need the ResultRelInfo for each subplan. */
-			resultRelInfos = mtstate->resultRelInfo;
-			numResultRelInfos = mtstate->mt_nplans;
-		}
-
-		/*
-		 * Build array of conversion maps from each child's TupleDesc to the
-		 * one used in the tuplestore.  The map pointers may be NULL when no
-		 * conversion is necessary, which is hopefully a common case for
-		 * partitions.
-		 */
-		mtstate->mt_transition_tupconv_maps = (TupleConversionMap **)
-			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
-		for (i = 0; i < numResultRelInfos; ++i)
-		{
-			mtstate->mt_transition_tupconv_maps[i] =
-				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
+			mtstate->mt_perleaf_childparent_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfo->ri_RelationDesc),
 									   RelationGetDescr(targetRelInfo->ri_RelationDesc),
 									   gettext_noop("could not convert row type"));
 		}
-
-		/*
-		 * Install the conversion map for the first plan for UPDATE and DELETE
-		 * operations.  It will be advanced each time we switch to the next
-		 * plan.  (INSERT operations set it every time.)
-		 */
-		mtstate->mt_transition_capture->tcs_map =
-			mtstate->mt_transition_tupconv_maps[0];
 	}
+
+	/*
+	 * We should have found all the per-subplan reloids in the leaf
+	 * partitions.
+	 */
+	Assert(update_rri_index == mtstate->mt_nplans);
 }
 
 /* ----------------------------------------------------------------
@@ -1632,9 +1881,9 @@ ExecModifyTable(PlanState *pstate)
 				if (node->mt_transition_capture != NULL)
 				{
 					/* Prepare to convert transition tuples from this child. */
-					Assert(node->mt_transition_tupconv_maps != NULL);
+					Assert(node->mt_persubplan_childparent_maps != NULL);
 					node->mt_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						node->mt_persubplan_childparent_maps[node->mt_whichplan];
 				}
 				continue;
 			}
@@ -1750,7 +1999,8 @@ ExecModifyTable(PlanState *pstate)
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1795,9 +2045,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ResultRelInfo *resultRelInfo;
 	TupleDesc	tupDesc;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	bool		update_tuple_routing_needed = node->part_cols_updated;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -1870,6 +2123,15 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(resultRelInfo, mtstate->mt_onconflict != ONCONFLICT_NONE);
 
+		/*
+		 * If this is an UPDATE and a BEFORE UPDATE trigger is present, we may
+		 * need to do update tuple routing.
+		 */
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+			operation == CMD_UPDATE)
+			update_tuple_routing_needed = true;
+
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
 		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
@@ -1907,33 +2169,63 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/* Decide whether we need to perform update tuple routing. */
+	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		update_tuple_routing_needed = false;
+
+	/*
+	 * Build state for tuple routing if it's an INSERT, or an UPDATE of the
+	 * partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || update_tuple_routing_needed))
 	{
 		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo *partitions;
-		TupleConversionMap **partition_tupconv_maps;
+		ResultRelInfo **partitions;
+		TupleConversionMap **perleaf_parentchild_maps;
 		TupleTableSlot *partition_tuple_slot;
 		int			num_parted,
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(rel,
+									   (operation == CMD_UPDATE ?
+										mtstate->resultRelInfo : NULL),
+									   (operation == CMD_UPDATE ? nplans : 0),
 									   node->nominalRelation,
 									   estate,
 									   &partition_dispatch_info,
 									   &partitions,
-									   &partition_tupconv_maps,
+									   &perleaf_parentchild_maps,
 									   &partition_tuple_slot,
 									   &num_parted, &num_partitions);
 		mtstate->mt_partition_dispatch_info = partition_dispatch_info;
 		mtstate->mt_num_dispatch = num_parted;
 		mtstate->mt_partitions = partitions;
 		mtstate->mt_num_partitions = num_partitions;
-		mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
+		mtstate->mt_perleaf_parentchild_maps = perleaf_parentchild_maps;
 		mtstate->mt_partition_tuple_slot = partition_tuple_slot;
+		mtstate->mt_rootpartition_tuple_slot = MakeTupleTableSlot();
+
+		/*
+		 * The following are needed as reference objects for mapping partition
+		 * attno's in expressions such as WCO and RETURNING.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	}
 
+	/*
+	 * Construct a mapping from each of the per-subplan resultRelInfo attnos
+	 * to the root attnos. This is required when, during UPDATE row movement,
+	 * the tuple descriptor of a source partition does not match the root
+	 * partition's descriptor. In such a case, we need to convert tuples to
+	 * the root partition's tuple descriptor, because the search for the
+	 * destination partition starts from the root. Skip this setup if this is
+	 * not a partition key update.
+	 */
+	if (update_tuple_routing_needed)
+		ExecSetupPerSubplanChildParentMap(mtstate);
+
 	/* Build state for collecting transition tuples */
 	ExecSetupTransitionCaptureState(mtstate, estate);
 
@@ -1967,50 +2259,54 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE row
+	 * movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *firstWco;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition. Whereas for UPDATE, there are as many WCOs as there are
+		 * plans. So in either case, use the WCO expression of the first
+		 * resultRelInfo as a reference to calculate attno's for the WCO
+		 * expression of each of the partitions. We make a copy of the WCO
+		 * qual for each partition. Note that, if there are SubPlans in there,
+		 * they all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
-			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
-		resultRelInfo = mtstate->mt_partitions;
+		Assert(update_tuple_routing_needed ||
+			   (operation == CMD_INSERT &&
+				list_length(node->withCheckOptionLists) == 1 &&
+				mtstate->mt_nplans == 1));
+
+		firstWco = linitial(node->withCheckOptionLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-			List	   *mapped_wcoList;
+			Relation	partrel;
+			List	   *mappedWco;
 			List	   *wcoExprs = NIL;
 			ListCell   *ll;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel, NULL);
-			foreach(ll, mapped_wcoList)
+			resultRelInfo = mtstate->mt_partitions[i];
+
+			partrel = resultRelInfo->ri_RelationDesc;
+			mappedWco = map_partition_varattnos(firstWco,
+												firstVarno,
+												partrel, firstResultRel,
+												NULL);
+			foreach(ll, mappedWco)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
 
-			resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
+			resultRelInfo->ri_WithCheckOptions = mappedWco;
 			resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
-			resultRelInfo++;
 		}
 	}
 
@@ -2021,7 +2317,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -2058,20 +2354,26 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		resultRelInfo = mtstate->mt_partitions;
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
+			Relation	partrel;
 			List	   *rlist;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel, NULL);
+			resultRelInfo = mtstate->mt_partitions[i];
+			partrel = resultRelInfo->ri_RelationDesc;
+
+			/*
+			 * Use the returning expression of the first resultRelInfo as a
+			 * reference to calculate attno's for the returning expression of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList,
+											firstVarno,
+											partrel, firstResultRel, NULL);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
@@ -2317,6 +2619,7 @@ void
 ExecEndModifyTable(ModifyTableState *node)
 {
 	int			i;
+	CmdType		operation = node->operation;
 
 	/*
 	 * Free transition tables, unless this query is being run in
@@ -2359,13 +2662,25 @@ ExecEndModifyTable(ModifyTableState *node)
 	}
 	for (i = 0; i < node->mt_num_partitions; i++)
 	{
-		ResultRelInfo *resultRelInfo = node->mt_partitions + i;
+		ResultRelInfo *resultRelInfo = node->mt_partitions[i];
+
+		/*
+		 * If this result rel is one of the subplan result rels, let
+		 * ExecEndPlan() close it. For INSERTs, this does not apply because
+		 * all leaf partition result rels are newly allocated anyway.
+		 */
+		if (operation == CMD_UPDATE &&
+			resultRelInfo >= node->resultRelInfo &&
+			resultRelInfo < node->resultRelInfo + node->mt_nplans)
+			continue;
 
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
 
-	/* Release the standalone partition tuple descriptor, if any */
+	/* Release the standalone partition tuple descriptors, if any */
+	if (node->mt_rootpartition_tuple_slot)
+		ExecDropSingleTupleTableSlot(node->mt_rootpartition_tuple_slot);
 	if (node->mt_partition_tuple_slot)
 		ExecDropSingleTupleTableSlot(node->mt_partition_tuple_slot);
 
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index f1bed14..2d86593 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -204,6 +204,7 @@ _copyModifyTable(const ModifyTable *from)
 	COPY_SCALAR_FIELD(canSetTag);
 	COPY_SCALAR_FIELD(nominalRelation);
 	COPY_NODE_FIELD(partitioned_rels);
+	COPY_SCALAR_FIELD(part_cols_updated);
 	COPY_NODE_FIELD(resultRelations);
 	COPY_SCALAR_FIELD(resultRelIndex);
 	COPY_SCALAR_FIELD(rootResultRelIndex);
@@ -2260,6 +2261,7 @@ _copyPartitionedChildRelInfo(const PartitionedChildRelInfo *from)
 
 	COPY_SCALAR_FIELD(parent_relid);
 	COPY_NODE_FIELD(child_rels);
+	COPY_BITMAPSET_FIELD(all_part_cols);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 8b56b91..9428c2c 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -909,6 +909,7 @@ _equalPartitionedChildRelInfo(const PartitionedChildRelInfo *a, const Partitione
 {
 	COMPARE_SCALAR_FIELD(parent_relid);
 	COMPARE_NODE_FIELD(child_rels);
+	COMPARE_BITMAPSET_FIELD(all_part_cols);
 
 	return true;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index b83d919..2492cb8 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -367,6 +367,7 @@ _outModifyTable(StringInfo str, const ModifyTable *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_INT_FIELD(resultRelIndex);
 	WRITE_INT_FIELD(rootResultRelIndex);
@@ -2096,6 +2097,7 @@ _outModifyTablePath(StringInfo str, const ModifyTablePath *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_NODE_FIELD(subroots);
@@ -2518,6 +2520,7 @@ _outPartitionedChildRelInfo(StringInfo str, const PartitionedChildRelInfo *node)
 
 	WRITE_UINT_FIELD(parent_relid);
 	WRITE_NODE_FIELD(child_rels);
+	WRITE_BITMAPSET_FIELD(all_part_cols);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index fbf8330..0b1c70e 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1562,6 +1562,7 @@ _readModifyTable(void)
 	READ_BOOL_FIELD(canSetTag);
 	READ_UINT_FIELD(nominalRelation);
 	READ_NODE_FIELD(partitioned_rels);
+	READ_BOOL_FIELD(part_cols_updated);
 	READ_NODE_FIELD(resultRelations);
 	READ_INT_FIELD(resultRelIndex);
 	READ_INT_FIELD(rootResultRelIndex);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 2d7e1d8..8c08d50 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1291,7 +1291,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	rte = planner_rt_fetch(rel->relid, root);
 	if (rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		partitioned_rels = get_partitioned_child_rels(root, rel->relid);
+		partitioned_rels = get_partitioned_child_rels(root, rel->relid, NULL);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
 	}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 2821662..85e3126 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -277,6 +277,7 @@ static ProjectSet *make_project_set(List *tlist, Plan *subplan);
 static ModifyTable *make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
@@ -2361,6 +2362,7 @@ create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 							best_path->canSetTag,
 							best_path->nominalRelation,
 							best_path->partitioned_rels,
+							best_path->part_cols_updated,
 							best_path->resultRelations,
 							subplans,
 							best_path->withCheckOptionLists,
@@ -6405,6 +6407,7 @@ static ModifyTable *
 make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam)
@@ -6431,6 +6434,7 @@ make_modifytable(PlannerInfo *root,
 	node->canSetTag = canSetTag;
 	node->nominalRelation = nominalRelation;
 	node->partitioned_rels = partitioned_rels;
+	node->part_cols_updated = part_cols_updated;
 	node->resultRelations = resultRelations;
 	node->resultRelIndex = -1;	/* will be set correctly in setrefs.c */
 	node->rootResultRelIndex = -1;	/* will be set correctly in setrefs.c */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 6b79b3a..68e0302 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1056,6 +1056,7 @@ inheritance_planner(PlannerInfo *root)
 	Index		rti;
 	RangeTblEntry *parent_rte;
 	List	   *partitioned_rels = NIL;
+	bool		part_cols_updated = false;
 
 	Assert(parse->commandType != CMD_INSERT);
 
@@ -1370,9 +1371,15 @@ inheritance_planner(PlannerInfo *root)
 
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		partitioned_rels = get_partitioned_child_rels(root, parentRTindex);
+		Bitmapset  *all_part_cols = NULL;
+
+		partitioned_rels = get_partitioned_child_rels(root, parentRTindex,
+													  &all_part_cols);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
+
+		if (bms_overlap(all_part_cols, parent_rte->updatedCols))
+			part_cols_updated = true;
 	}
 
 	/* Result path must go into outer query's FINAL upperrel */
@@ -1429,6 +1436,7 @@ inheritance_planner(PlannerInfo *root)
 									 parse->canSetTag,
 									 nominalRelation,
 									 partitioned_rels,
+									 part_cols_updated,
 									 resultRelations,
 									 subpaths,
 									 subroots,
@@ -2046,6 +2054,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 										parse->canSetTag,
 										parse->resultRelation,
 										NIL,
+										false,
 										list_make1_int(parse->resultRelation),
 										list_make1(path),
 										list_make1(root),
@@ -6076,10 +6085,15 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
  *		Returns a list of the RT indexes of the partitioned child relations
  *		with rti as the root parent RT index.
  *
+ * If all_part_cols_p is non-NULL, *all_part_cols_p is set to a bitmapset
+ * of all partitioning columns used by the partitioned table or any
+ * descendant.
+ *
  * Note: Only call this function on RTEs known to be partitioned tables.
  */
 List *
-get_partitioned_child_rels(PlannerInfo *root, Index rti)
+get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   Bitmapset **all_part_cols_p)
 {
 	List	   *result = NIL;
 	ListCell   *l;
@@ -6091,6 +6105,8 @@ get_partitioned_child_rels(PlannerInfo *root, Index rti)
 		if (pc->parent_relid == rti)
 		{
 			result = pc->child_rels;
+			if (all_part_cols_p)
+				*all_part_cols_p = pc->all_part_cols;
 			break;
 		}
 	}
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index ccf2145..fc7c597 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -107,12 +107,14 @@ static void expand_partitioned_rtentry(PlannerInfo *root,
 						   PlanRowMark *parentrc, PartitionDesc partdesc,
 						   LOCKMODE lockmode,
 						   bool *has_child, List **appinfos,
+						   Bitmapset **all_part_cols,
 						   List **partitioned_child_rels);
 static void expand_single_inheritance_child(PlannerInfo *root,
 								RangeTblEntry *parentrte,
 								Index parentRTindex, Relation parentrel,
 								PlanRowMark *parentrc, Relation childrel,
 								bool *has_child, List **appinfos,
+								Bitmapset **all_part_cols,
 								List **partitioned_child_rels);
 static void make_inh_translation_list(Relation oldrelation,
 						  Relation newrelation,
@@ -1397,6 +1399,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	bool		has_child;
 	PartitionedChildRelInfo *pcinfo;
 	List	   *partitioned_child_rels = NIL;
+	Bitmapset  *all_part_cols = NULL;
 
 	/* Does RT entry allow inheritance? */
 	if (!rte->inh)
@@ -1479,11 +1482,13 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 		expand_single_inheritance_child(root, rte, rti, oldrelation, oldrc,
 										oldrelation,
 										&has_child, &appinfos,
+										&all_part_cols,
 										&partitioned_child_rels);
 		expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
 									  RelationGetPartitionDesc(oldrelation),
 									  lockmode,
 									  &has_child, &appinfos,
+									  &all_part_cols,
 									  &partitioned_child_rels);
 	}
 	else
@@ -1519,6 +1524,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 			expand_single_inheritance_child(root, rte, rti, oldrelation, oldrc,
 											newrelation,
 											&has_child, &appinfos,
+											&all_part_cols,
 											&partitioned_child_rels);
 
 			/* Close child relations, but keep locks */
@@ -1558,6 +1564,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 		Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
 		pcinfo->parent_relid = rti;
 		pcinfo->child_rels = partitioned_child_rels;
+		pcinfo->all_part_cols = all_part_cols;
 		root->pcinfo_list = lappend(root->pcinfo_list, pcinfo);
 	}
 
@@ -1571,6 +1578,7 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 						   PlanRowMark *parentrc, PartitionDesc partdesc,
 						   LOCKMODE lockmode,
 						   bool *has_child, List **appinfos,
+						   Bitmapset **all_part_cols,
 						   List **partitioned_child_rels)
 {
 	int			i;
@@ -1595,6 +1603,7 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 		expand_single_inheritance_child(root, parentrte, parentRTindex,
 										parentrel, parentrc, childrel,
 										has_child, appinfos,
+										all_part_cols,
 										partitioned_child_rels);
 
 		/* If this child is itself partitioned, recurse */
@@ -1604,6 +1613,7 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 										  RelationGetPartitionDesc(childrel),
 										  lockmode,
 										  has_child, appinfos,
+										  all_part_cols,
 										  partitioned_child_rels);
 
 		/* Close child relation, but keep locks */
@@ -1625,6 +1635,7 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 								Index parentRTindex, Relation parentrel,
 								PlanRowMark *parentrc, Relation childrel,
 								bool *has_child, List **appinfos,
+								Bitmapset **all_part_cols,
 								List **partitioned_child_rels)
 {
 	Query	   *parse = root->parse;
@@ -1695,8 +1706,11 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 		}
 	}
 	else
+	{
 		*partitioned_child_rels = lappend_int(*partitioned_child_rels,
 											  childRTindex);
+		pull_child_partition_columns(all_part_cols, childrel, parentrel);
+	}
 
 	/*
 	 * Build a PlanRowMark if parent is marked FOR UPDATE/SHARE.
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 26567cb..326c858 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3162,6 +3162,8 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partitioned_rels' is an integer list of RT indexes of non-leaf tables in
  *		the partition tree, if this is an UPDATE/DELETE to a partitioned table.
  *		Otherwise NIL.
+ * 'part_cols_updated' is true if any partitioning columns are being updated,
+ *		either in the named relation or in a descendant partitioned table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
  * 'subpaths' is a list of Path(s) producing source data (one per rel)
  * 'subroots' is a list of PlannerInfo structs (one per rel)
@@ -3175,6 +3177,7 @@ ModifyTablePath *
 create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
@@ -3242,6 +3245,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->canSetTag = canSetTag;
 	pathnode->nominalRelation = nominalRelation;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
+	pathnode->part_cols_updated = part_cols_updated;
 	pathnode->resultRelations = resultRelations;
 	pathnode->subpaths = subpaths;
 	pathnode->subroots = subroots;
diff --git a/src/backend/rewrite/rewriteManip.c b/src/backend/rewrite/rewriteManip.c
index 5c17213..58e98c0 100644
--- a/src/backend/rewrite/rewriteManip.c
+++ b/src/backend/rewrite/rewriteManip.c
@@ -1224,6 +1224,7 @@ typedef struct
 	/* Target type when converting whole-row vars */
 	Oid			to_rowtype;
 	bool	   *found_whole_row;	/* output flag */
+	bool		coerced_var;	/* var is under ConvertRowtypeExpr */
 } map_variable_attnos_context;
 
 static Node *
@@ -1267,22 +1268,29 @@ map_variable_attnos_mutator(Node *node,
 					/* Don't convert unless necessary. */
 					if (context->to_rowtype != var->vartype)
 					{
-						ConvertRowtypeExpr *r;
-
 						/* Var itself is converted to the requested type. */
 						newvar->vartype = context->to_rowtype;
 
 						/*
-						 * And a conversion node on top to convert back to the
-						 * original type.
+						 * If this var is already under a ConvertRowtypeExpr,
+						 * we don't have to add another one.
 						 */
-						r = makeNode(ConvertRowtypeExpr);
-						r->arg = (Expr *) newvar;
-						r->resulttype = var->vartype;
-						r->convertformat = COERCE_IMPLICIT_CAST;
-						r->location = -1;
-
-						return (Node *) r;
+						if (!context->coerced_var)
+						{
+							ConvertRowtypeExpr *r;
+
+							/*
+							 * And a conversion node on top to convert back to
+							 * the original type.
+							 */
+							r = makeNode(ConvertRowtypeExpr);
+							r->arg = (Expr *) newvar;
+							r->resulttype = var->vartype;
+							r->convertformat = COERCE_IMPLICIT_CAST;
+							r->location = -1;
+
+							return (Node *) r;
+						}
 					}
 				}
 			}
@@ -1290,6 +1298,28 @@ map_variable_attnos_mutator(Node *node,
 		}
 		/* otherwise fall through to copy the var normally */
 	}
+	else if (IsA(node, ConvertRowtypeExpr))
+	{
+		ConvertRowtypeExpr *r = (ConvertRowtypeExpr *) node;
+
+		/*
+		 * If this is coercing a var (which is typical), convert only the var,
+		 * instead of adding another ConvertRowtypeExpr on top of it.
+		 */
+		if (IsA(r->arg, Var))
+		{
+			ConvertRowtypeExpr *newnode;
+
+			newnode = (ConvertRowtypeExpr *) palloc(sizeof(ConvertRowtypeExpr));
+			*newnode = *r;
+			context->coerced_var = true;
+			newnode->arg = (Expr *) map_variable_attnos_mutator((Node *) r->arg, context);
+			context->coerced_var = false;
+
+			return (Node *) newnode;
+		}
+		/* Else fall through to the expression tree mutator */
+	}
 	else if (IsA(node, Query))
 	{
 		/* Recurse into RTE subquery or not-yet-planned sublink subquery */
@@ -1321,6 +1351,7 @@ map_variable_attnos(Node *node,
 	context.map_length = map_length;
 	context.to_rowtype = to_rowtype;
 	context.found_whole_row = found_whole_row;
+	context.coerced_var = false;
 
 	*found_whole_row = false;
 
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 454a940..9b222b6 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -68,6 +68,12 @@ typedef struct PartitionDispatchData
 	int		   *indexes;
 } PartitionDispatchData;
 
+typedef struct PartitionWalker
+{
+	List	   *rels_list;
+	ListCell   *cur_cell;
+} PartitionWalker;
+
 typedef struct PartitionDispatchData *PartitionDispatch;
 
 extern void RelationBuildPartitionDesc(Relation relation);
@@ -80,12 +86,16 @@ extern void check_new_partition_bound(char *relname, Relation parent,
 extern Oid	get_partition_parent(Oid relid);
 extern List *get_qual_from_partbound(Relation rel, Relation parent,
 						PartitionBoundSpec *spec);
-extern List *map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+extern List *map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
 
+extern void partition_walker_init(PartitionWalker *walker, Relation rel);
+extern Relation partition_walker_next(PartitionWalker *walker,
+									  Relation *parent);
+
 /* For tuple routing */
 extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
 								 int *num_parted, List **leaf_part_oids);
@@ -99,6 +109,9 @@ extern int get_partition_for_tuple(PartitionDispatch *pd,
 						EState *estate,
 						PartitionDispatchData **failed_at,
 						TupleTableSlot **failed_slot);
+extern void pull_child_partition_columns(Bitmapset **bitmapset,
+							 Relation rel,
+							 Relation parent);
 extern Oid	get_default_oid_from_partdesc(PartitionDesc partdesc);
 extern Oid	get_default_partition_oid(Oid parentId);
 extern void update_default_partition_oid(Oid parentId, Oid defaultPartId);
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 7708818..8e2bf5f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,7 +187,10 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
@@ -207,10 +210,12 @@ extern void EvalPlanQualSetTuple(EPQState *epqstate, Index rti,
 					 HeapTuple tuple);
 extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
 extern void ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionDispatch **pd,
-							   ResultRelInfo **partitions,
+							   ResultRelInfo ***partitions,
 							   TupleConversionMap ***tup_conv_maps,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions);
@@ -218,6 +223,8 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot, EState *estate);
 
 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
 extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 90a60ab..3034b01 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -511,6 +511,11 @@ typedef struct EState
 	struct dsa_area *es_query_dsa;
 } EState;
 
+/* For a given result relation, get its columns being inserted/updated. */
+#define GetInsertedColumns(relinfo, estate) \
+	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->insertedCols)
+#define GetUpdatedColumns(relinfo, estate) \
+	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)
 
 /*
  * ExecRowMark -
@@ -978,14 +983,31 @@ typedef struct ModifyTableState
 	int			mt_num_dispatch;	/* Number of entries in the above array */
 	int			mt_num_partitions;	/* Number of members in the following
 									 * arrays */
-	ResultRelInfo *mt_partitions;	/* Per partition result relation */
-	TupleConversionMap **mt_partition_tupconv_maps;
-	/* Per partition tuple conversion map */
+	ResultRelInfo **mt_partitions;	/* Per partition result relation pointers */
+
+	/*
+	 * Per partition conversion map to convert tuples from root to leaf
+	 * partition
+	 */
+	TupleConversionMap **mt_perleaf_parentchild_maps;
+
+	/*
+	 * Per partition conversion map to convert tuples from leaf partition to
+	 * root
+	 */
+	TupleConversionMap **mt_perleaf_childparent_maps;
+
+	/*
+	 * Per subplan conversion map to convert tuples from leaf partition to
+	 * root partitioned table
+	 */
+	TupleConversionMap **mt_persubplan_childparent_maps;
+
 	TupleTableSlot *mt_partition_tuple_slot;
+	TupleTableSlot *mt_rootpartition_tuple_slot;
+
 	struct TransitionCaptureState *mt_transition_capture;
 	/* controls transition table population */
-	TupleConversionMap **mt_transition_tupconv_maps;
-	/* Per plan/partition tuple conversion */
 } ModifyTableState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index a382331..6981f58 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -219,6 +219,7 @@ typedef struct ModifyTable
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	int			resultRelIndex; /* index of first resultRel in plan's list */
 	int			rootResultRelIndex; /* index of the partitioned table root */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index a39e59d..e3ff127 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1579,6 +1579,7 @@ typedef struct ModifyTablePath
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	List	   *subpaths;		/* Path(s) producing source data */
 	List	   *subroots;		/* per-target-table PlannerInfos */
@@ -2021,6 +2022,10 @@ typedef struct AppendRelInfo
  * The child_rels list must contain at least one element, because the parent
  * partitioned table is itself counted as a child.
  *
+ * all_part_cols contains all attribute numbers from the parent that are
+ * used as partitioning columns by the parent or some descendant which is
+ * itself partitioned.
+ *
  * These structs are kept in the PlannerInfo node's pcinfo_list.
  */
 typedef struct PartitionedChildRelInfo
@@ -2029,6 +2034,7 @@ typedef struct PartitionedChildRelInfo
 
 	Index		parent_relid;
 	List	   *child_rels;
+	Bitmapset  *all_part_cols;
 } PartitionedChildRelInfo;
 
 /*
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e372f88..b38f2f1 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -238,6 +238,7 @@ extern ModifyTablePath *create_modifytable_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index 2a4cf71..c6c15c5 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -57,6 +57,7 @@ extern Expr *preprocess_phv_expression(PlannerInfo *root, Expr *expr);
 
 extern bool plan_cluster_use_sort(Oid tableOid, Oid indexOid);
 
-extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti);
+extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti,
+										Bitmapset **all_part_cols_p);
 
 #endif							/* PLANNER_H */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index cef70b1..4bc21f8 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,27 +198,334 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+---------------------------
+-- UPDATE with row movement
+---------------------------
+-- update to a partition should check the partition bound constraint for the
+-- new tuple. If the partition key is updated, the row should be moved to the
+-- appropriate partition. Updatable views using partitions should enforce the
+-- check options for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c numeric
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+-- Create partitions intentionally in descending bound order, to verify that
+-- the subplans get ordered in ascending bound order rather than by OID.
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (c numeric, a text, b int);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+create table part_c_1_100 (b int, c numeric, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, NULL), (''a'', 10, 200), (''b'', 12, 96), (''b'', 13, 97), (''b'', 15, 105), (''b'', 17, 105)'
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 105
+ part_c_100_200 | b | 17 | 105
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+(6 rows)
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+             QUERY PLAN              
+-------------------------------------
+ Update on range_parted
+   Update on part_a_1_a_10
+   Update on part_a_10_a_20
+   Update on part_b_1_b_10
+   Update on part_c_1_100
+   Update on part_c_100_200
+   ->  Seq Scan on part_a_1_a_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_a_10_a_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_1_b_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_1_100
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_100_200
+         Filter: (c > '97'::numeric)
+(16 rows)
+
+-- fail (row movement happens only within the partition subtree):
+update part_c_1_100 set c = c + 20 where c = 96;
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (12, 116, b).
+-- No row found:
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+select a, b, c from part_c_1_100 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select a, b, c from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 12 | 116
+ b | 13 | 117
+ b | 15 | 125
+ b | 17 | 125
+(4 rows)
+
+-- fail (row movement happens only within the partition subtree):
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (117, b, 7).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_1_a_10  | a |  1 |    
+ part_a_1_a_10  | a |  4 | 200
+ part_b_1_b_10  | b |  7 | 117
+ part_b_1_b_10  | b |  9 | 125
+ part_c_100_200 | b | 11 | 125
+ part_c_100_200 | b | 12 | 116
+(6 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120).
+-- succeeds, row movement, check option passes
+update upview set a = 'b', b = 15 where b = 4;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_1_a_10  | a |  1 |    
+ part_b_1_b_10  | b |  7 | 117
+ part_b_1_b_10  | b |  9 | 125
+ part_c_100_200 | b | 11 | 125
+ part_c_100_200 | b | 12 | 116
+ part_c_100_200 | b | 15 | 199
+(6 rows)
+
+-- cleanup
+drop view upview;
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted), *;
+ range_parted | a | b  | c  
+--------------+---+----+----
+ (b,15,95)    | b | 15 | 95
+ (b,17,95)    | b | 17 | 95
+(2 rows)
+
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+ part_c_1_100   | b | 15 |  95
+ part_c_1_100   | b | 17 |  95
+(6 rows)
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 105
+ part_c_100_200 | b | 17 | 105
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+(6 rows)
+
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96), (b,13,97), (b,15,105), (b,17,105), new table = (b,12,110), (b,13,98), (b,15,106), (b,17,106)
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 12 | 110
+ part_c_100_200 | b | 15 | 106
+ part_c_100_200 | b | 17 | 106
+ part_c_1_100   | b | 13 |  98
+(6 rows)
+
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96), (b,13,97), (b,15,105), (b,17,105), new table = (b,12,146), (b,13,147), (b,15,155), (b,17,155)
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 12 | 146
+ part_c_100_200 | b | 13 | 147
+ part_c_100_200 | b | 15 | 155
+ part_c_100_200 | b | 17 | 155
+(6 rows)
+
+drop trigger trans_updatetrig ON range_parted;
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_c100_200 before update or insert on part_c_100_200
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 110
+ part_c_100_200 | b | 17 | 106
+ part_c_100_200 | b | 19 | 106
+ part_c_1_100   | b | 15 |  98
+(6 rows)
+
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 146
+ part_c_100_200 | b | 16 | 147
+ part_c_100_200 | b | 17 | 155
+ part_c_100_200 | b | 19 | 155
+(6 rows)
+
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_c100_200 ON part_c_100_200;
+drop function func_parted_mod_b();
+-- statement triggers with update row movement
+---------------------------------------------------
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 105
+ part_c_100_200 | b | 17 | 105
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+(6 rows)
+
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_100_200
+create trigger c100_delete_trig
+  after delete on part_c_100_200 for each statement execute procedure trigfunc();
+create trigger c100_update_trig
+  after update on part_c_100_200 for each statement execute procedure trigfunc();
+create trigger c100_insert_trig
+  after insert on part_c_100_200 for each statement execute procedure trigfunc();
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+NOTICE:  trigger = parent_update_trig fired on table range_parted during UPDATE
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 150
+ part_a_1_a_10  | a |  1 |    
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+ part_c_1_100   | b | 15 |  55
+ part_c_1_100   | b | 17 |  55
+(6 rows)
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger c100_delete_trig ON part_c_100_200;
+drop trigger c100_update_trig ON part_c_100_200;
+drop trigger c100_insert_trig ON part_c_100_200;
+drop table mintab;
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
                                   Table "public.part_def"
@@ -226,6 +533,7 @@ create table part_def partition of range_parted default;
 --------+---------+-----------+----------+---------+----------+--------------+-------------
  a      | text    |           |          |         | extended |              | 
  b      | integer |           |          |         | plain    |              | 
+ c      | numeric |           |          |         | main     |              | 
 Partition of: range_parted DEFAULT
 Partition constraint: (NOT (((a = 'a'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'a'::text) AND (b >= 10) AND (b < 20)) OR ((a = 'b'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'b'::text) AND (b >= 10) AND (b < 20))))
 
@@ -235,7 +543,55 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 ERROR:  new row for relation "part_def" violates partition constraint
-DETAIL:  Failing row contains (a, 9).
+DETAIL:  Failing row contains (a, 9, null).
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 105
+ part_c_100_200 | b | 17 | 105
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+ part_def       | d |  9 |    
+(7 rows)
+
+-- Update row movement from non-default to default partition.
+-- Fail, default partition is not under part_a_10_a_20;
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+ERROR:  new row for relation "part_a_10_a_20" violates partition constraint
+DETAIL:  Failing row contains (ad, 10, 200).
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+ partname | a  | b  |  c  
+----------+----+----+-----
+ part_def | ad |  1 |    
+ part_def | ad | 10 | 200
+ part_def | bd | 12 |  96
+ part_def | bd | 13 |  97
+ part_def | bd | 15 | 105
+ part_def | bd | 17 | 105
+ part_def | d  |  9 |    
+(7 rows)
+
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 105
+ part_c_100_200 | b | 17 | 105
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+ part_def       | d |  9 |    
+(7 rows)
+
 create table list_parted (
 	a text,
 	b int
@@ -250,6 +606,110 @@ ERROR:  new row for relation "list_default" violates partition constraint
 DETAIL:  Failing row contains (a, 10).
 -- ok
 update list_default set a = 'x' where a = 'd';
--- cleanup
+drop table list_parted;
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+ERROR:  new row for relation "sub_part2" violates partition constraint
+DETAIL:  Failing row contains (2, 10, 2).
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Triggers can cause UPDATE row movement if it modified partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+drop trigger parted_mod_b ON sub_part1 ;
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+drop trigger trig_skip_delete ON sub_part1 ;
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 |  1 | 70
+ list_part1 | 2 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop table non_parted;
+drop function func_parted_mod_b ( ) ;
 drop table range_parted;
 drop table list_parted;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 66d1fec..21f903e 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,25 +107,191 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+
+---------------------------
+-- UPDATE with row movement
+---------------------------
+
+-- update to a partition should check the partition bound constraint for the
+-- new tuple. If the partition key is updated, the row should be moved to the
+-- appropriate partition. Updatable views using partitions should enforce the
+-- check options for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c numeric
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+
+-- Create partitions intentionally in descending bound order, to verify that
+-- the subplans get ordered in ascending bound order rather than by OID.
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (c numeric, a text, b int);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+create table part_c_1_100 (b int, c numeric, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, NULL), (''a'', 10, 200), (''b'', 12, 96), (''b'', 13, 97), (''b'', 15, 105), (''b'', 17, 105)'
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96;
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+select a, b, c from part_c_1_100 order by 1, 2, 3;
+select a, b, c from part_c_100_200 order by 1, 2, 3;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement, check option passes
+update upview set a = 'b', b = 15 where b = 4;
+
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+-- cleanup
+drop view upview;
+
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+drop trigger trans_updatetrig ON range_parted;
+
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_c100_200 before update or insert on part_c_100_200
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_c100_200 ON part_c_100_200;
+drop function func_parted_mod_b();
+
+
+-- statement triggers with update row movement
+---------------------------------------------------
+
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_100_200
+create trigger c100_delete_trig
+  after delete on part_c_100_200 for each statement execute procedure trigfunc();
+create trigger c100_update_trig
+  after update on part_c_100_200 for each statement execute procedure trigfunc();
+create trigger c100_insert_trig
+  after insert on part_c_100_200 for each statement execute procedure trigfunc();
+
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger c100_delete_trig ON part_c_100_200;
+drop trigger c100_update_trig ON part_c_100_200;
+drop trigger c100_insert_trig ON part_c_100_200;
+
+drop table mintab;
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
 
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
 insert into range_parted values ('c', 9);
@@ -134,6 +300,21 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+-- Update row movement from non-default to default partition.
+-- Fail, default partition is not under part_a_10_a_20;
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
 create table list_parted (
 	a text,
 	b int
@@ -148,6 +329,82 @@ update list_default set a = 'a' where a = 'd';
 -- ok
 update list_default set a = 'x' where a = 'd';
 
--- cleanup
+drop table list_parted;
+
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+
+-- Test partition constraint violation when an intermediate ancestor is used
+-- and the constraint is inherited from the upper root.
+update sub_parted set a = 2 where c = 10;
+
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+
+
+-----------
+-- Triggers can cause UPDATE row movement if they modify the partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger parted_mod_b ON sub_part1;
+
+-- If a BR DELETE trigger prevented the DELETE from happening, we should also
+-- skip the INSERT if that DELETE is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger trig_skip_delete ON sub_part1;
+
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+drop table non_parted;
+
+drop function func_parted_mod_b();
 drop table range_parted;
 drop table list_parted;
#160amul sul
sulamul@gmail.com
In reply to: Amit Kapila (#154)
1 attachment(s)
Re: UPDATE of partition key

On Sun, Sep 10, 2017 at 8:47 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Fri, Sep 8, 2017 at 4:51 PM, amul sul <sulamul@gmail.com> wrote:

On Thu, May 18, 2017 at 9:13 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Wed, May 17, 2017 at 5:17 PM, Robert Haas <robertmhaas@gmail.com>
wrote:

On Wed, May 17, 2017 at 6:29 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think we can do this even without using an additional infomask bit.
As suggested by Greg up thread, we can set InvalidBlockId in ctid to
indicate such an update.

Hmm. How would that work?

We can pass a flag say row_moved (or require_row_movement) to
heap_delete which will in turn set InvalidBlockId in ctid instead of
setting it to self. Then the ExecUpdate needs to check for the same
and return an error when heap_update is not successful (result !=
HeapTupleMayBeUpdated). Can you explain what difficulty are you
envisioning?

Attaching a WIP patch that incorporates the above logic, although I am yet to
check all the code for places which might be using ip_blkid. I have got a
small query here: do we need an error in the HeapTupleSelfUpdated case as
well?

No, because that case is anyway a no-op (or an error, depending on whether
the tuple was updated/deleted by the same command or by a later command).
Basically, even if the row hadn't been moved to another partition, we would
not have allowed the command to proceed with the update. This handling is to
make the command fail rather than become a no-op where otherwise (when the
tuple is not moved to another partition) the command would have succeeded.

Thank you.

I've rebased the patch against Amit Khandekar's latest patch (v17_rebased_2).
Also added an ip_blkid validation check in heap_get_latest_tid() and
rewrite_heap_tuple(), because the ItemPointerEquals() check alone is no
longer sufficient after this patch.

Regards,
Amul

Attachments:

0002-invalidate_ctid-ip_blkid-WIP_2.patch (application/octet-stream)
From 8d3aa99269334bff8b216086c09879e09232b40b Mon Sep 17 00:00:00 2001
From: Amul Sul <sulamul@gmail.com>
Date: Wed, 13 Sep 2017 15:30:09 +0530
Subject: [PATCH 2/2] invalidate_ctid-ip_blkid WIP_2

Set ctid.ip_blkid to InvalidBlockNumber while moving tuple to the
another partition.
---
 src/backend/access/heap/heapam.c       | 15 ++++++++++++---
 src/backend/access/heap/rewriteheap.c  |  3 ++-
 src/backend/commands/trigger.c         |  5 +++++
 src/backend/executor/execMain.c        |  4 ++++
 src/backend/executor/nodeLockRows.c    |  5 +++++
 src/backend/executor/nodeModifyTable.c | 21 +++++++++++++++++----
 src/include/access/heapam.h            |  2 +-
 7 files changed, 46 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d20f038..5764980 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2281,7 +2281,8 @@ heap_get_latest_tid(Relation relation,
 		 */
 		if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
 			HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
-			ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
+			ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid) ||
+			!BlockNumberIsValid(BlockIdGetBlockNumber(&(tp.t_data->t_ctid).ip_blkid)))
 		{
 			UnlockReleaseBuffer(buffer);
 			break;
@@ -3027,7 +3028,7 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
 HTSU_Result
 heap_delete(Relation relation, ItemPointer tid,
 			CommandId cid, Snapshot crosscheck, bool wait,
-			HeapUpdateFailureData *hufd)
+			HeapUpdateFailureData *hufd, bool row_moved)
 {
 	HTSU_Result result;
 	TransactionId xid = GetCurrentTransactionId();
@@ -3295,6 +3296,13 @@ l1:
 	/* Make sure there is no forward chain link in t_ctid */
 	tp.t_data->t_ctid = tp.t_self;
 
+	/*
+	 * Set the block identifier to InvalidBlockNumber to indicate that this
+	 * delete is part of an update that moved the tuple to another partition.
+	 */
+	if (row_moved)
+		BlockIdSet(&((tp.t_data->t_ctid).ip_blkid), InvalidBlockNumber);
+
 	MarkBufferDirty(buffer);
 
 	/*
@@ -3420,7 +3428,7 @@ simple_heap_delete(Relation relation, ItemPointer tid)
 	result = heap_delete(relation, tid,
 						 GetCurrentCommandId(true), InvalidSnapshot,
 						 true /* wait for commit */ ,
-						 &hufd);
+						 &hufd, false);
 	switch (result)
 	{
 		case HeapTupleSelfUpdated:
@@ -5922,6 +5930,7 @@ next:
 		/* if we find the end of update chain, we're done. */
 		if (mytup.t_data->t_infomask & HEAP_XMAX_INVALID ||
 			ItemPointerEquals(&mytup.t_self, &mytup.t_data->t_ctid) ||
+			!BlockNumberIsValid(BlockIdGetBlockNumber(&(mytup.t_data->t_ctid).ip_blkid)) ||
 			HeapTupleHeaderIsOnlyLocked(mytup.t_data))
 		{
 			result = HeapTupleMayBeUpdated;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bd560e4..ebdc081 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -422,7 +422,8 @@ rewrite_heap_tuple(RewriteState state,
 	if (!((old_tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
 		  HeapTupleHeaderIsOnlyLocked(old_tuple->t_data)) &&
 		!(ItemPointerEquals(&(old_tuple->t_self),
-							&(old_tuple->t_data->t_ctid))))
+							&(old_tuple->t_data->t_ctid))) &&
+		BlockNumberIsValid(BlockIdGetBlockNumber(&(old_tuple->t_data->t_ctid).ip_blkid)))
 	{
 		OldToNewMapping mapping;
 
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index f9ea29f..b01a961 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -3111,6 +3111,11 @@ ltrmark:;
 					ereport(ERROR,
 							(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 							 errmsg("could not serialize access due to concurrent update")));
+				if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("tuple to be updated was already moved to another partition due to concurrent update")));
+
 				if (!ItemPointerEquals(&hufd.ctid, &tuple.t_self))
 				{
 					/* it was updated, so look at the updated version */
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 1508f72..35d172e 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2704,6 +2704,10 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
 						ereport(ERROR,
 								(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 								 errmsg("could not serialize access due to concurrent update")));
+					if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+						ereport(ERROR,
+								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+								 errmsg("tuple to be updated was already moved to another partition due to concurrent update")));
 
 					/* Should not encounter speculative tuple on recheck */
 					Assert(!HeapTupleHeaderIsSpeculative(tuple.t_data));
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 9389560..1b388e6 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -218,6 +218,11 @@ lnext:
 					ereport(ERROR,
 							(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 							 errmsg("could not serialize access due to concurrent update")));
+				if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
 				if (ItemPointerEquals(&hufd.ctid, &tuple.t_self))
 				{
 					/* Tuple was deleted, so don't return it */
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 400612b..7114734 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -740,7 +740,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   EState *estate,
 		   bool *delete_skipped,
 		   bool process_returning,
-		   bool canSetTag)
+		   bool canSetTag,
+		   bool row_moved)
 {
 	ResultRelInfo *resultRelInfo;
 	Relation	resultRelationDesc;
@@ -830,7 +831,8 @@ ldelete:;
 							 estate->es_output_cid,
 							 estate->es_crosscheck_snapshot,
 							 true /* wait for commit */ ,
-							 &hufd);
+							 &hufd,
+							 row_moved);
 		switch (result)
 		{
 			case HeapTupleSelfUpdated:
@@ -876,6 +878,11 @@ ldelete:;
 					ereport(ERROR,
 							(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 							 errmsg("could not serialize access due to concurrent update")));
+				if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("tuple to be updated was already moved to another partition due to concurrent update")));
+
 				if (!ItemPointerEquals(tupleid, &hufd.ctid))
 				{
 					TupleTableSlot *epqslot;
@@ -1151,7 +1158,7 @@ lreplace:;
 			 * from INSERT.
 			 */
 			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
-					   &delete_skipped, false, false);
+					   &delete_skipped, false, false, true);
 
 			/*
 			 * For some reason if DELETE didn't happen (for e.g. trigger
@@ -1262,6 +1269,11 @@ lreplace:;
 					ereport(ERROR,
 							(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 							 errmsg("could not serialize access due to concurrent update")));
+				if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("tuple to be updated was already moved to another partition due to concurrent update")));
+
 				if (!ItemPointerEquals(tupleid, &hufd.ctid))
 				{
 					TupleTableSlot *epqslot;
@@ -1281,6 +1293,7 @@ lreplace:;
 						goto lreplace;
 					}
 				}
+
 				/* tuple already deleted; nothing to do */
 				return NULL;
 
@@ -2000,7 +2013,7 @@ ExecModifyTable(PlanState *pstate)
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
 								  &node->mt_epqstate, estate,
-								  NULL, true, node->canSetTag);
+								  NULL, true, node->canSetTag, false);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4e41024..76f56cf 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -156,7 +156,7 @@ extern void heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 				  CommandId cid, int options, BulkInsertState bistate);
 extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
 			CommandId cid, Snapshot crosscheck, bool wait,
-			HeapUpdateFailureData *hufd);
+			HeapUpdateFailureData *hufd, bool row_moved);
 extern void heap_finish_speculative(Relation relation, HeapTuple tuple);
 extern void heap_abort_speculative(Relation relation, HeapTuple tuple);
 extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
-- 
2.6.2

#161Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#158)
1 attachment(s)
Re: UPDATE of partition key

On 12 September 2017 at 12:39, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 12 September 2017 at 11:57, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Sep 12, 2017 at 11:15 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

But the statement level trigger function can refer to OLD TABLE and
NEW TABLE, which will contain all the OLD rows and NEW rows
respectively. So the updated rows of the partitions (including the
moved ones) need to be captured. So for OLD TABLE, we need to capture
the deleted row, and for NEW TABLE, we need to capture the inserted
row.

Yes, I agree. So in ExecDelete, for the OLD TABLE we only need to call
ExecARUpdateTriggers, which will make the entry in the OLD TABLE only if a
transition table is there (otherwise nothing), and I guess this part
already exists in your patch. We are also calling ExecARDeleteTriggers,
and I guess that is to fire the row-level DELETE trigger, which is also
fine. What I don't understand is that if there is no row-level DELETE
trigger and there is only a statement-level DELETE trigger with a
transition table, we are still making an entry in the transition table of
the DELETE trigger, and that entry will never be used.

Hmm, ok, that might be happening, since we are calling
ExecARDeleteTriggers() with mtstate->mt_transition_capture non-NULL,
and so the deleted tuple gets captured even when there is no UPDATE
statement trigger defined, which looks redundant. Will check this.
Thanks.

I found out that, in the case where there is a DELETE statement trigger
using transition tables, it's not only an issue of redundancy; it's a
correctness issue. Since, for transition tables, both DELETE and UPDATE
use the same old-row tuplestore for capturing the OLD table, that table
gets duplicate rows: one from ExecARDeleteTriggers() and another from
ExecARUpdateTriggers(). In the presence of an INSERT statement trigger
using transition tables, the INSERT and UPDATE events have separate
tuplestores, so duplicate rows don't show up in the UPDATE NEW table.
But, nevertheless, we need to prevent NEW rows from being collected in
the INSERT event tuplestore, and capture the NEW rows only in the UPDATE
event tuplestore.

In the attached patch, we first call ExecARUpdateTriggers(), and while
doing that, we first save the fact that the row has already been captured
(mtstate->mt_transition_capture->tcs_update_old_table == true). If it has
been captured, we pass a NULL transition_capture pointer to
ExecARDeleteTriggers() (and ExecARInsertTriggers()) so that they do not
capture an extra row.

Modified a testcase in update.sql by including DELETE statement
trigger that uses transition tables.

-------

After commit 77b6b5e9c, the leaf partitions returned by
RelationGetPartitionDispatchInfo() and the UPDATE result rels come out in
the same order. Earlier, because the orders differed, I had to use a hash
table to search for the leaf partitions among the update result rels, so
that we could re-use the per-subplan UPDATE ResultRelInfos. Now that the
order is the same, in the attached patch I have removed the hash table
method and instead iterate over the leaf partition oids while shifting a
position over the per-subplan resultrels whenever the resultrel at that
position is found to be present in the leaf partition list.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

update-partition-key_v18.patch (application/octet-stream)
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index b05a9c2..5a436a1 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -2993,6 +2993,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition where this row satisfies its partition constraint.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3285,9 +3290,20 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</> causes a row to move from one partition to
+       another, there is a chance that a concurrent <command>UPDATE</> or
+       <command>DELETE</> will miss this row. Suppose that, during the row
+       movement, the row is still visible to the concurrent session, which is
+       about to do an <command>UPDATE</> or <command>DELETE</> operation on
+       the same row. This DML operation can silently miss the row if the row
+       is deleted from the partition by the first session as part of its
+       <command>UPDATE</> row movement. In such a case, the concurrent
+       <command>UPDATE</>/<command>DELETE</>, being unaware of the row
+       movement, concludes that the row has just been deleted, so there is
+       nothing to be done for it. By contrast, in the usual case where the
+       table is not partitioned, or where there is no row movement, the second
+       session would have identified the newly updated row and carried out the
+       <command>UPDATE</>/<command>DELETE</> on this new row version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 8a1619f..28cfc1a 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</> and
+   <command>INSERT</> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</> or <command>DELETE</> on the same row may miss
+   this row. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index 950245d..72300a0 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -160,6 +160,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it will be performed as a
+    <command>DELETE</command> from the original partition followed by
+    <command>INSERT</command> into the new partition. In this case, all
+    row-level <literal>BEFORE</> <command>UPDATE</command> triggers and all
+    row-level <literal>BEFORE</> <command>DELETE</command> triggers are fired
+    on the original partition. Then all row-level <literal>BEFORE</>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</>
+    triggers are concerned, <literal>AFTER</> <command>DELETE</command> and
+    <literal>AFTER</> <command>INSERT</command> triggers are applied; but
+    <literal>AFTER</> <command>UPDATE</command> triggers are not applied
+    because the <command>UPDATE</command> has been converted to a
+    <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 1ab6dba..737c9e30 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1105,7 +1105,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
 
 /*
  * map_partition_varattnos - maps varattno of any Vars in expr from the
- * parent attno to partition attno.
+ * attno's of 'from_rel' partition to the attno's of 'to_rel' partition.
+ * The rels can each be either a leaf partition or a partitioned table.
  *
  * We must allow for cases where physical attnos of a partition can be
  * different from the parent's.
@@ -1118,8 +1119,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
  * are working on Lists, so it's less messy to do the casts internally.
  */
 List *
-map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row)
 {
 	bool		my_found_whole_row = false;
@@ -1128,14 +1129,14 @@ map_partition_varattnos(List *expr, int target_varno,
 	{
 		AttrNumber *part_attnos;
 
-		part_attnos = convert_tuples_by_name_map(RelationGetDescr(partrel),
-												 RelationGetDescr(parent),
+		part_attnos = convert_tuples_by_name_map(RelationGetDescr(to_rel),
+												 RelationGetDescr(from_rel),
 												 gettext_noop("could not convert row type"));
 		expr = (List *) map_variable_attnos((Node *) expr,
-											target_varno, 0,
+											fromrel_varno, 0,
 											part_attnos,
-											RelationGetDescr(parent)->natts,
-											RelationGetForm(partrel)->reltype,
+											RelationGetDescr(from_rel)->natts,
+											RelationGetForm(to_rel)->reltype,
 											&my_found_whole_row);
 	}
 
@@ -2439,6 +2440,77 @@ error_exit:
 }
 
 /*
+ * For each column of rel which is in the partition key or which appears
+ * in an expression which is in the partition key, translate the attribute
+ * number of that column according to the given parent, and add the resulting
+ * column number to the bitmapset, offset as we frequently do by
+ * FirstLowInvalidHeapAttributeNumber.
+ */
+void
+pull_child_partition_columns(Bitmapset **bitmapset,
+							 Relation rel,
+							 Relation parent)
+{
+	PartitionKey key = RelationGetPartitionKey(rel);
+	int16		partnatts = get_partition_natts(key);
+	List	   *partexprs = get_partition_exprs(key);
+	ListCell   *lc;
+	Bitmapset  *child_keycols = NULL;
+	int			i;
+	AttrNumber *map;
+	int			child_keycol = -1;
+
+	/*
+	 * First, compute the complete set of partition columns for this rel. For
+	 * compatibility with the API exposed by pull_varattnos, we offset the
+	 * column numbers by FirstLowInvalidHeapAttributeNumber.
+	 */
+	for (i = 0; i < partnatts; i++)
+	{
+		AttrNumber	partattno = get_partition_col_attnum(key, i);
+
+		if (partattno != 0)
+			child_keycols =
+				bms_add_member(child_keycols,
+							   partattno - FirstLowInvalidHeapAttributeNumber);
+	}
+	foreach(lc, partexprs)
+	{
+		Node	   *expr = (Node *) lfirst(lc);
+
+		pull_varattnos(expr, 1, &child_keycols);
+	}
+
+	/*
+	 * Next, work out how to convert from the attribute numbers for the child
+	 * to the attribute numbers for the parent.
+	 */
+	map =
+		convert_tuples_by_name_map(RelationGetDescr(parent),
+								   RelationGetDescr(rel),
+								   gettext_noop("could not convert row type"));
+
+	/*
+	 * For each child key column we have identified, translate to the
+	 * corresponding parent key column.  Entry 0 in the map array corresponds
+	 * to attribute number 1, which corresponds to a bitmapset entry for 1 -
+	 * FirstLowInvalidHeapAttributeNumber.
+	 */
+	while ((child_keycol = bms_next_member(child_keycols, child_keycol)) >= 0)
+	{
+		int			kc = child_keycol + FirstLowInvalidHeapAttributeNumber;
+
+		Assert(kc > 0 && kc <= RelationGetNumberOfAttributes(rel));
+		*bitmapset =
+			bms_add_member(*bitmapset,
+						   map[kc - 1] - FirstLowInvalidHeapAttributeNumber);
+	}
+
+	/* Release memory. */
+	pfree(map);
+}
+
+/*
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfa3f05..4ac5bd6 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -168,7 +168,7 @@ typedef struct CopyStateData
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;	/* Number of entries in the above array */
 	int			num_partitions; /* Number of members in the following arrays */
-	ResultRelInfo *partitions;	/* Per partition result relation */
+	ResultRelInfo **partitions;	/* Per partition result relation pointers */
 	TupleConversionMap **partition_tupconv_maps;
 	TupleTableSlot *partition_tuple_slot;
 	TransitionCaptureState *transition_capture;
@@ -2446,13 +2446,15 @@ CopyFrom(CopyState cstate)
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo *partitions;
+		ResultRelInfo **partitions;
 		TupleConversionMap **partition_tupconv_maps;
 		TupleTableSlot *partition_tuple_slot;
 		int			num_parted,
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(cstate->rel,
+									   NULL,
+									   0,
 									   1,
 									   estate,
 									   &partition_dispatch_info,
@@ -2482,7 +2484,7 @@ CopyFrom(CopyState cstate)
 			for (i = 0; i < cstate->num_partitions; ++i)
 			{
 				cstate->transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(cstate->partitions[i].ri_RelationDesc),
+					convert_tuples_by_name(RelationGetDescr(cstate->partitions[i]->ri_RelationDesc),
 										   RelationGetDescr(cstate->rel),
 										   gettext_noop("could not convert row type"));
 			}
@@ -2616,7 +2618,7 @@ CopyFrom(CopyState cstate)
 			 * to the selected partition.
 			 */
 			saved_resultRelInfo = resultRelInfo;
-			resultRelInfo = cstate->partitions + leaf_part_index;
+			resultRelInfo = cstate->partitions[leaf_part_index];
 
 			/* We do not yet have a way to insert into a foreign partition */
 			if (resultRelInfo->ri_FdwRoutine)
@@ -2726,7 +2728,7 @@ CopyFrom(CopyState cstate)
 
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr || check_partition_constr)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
@@ -2846,7 +2848,7 @@ CopyFrom(CopyState cstate)
 		}
 		for (i = 0; i < cstate->num_partitions; i++)
 		{
-			ResultRelInfo *resultRelInfo = cstate->partitions + i;
+			ResultRelInfo *resultRelInfo = cstate->partitions[i];
 
 			ExecCloseIndices(resultRelInfo);
 			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 269c9e1..f9ea29f 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -67,15 +67,6 @@ int			SessionReplicationRole = SESSION_REPLICATION_ROLE_ORIGIN;
 /* How many levels deep into trigger execution are we? */
 static int	MyTriggerDepth = 0;
 
-/*
- * Note that similar macros also exist in executor/execMain.c.  There does not
- * appear to be any good header to put them into, given the structures that
- * they use, so we let them be duplicated.  Be sure to update all if one needs
- * to be changed, however.
- */
-#define GetUpdatedColumns(relinfo, estate) \
-	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)
-
 /* Local function prototypes */
 static void ConvertTriggerToFK(CreateTrigStmt *stmt, Oid funcoid);
 static void SetTriggerFlags(TriggerDesc *trigdesc, Trigger *trigger);
@@ -2903,8 +2894,13 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 	{
 		HeapTuple	trigtuple;
 
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
+		/*
+		 * Note: if the UPDATE is converted into a DELETE+INSERT as part of an
+		 * update-partition-key operation, this function is also called
+		 * separately for the DELETE and the INSERT to capture transition
+		 * table rows.  In that case, either the old or new tuple can be NULL.
+		 */
+		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
 			trigtuple = GetTupleForTrigger(estate,
 										   NULL,
 										   relinfo,
@@ -5211,7 +5207,12 @@ AfterTriggerPendingOnRel(Oid relid)
  *	triggers actually need to be queued.  It is also called after each row,
  *	even if there are no triggers for that event, if there are any AFTER
  *	STATEMENT triggers for the statement which use transition tables, so that
- *	the transition tuplestores can be built.
+ *	the transition tuplestores can be built. Furthermore, if the transition
+ *  capture is happening for UPDATEd rows being moved to another partition
+ *  due to a partition-key change, then this function is called once when
+ *  the row is deleted (to capture the OLD row), and once when the row is
+ *  inserted into another partition (to capture the NEW row). This is done
+ *  separately because the DELETE and INSERT happen on different tables.
  *
  *	Transition tuplestores are built now, rather than when events are pulled
  *	off of the queue because AFTER ROW triggers are allowed to select from the
@@ -5260,12 +5261,27 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		bool		update_new_table = transition_capture->tcs_update_new_table;
 		bool		insert_new_table = transition_capture->tcs_insert_new_table;;
 
-		if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_old_table))
+		/*
+		 * For capturing transition tuples for UPDATE events fired during
+		 * partition row movement, either oldtup or newtup can be NULL,
+		 * depending on whether the event is for the row being deleted from
+		 * the old partition or for the row being inserted into the new one.
+		 * But in any case, oldtup should always be non-NULL for DELETE
+		 * events, and newtup non-NULL for INSERT events, because with
+		 * transition capture during partition row movement, INSERT and
+		 * DELETE events don't fire; only the UPDATE event is fired.
+		 */
+		Assert(!(event == TRIGGER_EVENT_DELETE && delete_old_table &&
+				 oldtup == NULL));
+		Assert(!(event == TRIGGER_EVENT_INSERT && insert_new_table &&
+				 newtup == NULL));
+
+		if (oldtup != NULL &&
+			((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_old_table)))
 		{
 			Tuplestorestate *old_tuplestore;
 
-			Assert(oldtup != NULL);
 			old_tuplestore = transition_capture->tcs_old_tuplestore;
 
 			if (map != NULL)
@@ -5278,12 +5294,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			else
 				tuplestore_puttuple(old_tuplestore, oldtup);
 		}
-		if ((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_new_table))
+		if (newtup != NULL &&
+			((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_new_table)))
 		{
 			Tuplestorestate *new_tuplestore;
 
-			Assert(newtup != NULL);
 			if (event == TRIGGER_EVENT_INSERT)
 				new_tuplestore = transition_capture->tcs_insert_tuplestore;
 			else
@@ -5306,7 +5322,8 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		if (trigdesc == NULL ||
 			(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
 			(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+			(event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup == NULL)))
 			return;
 	}
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 4b594d4..67bfd2c 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -65,7 +65,6 @@
 #include "utils/snapmgr.h"
 #include "utils/tqual.h"
 
-
 /* Hooks for plugins to get control in ExecutorStart/Run/Finish/End */
 ExecutorStart_hook_type ExecutorStart_hook = NULL;
 ExecutorRun_hook_type ExecutorRun_hook = NULL;
@@ -104,19 +103,6 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
 									 int maxfieldlen);
 static void EvalPlanQualStart(EPQState *epqstate, EState *parentestate,
 				  Plan *planTree);
-static void ExecPartitionCheck(ResultRelInfo *resultRelInfo,
-				   TupleTableSlot *slot, EState *estate);
-
-/*
- * Note that GetUpdatedColumns() also exists in commands/trigger.c.  There does
- * not appear to be any good header to put it into, given the structures that
- * it uses, so we let them be duplicated.  Be sure to update both if one needs
- * to be changed, however.
- */
-#define GetInsertedColumns(relinfo, estate) \
-	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->insertedCols)
-#define GetUpdatedColumns(relinfo, estate) \
-	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)
 
 /* end of local decls */
 
@@ -1850,15 +1836,10 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
 /*
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  */
-static void
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
-	Relation	rel = resultRelInfo->ri_RelationDesc;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	Bitmapset  *modifiedCols;
-	Bitmapset  *insertedCols;
-	Bitmapset  *updatedCols;
 	ExprContext *econtext;
 
 	/*
@@ -1886,52 +1867,66 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
 	 */
-	if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+	return ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext);
+}
+
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
+	/* See the comments in ExecConstraints. */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
+									 gettext_noop("could not convert row type"));
+		if (map != NULL)
 		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
-										 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecSetSlotDescriptor(slot, tupdesc);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+			tuple = do_convert_tuple(tuple, map);
+			ExecSetSlotDescriptor(slot, tupdesc);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("new row for relation \"%s\" violates partition constraint",
-						RelationGetRelationName(orig_rel)),
-				 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+			 errmsg("new row for relation \"%s\" violates partition constraint",
+					RelationGetRelationName(orig_rel)),
+			 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if requested,
+ * checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1939,7 +1934,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2055,8 +2051,9 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 }
 
 
@@ -3242,34 +3239,40 @@ EvalPlanQualEnd(EPQState *epqstate)
  * ExecSetupPartitionTupleRouting - set up information needed during
  * tuple routing for partitioned tables
  *
+ * 'update_rri' contains the UPDATE per-subplan result rels.
+ * 'num_update_rri' is the number of UPDATE per-subplan result rels; for
+ *      INSERT, this is 0.
+ *
  * Output arguments:
  * 'pd' receives an array of PartitionDispatch objects with one entry for
  *		every partitioned table in the partition tree
- * 'partitions' receives an array of ResultRelInfo objects with one entry for
+ * 'partitions' receives an array of ResultRelInfo* objects with one entry for
  *		every leaf partition in the partition tree
- * 'tup_conv_maps' receives an array of TupleConversionMap objects with one
- *		entry for every leaf partition (required to convert input tuple based
- *		on the root table's rowtype to a leaf partition's rowtype after tuple
- *		routing is done)
+ * 'perleaf_parentchild_maps' receives an array of TupleConversionMap objects
+ *		with one entry for every leaf partition (required to convert input tuple
+ *		based on the root table's rowtype to a leaf partition's rowtype after
+ *		tuple routing is done)
  * 'partition_tuple_slot' receives a standalone TupleTableSlot to be used
  *		to manipulate any given leaf partition's rowtype after that partition
  *		is chosen by tuple-routing.
  * 'num_parted' receives the number of partitioned tables in the partition
  *		tree (= the number of entries in the 'pd' output array)
  * 'num_partitions' receives the number of leaf partitions in the partition
- *		tree (= the number of entries in the 'partitions' and 'tup_conv_maps'
- *		output arrays
+ *		tree (= the number of entries in the 'partitions' and
+ *		'perleaf_parentchild_maps' output arrays
  *
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  */
 void
 ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionDispatch **pd,
-							   ResultRelInfo **partitions,
-							   TupleConversionMap ***tup_conv_maps,
+							   ResultRelInfo ***partitions,
+							   TupleConversionMap ***perleaf_parentchild_maps,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions)
 {
@@ -3277,7 +3280,9 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
+	ResultRelInfo *leaf_part_arr;
+	ResultRelInfo *cur_update_rri;
+	Oid			cur_reloid = InvalidOid;
 
 	/*
 	 * Get the information about the partition tree after locking all the
@@ -3286,10 +3291,37 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
 	*pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
 	*num_partitions = list_length(leaf_parts);
-	*partitions = (ResultRelInfo *) palloc(*num_partitions *
-										   sizeof(ResultRelInfo));
-	*tup_conv_maps = (TupleConversionMap **) palloc0(*num_partitions *
-													 sizeof(TupleConversionMap *));
+	*partitions = (ResultRelInfo **) palloc(*num_partitions *
+											sizeof(ResultRelInfo *));
+	*perleaf_parentchild_maps = (TupleConversionMap **) palloc0(*num_partitions *
+																sizeof(TupleConversionMap *));
+
+	if (num_update_rri != 0)
+	{
+		/*
+		 * For UPDATE, if a leaf partition is already present among the
+		 * per-subplan result rels, we re-use it rather than initialize a
+		 * new result rel. The per-subplan resultrels and the leaf partition
+		 * resultrels are both in the same canonical order, so while walking
+		 * through the leaf partition OIDs we only need to track the next
+		 * per-subplan result rel to be matched. So, position cur_update_rri
+		 * at the first per-subplan result rel, and advance it each time we
+		 * find a match while scanning the leaf partition OIDs.
+		 */
+		cur_update_rri = update_rri;
+		cur_reloid = RelationGetRelid(cur_update_rri->ri_RelationDesc);
+	}
+	else
+	{
+		/*
+		 * For inserts, we need to create all new result rels, so avoid
+		 * repeated pallocs by allocating memory for all the result rels in
+		 * bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(*num_partitions *
+												  sizeof(ResultRelInfo));
+	}
 
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
@@ -3299,36 +3331,83 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	 */
 	*partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = *partitions;
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (num_update_rri != 0)
+		{
+			/* Is this leaf partition present in the update resultrels? */
+			if (cur_reloid == leaf_oid)
+			{
+				Assert(cur_update_rri <= update_rri + num_update_rri - 1);
+
+				leaf_part_rri = cur_update_rri;
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required when converting the tuple as per the root
+				 * partition's tuple descriptor; it was not set when the
+				 * UPDATE plans were generated.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+
+				cur_update_rri++;
+
+				/*
+				 * If this was the last UPDATE resultrel, indicate that by
+				 * invalidating the cur_reloid.
+				 */
+				if (cur_update_rri == update_rri + num_update_rri)
+					cur_reloid = InvalidOid;
+				else
+					cur_reloid = RelationGetRelid(cur_update_rri->ri_RelationDesc);
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = leaf_part_arr + i;
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in *partitions are eventually
-		 * closed by the caller.
+		 * If we didn't open the partition rel above, then we haven't
+		 * initialized its result rel either.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above including the leaf
+			 * partitions. Note that each of the newly opened relations in
+			 * *partitions are eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri,
+							  partrel,
+							  resultRTindex,
+							  rel,
+							  estate->es_instrument);
+		}
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
 		 * Save a tuple conversion map to convert a tuple routed to this
 		 * partition from the parent's type to the partition's.
 		 */
-		(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
-													 gettext_noop("could not convert row type"));
-
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  resultRTindex,
-						  rel,
-						  estate->es_instrument);
+		(*perleaf_parentchild_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
+																gettext_noop("could not convert row type"));
 
 		/*
-		 * Verify result relation is a valid target for INSERT.
+		 * Verify result relation is a valid target for an INSERT.  Even for
+		 * UPDATE, tuple routing effectively performs an INSERT into the
+		 * chosen leaf partition, so the same check applies here.
 		 */
 		CheckValidResultRel(leaf_part_rri, CMD_INSERT);
 
@@ -3344,9 +3423,18 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		leaf_part_rri++;
+		(*partitions)[i] = leaf_part_rri;
 		i++;
 	}
+
+	/*
+	 * For UPDATE, we should have found all the per-subplan resultrels in the
+	 * leaf partitions; so cur_update_rri should be positioned just next to
+	 * the last per-subplan resultrel.
+	 */
+	Assert(num_update_rri == 0 ||
+		   (cur_reloid == InvalidOid &&
+			cur_update_rri == update_rri + num_update_rri));
 }
 
 /*
@@ -3372,8 +3460,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 	result = get_partition_for_tuple(pd, slot, estate,
 									 &failed_at, &failed_slot);
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 5a75e02..6b8af46 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -401,7 +401,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -466,7 +466,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 49586a3..9d2aed4 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -45,6 +45,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -53,7 +54,6 @@
 #include "utils/rel.h"
 #include "utils/tqual.h"
 
-
 static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 ResultRelInfo *resultRelInfo,
 					 ItemPointer conflictTid,
@@ -63,6 +63,11 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 bool canSetTag,
 					 TupleTableSlot **returning);
 
+static HeapTuple ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+										   TupleConversionMap *map,
+										   HeapTuple tuple,
+										   TupleTableSlot *new_slot,
+										   TupleTableSlot **p_old_slot);
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
  * target relation's rowtype
@@ -240,6 +245,38 @@ ExecCheckTIDVisible(EState *estate,
 	ReleaseBuffer(buffer);
 }
 
+/*
+ * ConvertPartitionTupleSlot -- convenience function for converting a tuple
+ * and storing it into the tuple slot provided through 'new_slot', which
+ * typically should be one of the dedicated partition tuple slots. Passes that
+ * slot back through the output param p_old_slot. If no mapping is present,
+ * p_old_slot is kept unchanged.
+ *
+ * Returns the converted tuple.
+ */
+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+						  TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot,
+						  TupleTableSlot **p_old_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor, as per converted tuple.
+	 */
+	*p_old_slot = new_slot;
+	Assert(new_slot != NULL);
+	ExecSetSlotDescriptor(new_slot, map->outdesc);
+	ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
 /* ----------------------------------------------------------------
  *		ExecInsert
  *
@@ -265,6 +302,7 @@ ExecInsert(ModifyTableState *mtstate,
 	Oid			newId;
 	List	   *recheckIndexes = NIL;
 	TupleTableSlot *result = NULL;
+	TransitionCaptureState *transition_capture = mtstate->mt_transition_capture;
 
 	/*
 	 * get the heap tuple out of the tuple table slot, making sure we have a
@@ -281,17 +319,50 @@ ExecInsert(ModifyTableState *mtstate,
 	if (mtstate->mt_partition_dispatch_info)
 	{
 		int			leaf_part_index;
-		TupleConversionMap *map;
+		ResultRelInfo *rootResultRelInfo;
+
+		/*
+		 * If the original operation is UPDATE, the root partition rel needs
+		 * to be fetched from mtstate->rootResultRelInfo.
+		 */
+		rootResultRelInfo = (mtstate->rootResultRelInfo ?
+							 mtstate->rootResultRelInfo : resultRelInfo);
+
+		/*
+		 * If the resultRelInfo is not the root table's (which happens for
+		 * UPDATE), we should convert the tuple into the root table's tuple
+		 * descriptor, since ExecFindPartition() starts the search from the
+		 * root.  The tuple conversion maps are in the same order as
+		 * mtstate->resultRelInfo[], so to retrieve the map for this
+		 * resultRel, we need to know the position of the resultRel in
+		 * mtstate->resultRelInfo[].  Note: we assume that if the
+		 * resultRelInfo is not among the subplan result rels, its tuple
+		 * already matches the root tuple descriptor, although no such
+		 * scenario is currently known.
+		 */
+		if (rootResultRelInfo != resultRelInfo &&
+			mtstate->mt_persubplan_childparent_maps != NULL &&
+			resultRelInfo >= mtstate->resultRelInfo &&
+			resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1)
+		{
+			int			map_index = resultRelInfo - mtstate->resultRelInfo;
+
+			tuple = ConvertPartitionTupleSlot(mtstate,
+											  mtstate->mt_persubplan_childparent_maps[map_index],
+											  tuple,
+											  mtstate->mt_rootpartition_tuple_slot,
+											  &slot);
+		}
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
 		 * ExecFindPartition() does not return and errors out instead.
 		 * Otherwise, the returned value is to be used as an index into arrays
-		 * mt_partitions[] and mt_partition_tupconv_maps[] that will get us
+		 * mt_partitions[] and mt_perleaf_parentchild_maps[] that will get us
 		 * the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
-		leaf_part_index = ExecFindPartition(resultRelInfo,
+		leaf_part_index = ExecFindPartition(rootResultRelInfo,
 											mtstate->mt_partition_dispatch_info,
 											slot,
 											estate);
@@ -303,7 +374,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * the selected partition.
 		 */
 		saved_resultRelInfo = resultRelInfo;
-		resultRelInfo = mtstate->mt_partitions + leaf_part_index;
+		resultRelInfo = mtstate->mt_partitions[leaf_part_index];
 
 		/* We do not yet have a way to insert into a foreign partition */
 		if (resultRelInfo->ri_FdwRoutine)
@@ -331,7 +402,7 @@ ExecInsert(ModifyTableState *mtstate,
 				 */
 				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
 				mtstate->mt_transition_capture->tcs_map =
-					mtstate->mt_transition_tupconv_maps[leaf_part_index];
+					mtstate->mt_perleaf_childparent_maps[leaf_part_index];
 			}
 			else
 			{
@@ -348,23 +419,11 @@ ExecInsert(ModifyTableState *mtstate,
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = mtstate->mt_partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = mtstate->mt_partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(mtstate,
+										  mtstate->mt_perleaf_parentchild_maps[leaf_part_index],
+										  tuple,
+										  mtstate->mt_partition_tuple_slot,
+										  &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -482,7 +541,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 		/* Check the constraints of the tuple */
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -618,9 +677,31 @@ ExecInsert(ModifyTableState *mtstate,
 		setLastTid(&(tuple->t_self));
 	}
 
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition NEW TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_new_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo, NULL,
+							 NULL,
+							 tuple,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * Now that the NEW TABLE row has been captured, make sure that the
+		 * AR INSERT trigger fired below does not capture it again.
+		 */
+		transition_capture = NULL;
+	}
+
 	/* AFTER ROW INSERT Triggers */
 	ExecARInsertTriggers(estate, resultRelInfo, tuple, recheckIndexes,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
 	list_free(recheckIndexes);
 
@@ -674,6 +755,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool *delete_skipped,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -681,6 +764,10 @@ ExecDelete(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
+	TransitionCaptureState *transition_capture = mtstate->mt_transition_capture;
+
+	if (delete_skipped)
+		*delete_skipped = true;
 
 	/*
 	 * get information on the (current) result relation
@@ -845,12 +932,39 @@ ldelete:;
 	if (canSetTag)
 		(estate->es_processed)++;
 
+	/* The delete has actually happened, so let the caller know. */
+	if (delete_skipped)
+		*delete_skipped = false;
+
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition OLD TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_old_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo,
+							 tupleid,
+							 oldtuple,
+							 NULL,
+							 NULL,
+							 transition_capture);
+
+		/*
+	 * Now that the OLD TABLE row has been captured, make sure that the
+	 * AR DELETE trigger fired below does not capture it again.
+		 */
+		transition_capture = NULL;
+	}
+
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -943,6 +1057,8 @@ ExecUpdate(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	TupleConversionMap *saved_tcs_map = NULL;
+
 
 	/*
 	 * abort the operation if not running transactions
@@ -1039,12 +1155,82 @@ lreplace:;
 								 resultRelInfo, slot, estate);
 
 		/*
+		 * If a partition check fails, try to move the row into the right
+		 * partition.
+		 */
+		if (resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate))
+		{
+			bool		delete_skipped;
+			TupleTableSlot *ret_slot;
+
+			/*
+			 * When an UPDATE is run directly on a leaf partition, partition
+			 * tuple routing is not set up. In that case, fail with a
+			 * partition constraint violation error.
+			 */
+			if (mtstate->mt_partition_dispatch_info == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for DELETE. We want to return rows
+			 * from INSERT.
+			 */
+			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
+					   &delete_skipped, false, false);
+
+			/*
+			 * If for some reason the DELETE didn't happen (e.g. a trigger
+			 * prevented it, the tuple was already deleted by this command,
+			 * or it was concurrently deleted by another transaction), then
+			 * we must skip the INSERT as well; otherwise we would
+			 * effectively insert one new row.
+			 *
+			 * For a normal UPDATE, the case where the tuple has been the
+			 * subject of a concurrent UPDATE or DELETE would be handled by
+			 * the EvalPlanQual machinery, but for an UPDATE that we've
+			 * translated into a DELETE from this partition and an INSERT into
+			 * some other partition, that's not available, because CTID chains
+			 * can't span relation boundaries.  We mimic the semantics to a
+			 * limited extent by skipping the INSERT if the DELETE fails to
+			 * find a tuple. This ensures that two concurrent attempts to
+			 * UPDATE the same tuple at the same time can't turn one tuple
+			 * into two, and that an UPDATE of a just-deleted tuple can't
+			 * resurrect it.
+			 */
+			if (delete_skipped)
+				return NULL;
+
+			if (mtstate->mt_transition_capture)
+				saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
+
+			ret_slot = ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+
+			if (mtstate->mt_transition_capture)
+			{
+				/*
+				 * Revert to the transition capture map created for UPDATE;
+				 * otherwise the next UPDATE will incorrectly use the one
+				 * created for INSERT.
+				 */
+				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+				mtstate->mt_transition_capture->tcs_map = saved_tcs_map;
+			}
+			return ret_slot;
+		}
+
+		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We have already checked partition constraints above, so skip them
+		 * below.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1463,6 +1649,45 @@ fireASTriggers(ModifyTableState *node)
 }
 
 /*
+ * Set up the per-subplan tuple conversion maps from child partitions to the
+ * root partitioned table. The maps are needed for collecting transition tuples for
+ * AFTER triggers, and for UPDATE row movement.
+ */
+static void
+ExecSetupPerSubplanChildParentMap(ModifyTableState *mtstate)
+{
+	TupleConversionMap **tup_conv_maps;
+	TupleDesc	outdesc;
+	ResultRelInfo *resultRelInfo;
+	ResultRelInfo *rootRelInfo;
+	int			nplans = mtstate->mt_nplans;
+	int			i;
+
+	Assert(mtstate->operation != CMD_INSERT);
+
+	if (mtstate->mt_persubplan_childparent_maps != NULL)
+		return;
+
+	rootRelInfo = getASTriggerResultRelInfo(mtstate);
+
+	mtstate->mt_persubplan_childparent_maps =
+		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) * nplans);
+
+	/* Get tuple descriptor of the root partition. */
+	outdesc = RelationGetDescr(rootRelInfo->ri_RelationDesc);
+
+	resultRelInfo = mtstate->resultRelInfo;
+	tup_conv_maps = mtstate->mt_persubplan_childparent_maps;
+	for (i = 0; i < nplans; i++)
+	{
+		TupleDesc	indesc = RelationGetDescr(resultRelInfo[i].ri_RelationDesc);
+
+		tup_conv_maps[i] = convert_tuples_by_name(indesc, outdesc,
+												  gettext_noop("could not convert row type"));
+	}
+}
+
+/*
  * Set up the state needed for collecting transition tuples for AFTER
  * triggers.
  */
@@ -1470,63 +1695,115 @@ static void
 ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 {
 	ResultRelInfo *targetRelInfo = getASTriggerResultRelInfo(mtstate);
+	ResultRelInfo **resultRelInfos;
+	int			numResultRelInfos;
+	int			update_rri_index = -1;
+	ResultRelInfo *update_rri = mtstate->resultRelInfo;
+	Oid			cur_reloid = InvalidOid;
 	int			i;
 
 	/* Check for transition tables on the directly targeted relation. */
 	mtstate->mt_transition_capture =
 		MakeTransitionCaptureState(targetRelInfo->ri_TrigDesc);
 
+	if (mtstate->mt_transition_capture == NULL)
+		return;
+
 	/*
-	 * If we found that we need to collect transition tuples then we may also
+	 * Now that we know that we need to collect transition tuples, we may also
 	 * need tuple conversion maps for any children that have TupleDescs that
 	 * aren't compatible with the tuplestores.
 	 */
-	if (mtstate->mt_transition_capture != NULL)
+
+	/* Make sure per-subplan mapping is there. */
+	if (mtstate->operation != CMD_INSERT)
+		ExecSetupPerSubplanChildParentMap(mtstate);
+
+	/*
+	 * Install the conversion map for the first plan for UPDATE and DELETE
+	 * operations.  It will be advanced each time we switch to the next plan.
+	 * (INSERT operations set it every time.)
+	 */
+	if (mtstate->mt_persubplan_childparent_maps)
+	{
+		mtstate->mt_transition_capture->tcs_map =
+			mtstate->mt_persubplan_childparent_maps[0];
+	}
+
+	/* If no tuple routing, return without setting up per-leaf-partition map */
+	if (mtstate->mt_partition_dispatch_info == NULL)
+		return;
+
+	numResultRelInfos = mtstate->mt_num_partitions;
+	resultRelInfos = mtstate->mt_partitions;
+
+	/*
+	 * Build array of conversion maps from each child's TupleDesc to the one
+	 * used in the tuplestore.  The map pointers may be NULL when no
+	 * conversion is necessary, which is hopefully a common case for
+	 * partitions.
+	 */
+	mtstate->mt_perleaf_childparent_maps = (TupleConversionMap **)
+		palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+
+	/* For INSERTs, just create all the map elements afresh. */
+	if (mtstate->operation == CMD_INSERT)
+	{
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_perleaf_childparent_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
+									   RelationGetDescr(targetRelInfo->ri_RelationDesc),
+									   gettext_noop("could not convert row type"));
+		}
+		return;
+	}
+
+	/*
+	 * For UPDATEs, however, a leaf partition that is also a subplan result
+	 * rel can share the map already built for that subplan.
+	 */
+	update_rri_index = 0;
+	update_rri = mtstate->resultRelInfo;
+	if (mtstate->mt_nplans > 0)
+		cur_reloid = RelationGetRelid(update_rri[0].ri_RelationDesc);
+
+	for (i = 0; i < numResultRelInfos; ++i)
 	{
-		ResultRelInfo *resultRelInfos;
-		int			numResultRelInfos;
+		ResultRelInfo *resultRelInfo = mtstate->mt_partitions[i];
 
-		/* Find the set of partitions so that we can find their TupleDescs. */
-		if (mtstate->mt_partition_dispatch_info != NULL)
+		/* Is this leaf partition present in the UPDATE result rels? */
+		if (cur_reloid == RelationGetRelid(resultRelInfo->ri_RelationDesc))
 		{
+			Assert(update_rri_index < mtstate->mt_nplans);
+
+			mtstate->mt_perleaf_childparent_maps[i] =
+				mtstate->mt_persubplan_childparent_maps[update_rri_index];
+			update_rri_index++;
+
 			/*
-			 * For INSERT via partitioned table, so we need TupleDescs based
-			 * on the partition routing table.
+			 * If this was the last UPDATE result rel, indicate that by
+			 * invalidating cur_reloid.
 			 */
-			resultRelInfos = mtstate->mt_partitions;
-			numResultRelInfos = mtstate->mt_num_partitions;
+			if (update_rri_index == mtstate->mt_nplans)
+				cur_reloid = InvalidOid;
+			else
+				cur_reloid = RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc);
 		}
 		else
 		{
-			/* Otherwise we need the ResultRelInfo for each subplan. */
-			resultRelInfos = mtstate->resultRelInfo;
-			numResultRelInfos = mtstate->mt_nplans;
-		}
-
-		/*
-		 * Build array of conversion maps from each child's TupleDesc to the
-		 * one used in the tuplestore.  The map pointers may be NULL when no
-		 * conversion is necessary, which is hopefully a common case for
-		 * partitions.
-		 */
-		mtstate->mt_transition_tupconv_maps = (TupleConversionMap **)
-			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
-		for (i = 0; i < numResultRelInfos; ++i)
-		{
-			mtstate->mt_transition_tupconv_maps[i] =
-				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
+			mtstate->mt_perleaf_childparent_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfo->ri_RelationDesc),
 									   RelationGetDescr(targetRelInfo->ri_RelationDesc),
 									   gettext_noop("could not convert row type"));
 		}
-
-		/*
-		 * Install the conversion map for the first plan for UPDATE and DELETE
-		 * operations.  It will be advanced each time we switch to the next
-		 * plan.  (INSERT operations set it every time.)
-		 */
-		mtstate->mt_transition_capture->tcs_map =
-			mtstate->mt_transition_tupconv_maps[0];
 	}
+
+	/*
+	 * We should have found all the per-subplan reloids in the leaf
+	 * partitions.
+	 */
+	Assert(update_rri_index == mtstate->mt_nplans);
 }
 
 /* ----------------------------------------------------------------
@@ -1632,9 +1909,9 @@ ExecModifyTable(PlanState *pstate)
 				if (node->mt_transition_capture != NULL)
 				{
 					/* Prepare to convert transition tuples from this child. */
-					Assert(node->mt_transition_tupconv_maps != NULL);
+					Assert(node->mt_persubplan_childparent_maps != NULL);
 					node->mt_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						node->mt_persubplan_childparent_maps[node->mt_whichplan];
 				}
 				continue;
 			}
@@ -1750,7 +2027,8 @@ ExecModifyTable(PlanState *pstate)
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1795,9 +2073,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ResultRelInfo *resultRelInfo;
 	TupleDesc	tupDesc;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	bool		update_tuple_routing_needed = node->part_cols_updated;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -1870,6 +2151,15 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(resultRelInfo, mtstate->mt_onconflict != ONCONFLICT_NONE);
 
+		/*
+		 * If this is an UPDATE and a BEFORE UPDATE trigger is present, we may
+		 * need to do update tuple routing.
+		 */
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+			operation == CMD_UPDATE)
+			update_tuple_routing_needed = true;
+
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
 		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
@@ -1907,33 +2197,63 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/* Decide whether we need to perform update tuple routing. */
+	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		update_tuple_routing_needed = false;
+
+	/*
+	 * Build state for tuple routing if it's an INSERT or if it's an UPDATE of
+	 * partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || update_tuple_routing_needed))
 	{
 		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo *partitions;
-		TupleConversionMap **partition_tupconv_maps;
+		ResultRelInfo **partitions;
+		TupleConversionMap **perleaf_parentchild_maps;
 		TupleTableSlot *partition_tuple_slot;
 		int			num_parted,
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(rel,
+									   (operation == CMD_UPDATE ?
+										mtstate->resultRelInfo : NULL),
+									   (operation == CMD_UPDATE ? nplans : 0),
 									   node->nominalRelation,
 									   estate,
 									   &partition_dispatch_info,
 									   &partitions,
-									   &partition_tupconv_maps,
+									   &perleaf_parentchild_maps,
 									   &partition_tuple_slot,
 									   &num_parted, &num_partitions);
 		mtstate->mt_partition_dispatch_info = partition_dispatch_info;
 		mtstate->mt_num_dispatch = num_parted;
 		mtstate->mt_partitions = partitions;
 		mtstate->mt_num_partitions = num_partitions;
-		mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
+		mtstate->mt_perleaf_parentchild_maps = perleaf_parentchild_maps;
 		mtstate->mt_partition_tuple_slot = partition_tuple_slot;
+		mtstate->mt_rootpartition_tuple_slot = MakeTupleTableSlot();
+
+		/*
+		 * The following are needed as reference objects for mapping
+		 * partition attnos in expressions such as WCO and RETURNING lists.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	}
 
+	/*
+	 * Construct a mapping from each of the resultRelInfo attnos to the root
+	 * attnos. This is required when, during UPDATE row movement, the tuple
+	 * descriptor of a source partition does not match the root partition
+	 * descriptor. In such a case we need to convert tuples to the root
+	 * partition tuple descriptor, because the search for the destination
+	 * partition starts from the root. Skip this setup if it's not a
+	 * partition key update.
+	 */
+	if (update_tuple_routing_needed)
+		ExecSetupPerSubplanChildParentMap(mtstate);
+
 	/* Build state for collecting transition tuples */
 	ExecSetupTransitionCaptureState(mtstate, estate);
 
@@ -1967,50 +2287,54 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE row
+	 * movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *firstWco;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition. For UPDATE, on the other hand, there are as many WCO
+		 * lists as there are plans. In either case, use the WCO expression of
+		 * the first resultRelInfo as a reference to calculate attnos for the WCO
+		 * expression of each of the partitions. We make a copy of the WCO
+		 * qual for each partition. Note that, if there are SubPlans in there,
+		 * they all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
-			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
-		resultRelInfo = mtstate->mt_partitions;
+		Assert(update_tuple_routing_needed ||
+			   (operation == CMD_INSERT &&
+				list_length(node->withCheckOptionLists) == 1 &&
+				mtstate->mt_nplans == 1));
+
+		firstWco = linitial(node->withCheckOptionLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-			List	   *mapped_wcoList;
+			Relation	partrel;
+			List	   *mappedWco;
 			List	   *wcoExprs = NIL;
 			ListCell   *ll;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel, NULL);
-			foreach(ll, mapped_wcoList)
+			resultRelInfo = mtstate->mt_partitions[i];
+
+			partrel = resultRelInfo->ri_RelationDesc;
+			mappedWco = map_partition_varattnos(firstWco,
+												firstVarno,
+												partrel, firstResultRel,
+												NULL);
+			foreach(ll, mappedWco)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
 
-			resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
+			resultRelInfo->ri_WithCheckOptions = mappedWco;
 			resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
-			resultRelInfo++;
 		}
 	}
 
@@ -2021,7 +2345,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -2058,20 +2382,26 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		resultRelInfo = mtstate->mt_partitions;
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
+			Relation	partrel;
 			List	   *rlist;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel, NULL);
+			resultRelInfo = mtstate->mt_partitions[i];
+			partrel = resultRelInfo->ri_RelationDesc;
+
+			/*
+			 * Use the returning expression of the first resultRelInfo as a
+			 * reference to calculate attno's for the returning expression of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList,
+											firstVarno,
+											partrel, firstResultRel, NULL);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
@@ -2317,6 +2647,7 @@ void
 ExecEndModifyTable(ModifyTableState *node)
 {
 	int			i;
+	CmdType		operation = node->operation;
 
 	/*
 	 * Free transition tables, unless this query is being run in
@@ -2359,13 +2690,25 @@ ExecEndModifyTable(ModifyTableState *node)
 	}
 	for (i = 0; i < node->mt_num_partitions; i++)
 	{
-		ResultRelInfo *resultRelInfo = node->mt_partitions + i;
+		ResultRelInfo *resultRelInfo = node->mt_partitions[i];
+
+		/*
+		 * If this result rel is one of the subplan result rels, let
+		 * ExecEndPlan() close it. For INSERTs, this does not apply because
+		 * all leaf partition result rels are anyway newly allocated.
+		 */
+		if (operation == CMD_UPDATE &&
+			resultRelInfo >= node->resultRelInfo &&
+			resultRelInfo < node->resultRelInfo + node->mt_nplans)
+			continue;
 
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
 
-	/* Release the standalone partition tuple descriptor, if any */
+	/* Release the standalone partition tuple descriptors, if any */
+	if (node->mt_rootpartition_tuple_slot)
+		ExecDropSingleTupleTableSlot(node->mt_rootpartition_tuple_slot);
 	if (node->mt_partition_tuple_slot)
 		ExecDropSingleTupleTableSlot(node->mt_partition_tuple_slot);
 
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index f1bed14..2d86593 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -204,6 +204,7 @@ _copyModifyTable(const ModifyTable *from)
 	COPY_SCALAR_FIELD(canSetTag);
 	COPY_SCALAR_FIELD(nominalRelation);
 	COPY_NODE_FIELD(partitioned_rels);
+	COPY_SCALAR_FIELD(part_cols_updated);
 	COPY_NODE_FIELD(resultRelations);
 	COPY_SCALAR_FIELD(resultRelIndex);
 	COPY_SCALAR_FIELD(rootResultRelIndex);
@@ -2260,6 +2261,7 @@ _copyPartitionedChildRelInfo(const PartitionedChildRelInfo *from)
 
 	COPY_SCALAR_FIELD(parent_relid);
 	COPY_NODE_FIELD(child_rels);
+	COPY_BITMAPSET_FIELD(all_part_cols);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 8b56b91..9428c2c 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -909,6 +909,7 @@ _equalPartitionedChildRelInfo(const PartitionedChildRelInfo *a, const Partitione
 {
 	COMPARE_SCALAR_FIELD(parent_relid);
 	COMPARE_NODE_FIELD(child_rels);
+	COMPARE_BITMAPSET_FIELD(all_part_cols);
 
 	return true;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index b83d919..2492cb8 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -367,6 +367,7 @@ _outModifyTable(StringInfo str, const ModifyTable *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_INT_FIELD(resultRelIndex);
 	WRITE_INT_FIELD(rootResultRelIndex);
@@ -2096,6 +2097,7 @@ _outModifyTablePath(StringInfo str, const ModifyTablePath *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_NODE_FIELD(subroots);
@@ -2518,6 +2520,7 @@ _outPartitionedChildRelInfo(StringInfo str, const PartitionedChildRelInfo *node)
 
 	WRITE_UINT_FIELD(parent_relid);
 	WRITE_NODE_FIELD(child_rels);
+	WRITE_BITMAPSET_FIELD(all_part_cols);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index fbf8330..0b1c70e 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1562,6 +1562,7 @@ _readModifyTable(void)
 	READ_BOOL_FIELD(canSetTag);
 	READ_UINT_FIELD(nominalRelation);
 	READ_NODE_FIELD(partitioned_rels);
+	READ_BOOL_FIELD(part_cols_updated);
 	READ_NODE_FIELD(resultRelations);
 	READ_INT_FIELD(resultRelIndex);
 	READ_INT_FIELD(rootResultRelIndex);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 5b746a9..5882961 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1310,7 +1310,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		case RTE_RELATION:
 			if (rte->relkind == RELKIND_PARTITIONED_TABLE)
 				partitioned_rels =
-					get_partitioned_child_rels(root, rel->relid);
+					get_partitioned_child_rels(root, rel->relid, NULL);
 			break;
 		case RTE_SUBQUERY:
 			build_partitioned_rels = true;
@@ -1337,7 +1337,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		{
 			List	   *cprels;
 
-			cprels = get_partitioned_child_rels(root, childrel->relid);
+			cprels = get_partitioned_child_rels(root, childrel->relid, NULL);
 			partitioned_rels = list_concat(partitioned_rels,
 										   list_copy(cprels));
 		}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 2821662..85e3126 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -277,6 +277,7 @@ static ProjectSet *make_project_set(List *tlist, Plan *subplan);
 static ModifyTable *make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
@@ -2361,6 +2362,7 @@ create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 							best_path->canSetTag,
 							best_path->nominalRelation,
 							best_path->partitioned_rels,
+							best_path->part_cols_updated,
 							best_path->resultRelations,
 							subplans,
 							best_path->withCheckOptionLists,
@@ -6405,6 +6407,7 @@ static ModifyTable *
 make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam)
@@ -6431,6 +6434,7 @@ make_modifytable(PlannerInfo *root,
 	node->canSetTag = canSetTag;
 	node->nominalRelation = nominalRelation;
 	node->partitioned_rels = partitioned_rels;
+	node->part_cols_updated = part_cols_updated;
 	node->resultRelations = resultRelations;
 	node->resultRelIndex = -1;	/* will be set correctly in setrefs.c */
 	node->rootResultRelIndex = -1;	/* will be set correctly in setrefs.c */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 7f146d6..3aad00b 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1060,6 +1060,7 @@ inheritance_planner(PlannerInfo *root)
 	Query	   *parent_parse;
 	Bitmapset  *parent_relids = bms_make_singleton(top_parentRTindex);
 	PlannerInfo **parent_roots = NULL;
+	bool		part_cols_updated = false;
 
 	Assert(parse->commandType != CMD_INSERT);
 
@@ -1130,10 +1131,16 @@ inheritance_planner(PlannerInfo *root)
 	parent_rte = rt_fetch(top_parentRTindex, root->parse->rtable);
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
+		Bitmapset	*all_part_cols = NULL;
+
 		nominalRelation = top_parentRTindex;
-		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex);
+		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex,
+													  &all_part_cols);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
+
+		if (bms_overlap(all_part_cols, parent_rte->updatedCols))
+			part_cols_updated = true;
 	}
 
 	/*
@@ -1471,6 +1478,7 @@ inheritance_planner(PlannerInfo *root)
 									 parse->canSetTag,
 									 nominalRelation,
 									 partitioned_rels,
+									 part_cols_updated,
 									 resultRelations,
 									 subpaths,
 									 subroots,
@@ -2088,6 +2096,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 										parse->canSetTag,
 										parse->resultRelation,
 										NIL,
+										false,
 										list_make1_int(parse->resultRelation),
 										list_make1(path),
 										list_make1(root),
@@ -6118,11 +6127,16 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
  *		Returns a list of the RT indexes of the partitioned child relations
  *		with rti as the root parent RT index.
  *
+ * If all_part_cols_p is non-NULL, *all_part_cols_p is set to a bitmapset
+ * of all partitioning columns used by the partitioned table or any
+ * descendant.
+ *
  * Note: This function might get called even for range table entries that
  * are not partitioned tables; in such a case, it will simply return NIL.
  */
 List *
-get_partitioned_child_rels(PlannerInfo *root, Index rti)
+get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   Bitmapset **all_part_cols_p)
 {
 	List	   *result = NIL;
 	ListCell   *l;
@@ -6134,6 +6148,8 @@ get_partitioned_child_rels(PlannerInfo *root, Index rti)
 		if (pc->parent_relid == rti)
 		{
 			result = pc->child_rels;
+			if (all_part_cols_p)
+				*all_part_cols_p = pc->all_part_cols;
 			break;
 		}
 	}
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 3e0c3de..f28b381 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -105,7 +105,8 @@ static void expand_partitioned_rtentry(PlannerInfo *root,
 						   RangeTblEntry *parentrte,
 						   Index parentRTindex, Relation parentrel,
 						   PlanRowMark *top_parentrc, LOCKMODE lockmode,
-						   List **appinfos, List **partitioned_child_rels);
+						   List **appinfos, Bitmapset **all_part_cols,
+						   List **partitioned_child_rels);
 static void expand_single_inheritance_child(PlannerInfo *root,
 								RangeTblEntry *parentrte,
 								Index parentRTindex, Relation parentrel,
@@ -1464,15 +1465,20 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	if (RelationGetPartitionDesc(oldrelation) != NULL)
 	{
 		List	   *partitioned_child_rels = NIL;
+		Bitmapset  *all_part_cols = NULL;
 
 		Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
 
 		/*
 		 * If this table has partitions, recursively expand them in the order
-		 * in which they appear in the PartitionDesc.
+		 * in which they appear in the PartitionDesc. Also, extract the
+		 * partition key columns of the root partitioned table. Those of the
+		 * child partitions will be collected during recursive expansion.
 		 */
+		pull_child_partition_columns(&all_part_cols, oldrelation, oldrelation);
 		expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
 								   lockmode, &root->append_rel_list,
+								   &all_part_cols,
 								   &partitioned_child_rels);
 
 		/*
@@ -1490,6 +1496,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 			pcinfo = makeNode(PartitionedChildRelInfo);
 			pcinfo->parent_relid = rti;
 			pcinfo->child_rels = partitioned_child_rels;
+			pcinfo->all_part_cols = all_part_cols;
 			root->pcinfo_list = lappend(root->pcinfo_list, pcinfo);
 		}
 	}
@@ -1566,7 +1573,8 @@ static void
 expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 						   Index parentRTindex, Relation parentrel,
 						   PlanRowMark *top_parentrc, LOCKMODE lockmode,
-						   List **appinfos, List **partitioned_child_rels)
+						   List **appinfos, Bitmapset **all_part_cols,
+						   List **partitioned_child_rels)
 {
 	int			i;
 	RangeTblEntry *childrte;
@@ -1618,9 +1626,15 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 
 		/* If this child is itself partitioned, recurse */
 		if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		{
+			/* Also, collect the partition columns */
+			pull_child_partition_columns(all_part_cols, childrel, parentrel);
+
 			expand_partitioned_rtentry(root, childrte, childRTindex,
 									   childrel, top_parentrc, lockmode,
-									   appinfos, partitioned_child_rels);
+									   appinfos, all_part_cols,
+									   partitioned_child_rels);
+		}
 
 		/* Close child relation, but keep locks */
 		heap_close(childrel, NoLock);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 26567cb..326c858 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3162,6 +3162,8 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partitioned_rels' is an integer list of RT indexes of non-leaf tables in
  *		the partition tree, if this is an UPDATE/DELETE to a partitioned table.
  *		Otherwise NIL.
+ * 'part_cols_updated' is true if any partitioning columns are being updated,
+ *		either from the named relation or a descendant partitioned table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
  * 'subpaths' is a list of Path(s) producing source data (one per rel)
  * 'subroots' is a list of PlannerInfo structs (one per rel)
@@ -3175,6 +3177,7 @@ ModifyTablePath *
 create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
@@ -3242,6 +3245,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->canSetTag = canSetTag;
 	pathnode->nominalRelation = nominalRelation;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
+	pathnode->part_cols_updated = part_cols_updated;
 	pathnode->resultRelations = resultRelations;
 	pathnode->subpaths = subpaths;
 	pathnode->subroots = subroots;
diff --git a/src/backend/rewrite/rewriteManip.c b/src/backend/rewrite/rewriteManip.c
index 5c17213..58e98c0 100644
--- a/src/backend/rewrite/rewriteManip.c
+++ b/src/backend/rewrite/rewriteManip.c
@@ -1224,6 +1224,7 @@ typedef struct
 	/* Target type when converting whole-row vars */
 	Oid			to_rowtype;
 	bool	   *found_whole_row;	/* output flag */
+	bool		coerced_var;	/* var is under ConvertRowTypeExpr */
 } map_variable_attnos_context;
 
 static Node *
@@ -1267,22 +1268,29 @@ map_variable_attnos_mutator(Node *node,
 					/* Don't convert unless necessary. */
 					if (context->to_rowtype != var->vartype)
 					{
-						ConvertRowtypeExpr *r;
-
 						/* Var itself is converted to the requested type. */
 						newvar->vartype = context->to_rowtype;
 
 						/*
-						 * And a conversion node on top to convert back to the
-						 * original type.
+						 * If this var is already under a ConvertRowtypeExpr,
+						 * we don't have to add another one.
 						 */
-						r = makeNode(ConvertRowtypeExpr);
-						r->arg = (Expr *) newvar;
-						r->resulttype = var->vartype;
-						r->convertformat = COERCE_IMPLICIT_CAST;
-						r->location = -1;
-
-						return (Node *) r;
+						if (!context->coerced_var)
+						{
+							ConvertRowtypeExpr *r;
+
+							/*
+							 * And a conversion node on top to convert back to
+							 * the original type.
+							 */
+							r = makeNode(ConvertRowtypeExpr);
+							r->arg = (Expr *) newvar;
+							r->resulttype = var->vartype;
+							r->convertformat = COERCE_IMPLICIT_CAST;
+							r->location = -1;
+
+							return (Node *) r;
+						}
 					}
 				}
 			}
@@ -1290,6 +1298,28 @@ map_variable_attnos_mutator(Node *node,
 		}
 		/* otherwise fall through to copy the var normally */
 	}
+	else if (IsA(node, ConvertRowtypeExpr))
+	{
+		ConvertRowtypeExpr *r = (ConvertRowtypeExpr *) node;
+
+		/*
+		 * If this is coercing a var (which is typical), convert only the var,
+		 * rather than adding another ConvertRowtypeExpr on top of it.
+		 */
+		if (IsA(r->arg, Var))
+		{
+			ConvertRowtypeExpr *newnode;
+
+			newnode = (ConvertRowtypeExpr *) palloc(sizeof(ConvertRowtypeExpr));
+			*newnode = *r;
+			context->coerced_var = true;
+			newnode->arg = (Expr *) map_variable_attnos_mutator((Node *) r->arg, context);
+			context->coerced_var = false;
+
+			return (Node *) newnode;
+		}
+		/* Else fall through to the expression tree mutator */
+	}
 	else if (IsA(node, Query))
 	{
 		/* Recurse into RTE subquery or not-yet-planned sublink subquery */
@@ -1321,6 +1351,7 @@ map_variable_attnos(Node *node,
 	context.map_length = map_length;
 	context.to_rowtype = to_rowtype;
 	context.found_whole_row = found_whole_row;
+	context.coerced_var = false;
 
 	*found_whole_row = false;
 
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 454a940..9b222b6 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -68,6 +68,12 @@ typedef struct PartitionDispatchData
 	int		   *indexes;
 } PartitionDispatchData;
 
+typedef struct PartitionWalker
+{
+	List	   *rels_list;
+	ListCell   *cur_cell;
+} PartitionWalker;
+
 typedef struct PartitionDispatchData *PartitionDispatch;
 
 extern void RelationBuildPartitionDesc(Relation relation);
@@ -80,12 +86,16 @@ extern void check_new_partition_bound(char *relname, Relation parent,
 extern Oid	get_partition_parent(Oid relid);
 extern List *get_qual_from_partbound(Relation rel, Relation parent,
 						PartitionBoundSpec *spec);
-extern List *map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+extern List *map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
 
+extern void partition_walker_init(PartitionWalker *walker, Relation rel);
+extern Relation partition_walker_next(PartitionWalker *walker,
+									  Relation *parent);
+
 /* For tuple routing */
 extern PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
 								 int *num_parted, List **leaf_part_oids);
@@ -99,6 +109,9 @@ extern int get_partition_for_tuple(PartitionDispatch *pd,
 						EState *estate,
 						PartitionDispatchData **failed_at,
 						TupleTableSlot **failed_slot);
+extern void pull_child_partition_columns(Bitmapset **bitmapset,
+							 Relation rel,
+							 Relation parent);
 extern Oid	get_default_oid_from_partdesc(PartitionDesc partdesc);
 extern Oid	get_default_partition_oid(Oid parentId);
 extern void update_default_partition_oid(Oid parentId, Oid defaultPartId);
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 7708818..8e2bf5f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,7 +187,10 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
@@ -207,10 +210,12 @@ extern void EvalPlanQualSetTuple(EPQState *epqstate, Index rti,
 					 HeapTuple tuple);
 extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
 extern void ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionDispatch **pd,
-							   ResultRelInfo **partitions,
+							   ResultRelInfo ***partitions,
 							   TupleConversionMap ***tup_conv_maps,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions);
@@ -218,6 +223,8 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot, EState *estate);
 
 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
 extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 90a60ab..3034b01 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -511,6 +511,11 @@ typedef struct EState
 	struct dsa_area *es_query_dsa;
 } EState;
 
+/* For a given result relation, get its columns being inserted/updated. */
+#define GetInsertedColumns(relinfo, estate) \
+	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->insertedCols)
+#define GetUpdatedColumns(relinfo, estate) \
+	(rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->updatedCols)
 
 /*
  * ExecRowMark -
@@ -978,14 +983,31 @@ typedef struct ModifyTableState
 	int			mt_num_dispatch;	/* Number of entries in the above array */
 	int			mt_num_partitions;	/* Number of members in the following
 									 * arrays */
-	ResultRelInfo *mt_partitions;	/* Per partition result relation */
-	TupleConversionMap **mt_partition_tupconv_maps;
-	/* Per partition tuple conversion map */
+	ResultRelInfo **mt_partitions;	/* Per partition result relation pointers */
+
+	/*
+	 * Per partition conversion map to convert tuples from root to leaf
+	 * partition
+	 */
+	TupleConversionMap **mt_perleaf_parentchild_maps;
+
+	/*
+	 * Per partition conversion map to convert tuples from leaf partition to
+	 * root
+	 */
+	TupleConversionMap **mt_perleaf_childparent_maps;
+
+	/*
+	 * Per subplan conversion map to convert tuples from leaf partition to
+	 * root partitioned table
+	 */
+	TupleConversionMap **mt_persubplan_childparent_maps;
+
 	TupleTableSlot *mt_partition_tuple_slot;
+	TupleTableSlot *mt_rootpartition_tuple_slot;
+
 	struct TransitionCaptureState *mt_transition_capture;
 	/* controls transition table population */
-	TupleConversionMap **mt_transition_tupconv_maps;
-	/* Per plan/partition tuple conversion */
 } ModifyTableState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index a382331..6981f58 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -219,6 +219,7 @@ typedef struct ModifyTable
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	int			resultRelIndex; /* index of first resultRel in plan's list */
 	int			rootResultRelIndex; /* index of the partitioned table root */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index d50ff55..26f8fd3 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1579,6 +1579,7 @@ typedef struct ModifyTablePath
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	List	   *subpaths;		/* Path(s) producing source data */
 	List	   *subroots;		/* per-target-table PlannerInfos */
@@ -2021,6 +2022,10 @@ typedef struct AppendRelInfo
  * The child_rels list must contain at least one element, because the parent
  * partitioned table is itself counted as a child.
  *
+ * all_part_cols contains all attribute numbers from the parent that are
+ * used as partitioning columns by the parent or some descendant which is
+ * itself partitioned.
+ *
  * These structs are kept in the PlannerInfo node's pcinfo_list.
  */
 typedef struct PartitionedChildRelInfo
@@ -2029,6 +2034,7 @@ typedef struct PartitionedChildRelInfo
 
 	Index		parent_relid;
 	List	   *child_rels;
+	Bitmapset  *all_part_cols;
 } PartitionedChildRelInfo;
 
 /*
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e372f88..b38f2f1 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -238,6 +238,7 @@ extern ModifyTablePath *create_modifytable_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index 2a4cf71..c6c15c5 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -57,6 +57,7 @@ extern Expr *preprocess_phv_expression(PlannerInfo *root, Expr *expr);
 
 extern bool plan_cluster_use_sort(Oid tableOid, Oid indexOid);
 
-extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti);
+extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti,
+										Bitmapset **all_part_cols_p);
 
 #endif							/* PLANNER_H */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index cef70b1..82c63f9 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,27 +198,345 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+---------------------------
+-- UPDATE with row movement
+---------------------------
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c numeric
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+-- Create partitions intentionally in descending bound order, so as to test
+-- that the subplans are ordered in ascending bound order rather than by OID.
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (c numeric, a text, b int);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+create table part_c_1_100 (b int, c numeric, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, NULL), (''a'', 10, 200), (''b'', 12, 96), (''b'', 13, 97), (''b'', 15, 105), (''b'', 17, 105)'
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 105
+ part_c_100_200 | b | 17 | 105
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+(6 rows)
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+             QUERY PLAN              
+-------------------------------------
+ Update on range_parted
+   Update on part_a_1_a_10
+   Update on part_a_10_a_20
+   Update on part_b_1_b_10
+   Update on part_c_1_100
+   Update on part_c_100_200
+   ->  Seq Scan on part_a_1_a_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_a_10_a_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_1_b_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_1_100
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_100_200
+         Filter: (c > '97'::numeric)
+(16 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96;
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (12, 116, b).
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+select a, b, c from part_c_1_100 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select a, b, c from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 12 | 116
+ b | 13 | 117
+ b | 15 | 125
+ b | 17 | 125
+(4 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (117, b, 7).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_1_a_10  | a |  1 |    
+ part_a_1_a_10  | a |  4 | 200
+ part_b_1_b_10  | b |  7 | 117
+ part_b_1_b_10  | b |  9 | 125
+ part_c_100_200 | b | 11 | 125
+ part_c_100_200 | b | 12 | 116
+(6 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120).
+-- succeeds, row movement, check option passes
+update upview set a = 'b', b = 15 where b = 4;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_1_a_10  | a |  1 |    
+ part_b_1_b_10  | b |  7 | 117
+ part_b_1_b_10  | b |  9 | 125
+ part_c_100_200 | b | 11 | 125
+ part_c_100_200 | b | 12 | 116
+ part_c_100_200 | b | 15 | 199
+(6 rows)
+
+-- cleanup
+drop view upview;
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+ range_parted | a | b  | c  
+--------------+---+----+----
+ (b,15,95)    | b | 15 | 95
+ (b,17,95)    | b | 17 | 95
+(2 rows)
+
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+ part_c_1_100   | b | 15 |  95
+ part_c_1_100   | b | 17 |  95
+(6 rows)
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 105
+ part_c_100_200 | b | 17 | 105
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+(6 rows)
+
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96), (b,13,97), (b,15,105), (b,17,105), new table = (b,12,110), (b,13,98), (b,15,106), (b,17,106)
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 12 | 110
+ part_c_100_200 | b | 15 | 106
+ part_c_100_200 | b | 17 | 106
+ part_c_1_100   | b | 13 |  98
+(6 rows)
+
+:init_range_parted;
+-- Enabling OLD TABLE capture for both DELETE as well as UPDATE stmt triggers
+-- should not cause DELETEd rows to be captured twice. Similar thing for
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96), (b,13,97), (b,15,105), (b,17,105), new table = (b,12,146), (b,13,147), (b,15,155), (b,17,155)
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 12 | 146
+ part_c_100_200 | b | 13 | 147
+ part_c_100_200 | b | 15 | 155
+ part_c_100_200 | b | 17 | 155
+(6 rows)
+
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_c100_200 before update or insert on part_c_100_200
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 110
+ part_c_100_200 | b | 17 | 106
+ part_c_100_200 | b | 19 | 106
+ part_c_1_100   | b | 15 |  98
+(6 rows)
+
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 146
+ part_c_100_200 | b | 16 | 147
+ part_c_100_200 | b | 17 | 155
+ part_c_100_200 | b | 19 | 155
+(6 rows)
+
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_c100_200 ON part_c_100_200;
+drop function func_parted_mod_b();
+-- statement triggers with update row movement
+---------------------------------------------------
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 105
+ part_c_100_200 | b | 17 | 105
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+(6 rows)
+
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_100_200
+create trigger c100_delete_trig
+  after delete on part_c_100_200 for each statement execute procedure trigfunc();
+create trigger c100_update_trig
+  after update on part_c_100_200 for each statement execute procedure trigfunc();
+create trigger c100_insert_trig
+  after insert on part_c_100_200 for each statement execute procedure trigfunc();
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+NOTICE:  trigger = parent_update_trig fired on table range_parted during UPDATE
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 150
+ part_a_1_a_10  | a |  1 |    
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+ part_c_1_100   | b | 15 |  55
+ part_c_1_100   | b | 17 |  55
+(6 rows)
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger c100_delete_trig ON part_c_100_200;
+drop trigger c100_update_trig ON part_c_100_200;
+drop trigger c100_insert_trig ON part_c_100_200;
+drop table mintab;
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
                                   Table "public.part_def"
@@ -226,6 +544,7 @@ create table part_def partition of range_parted default;
 --------+---------+-----------+----------+---------+----------+--------------+-------------
  a      | text    |           |          |         | extended |              | 
  b      | integer |           |          |         | plain    |              | 
+ c      | numeric |           |          |         | main     |              | 
 Partition of: range_parted DEFAULT
 Partition constraint: (NOT (((a = 'a'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'a'::text) AND (b >= 10) AND (b < 20)) OR ((a = 'b'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'b'::text) AND (b >= 10) AND (b < 20))))
 
@@ -235,7 +554,55 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 ERROR:  new row for relation "part_def" violates partition constraint
-DETAIL:  Failing row contains (a, 9).
+DETAIL:  Failing row contains (a, 9, null).
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 105
+ part_c_100_200 | b | 17 | 105
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+ part_def       | d |  9 |    
+(7 rows)
+
+-- Update row movement from non-default to default partition.
+-- Fail, the default partition is not under part_a_10_a_20.
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+ERROR:  new row for relation "part_a_10_a_20" violates partition constraint
+DETAIL:  Failing row contains (ad, 10, 200).
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+ partname | a  | b  |  c  
+----------+----+----+-----
+ part_def | ad |  1 |    
+ part_def | ad | 10 | 200
+ part_def | bd | 12 |  96
+ part_def | bd | 13 |  97
+ part_def | bd | 15 | 105
+ part_def | bd | 17 | 105
+ part_def | d  |  9 |    
+(7 rows)
+
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 105
+ part_c_100_200 | b | 17 | 105
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+ part_def       | d |  9 |    
+(7 rows)
+
 create table list_parted (
 	a text,
 	b int
@@ -250,6 +617,110 @@ ERROR:  new row for relation "list_default" violates partition constraint
 DETAIL:  Failing row contains (a, 10).
 -- ok
 update list_default set a = 'x' where a = 'd';
--- cleanup
+drop table list_parted;
+--------------
+-- UPDATE of partition key or non-partition columns, with different
+-- column orderings and with triggers.
+--------------
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+-- Test partition constraint violation when an intermediate ancestor is used
+-- and the constraint is inherited from the topmost root.
+update sub_parted set a = 2 where c = 10;
+ERROR:  new row for relation "sub_part2" violates partition constraint
+DETAIL:  Failing row contains (2, 10, 2).
+-- UPDATE that does not modify the partition key of the partitions chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Triggers can cause UPDATE row movement if they modify the partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This changes a partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+-- This should perform tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1.
+update list_parted set c = 70 where b = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+drop trigger parted_mod_b ON sub_part1;
+-- If a BR DELETE trigger prevents the DELETE from happening, we should also
+-- skip the INSERT when that DELETE is part of an UPDATE => DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+drop trigger trig_skip_delete ON sub_part1;
+-- UPDATE the partition key with a FROM clause. If the join produces multiple
+-- output rows for the same row to be modified, we should tuple-route the row
+-- only once; no extra rows should be inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 |  1 | 70
+ list_part1 | 2 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop table non_parted;
+drop function func_parted_mod_b();
 drop table range_parted;
 drop table list_parted;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 66d1fec..02d4c5e 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,25 +107,203 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+
+---------------------------
+-- UPDATE with row movement
+---------------------------
+
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. Updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c numeric
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+
+-- Create partitions intentionally in descending bound order, to test that
+-- the subplans are ordered in ascending bound order rather than by OID.
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (c numeric, a text, b int);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+create table part_c_1_100 (b int, c numeric, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, NULL), (''a'', 10, 200), (''b'', 12, 96), (''b'', 13, 97), (''b'', 15, 105), (''b'', 17, 105)'
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96;
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+select a, b, c from part_c_1_100 order by 1, 2, 3;
+select a, b, c from part_c_100_200 order by 1, 2, 3;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement, check option passes
+update upview set a = 'b', b = 15 where b = 4;
+
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+-- cleanup
+drop view upview;
+
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted), *;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+:init_range_parted;
+
+-- Enabling OLD TABLE capture for both DELETE and UPDATE statement triggers
+-- should not cause DELETEd rows to be captured twice. The same applies to
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+
+-- Install BR triggers on the child partitions, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_c100_200 before update or insert on part_c_100_200
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_c100_200 ON part_c_100_200;
+drop function func_parted_mod_b();
+
+
+-- statement triggers with update row movement
+---------------------------------------------------
+
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_100_200
+create trigger c100_delete_trig
+  after delete on part_c_100_200 for each statement execute procedure trigfunc();
+create trigger c100_update_trig
+  after update on part_c_100_200 for each statement execute procedure trigfunc();
+create trigger c100_insert_trig
+  after insert on part_c_100_200 for each statement execute procedure trigfunc();
+
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger c100_delete_trig ON part_c_100_200;
+drop trigger c100_update_trig ON part_c_100_200;
+drop trigger c100_insert_trig ON part_c_100_200;
+
+drop table mintab;
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
 
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
 insert into range_parted values ('c', 9);
@@ -134,6 +312,21 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+-- Update row movement from non-default to default partition.
+-- Fail, the default partition is not under part_a_10_a_20.
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
 create table list_parted (
 	a text,
 	b int
@@ -148,6 +341,82 @@ update list_default set a = 'a' where a = 'd';
 -- ok
 update list_default set a = 'x' where a = 'd';
 
--- cleanup
+drop table list_parted;
+
+--------------
+-- UPDATE of partition key or non-partition columns, with different
+-- column orderings and with triggers.
+--------------
+
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+
+-- Test partition constraint violation when an intermediate ancestor is used
+-- and the constraint is inherited from the topmost root.
+update sub_parted set a = 2 where c = 10;
+
+-- UPDATE that does not modify the partition key of the partitions chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+
+
+-----------
+-- Triggers can cause UPDATE row movement if they modify the partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This changes a partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+-- This should perform tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1.
+update list_parted set c = 70 where b = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger parted_mod_b ON sub_part1;
+
+-- If a BR DELETE trigger prevents the DELETE from happening, we should also
+-- skip the INSERT when that DELETE is part of an UPDATE => DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger trig_skip_delete ON sub_part1;
+
+-- UPDATE the partition key with a FROM clause. If the join produces multiple
+-- output rows for the same row to be modified, we should tuple-route the row
+-- only once; no extra rows should be inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+drop table non_parted;
+
+drop function func_parted_mod_b();
 drop table range_parted;
 drop table list_parted;
#162Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Khandekar (#161)
Re: UPDATE of partition key

On Fri, Sep 15, 2017 at 4:55 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 12 September 2017 at 12:39, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 12 September 2017 at 11:57, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Sep 12, 2017 at 11:15 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

I found out that, in the case where there is a DELETE statement trigger
using transition tables, it's not only an issue of redundancy; it's a
correctness issue. Since for transition tables both DELETE and UPDATE
use the same old-row tuplestore for capturing the OLD table, that table
gets duplicate rows: one from ExecARDeleteTriggers() and another from
ExecARUpdateTriggers(). In the presence of an INSERT statement trigger
using transition tables, the INSERT and UPDATE events have separate
tuplestores, so duplicate rows don't show up in the UPDATE NEW table.
Nevertheless, we need to prevent NEW rows from being collected in the
INSERT event tuplestore, and capture the NEW rows only in the UPDATE
event tuplestore.

In the attached patch, we first call ExecARUpdateTriggers(), and while
doing that, we save the info that a NEW row is already captured
(mtstate->mt_transition_capture->tcs_update_old_table == true). If it
is captured, we pass a NULL transition_capture pointer to
ExecARDeleteTriggers() (and ExecARInsertTriggers()) so that they do not
capture an extra row again.

Modified a test case in update.sql by including a DELETE statement
trigger that uses transition tables.

Ok, this fix looks correct to me, I will review the latest patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#163Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#162)
Re: UPDATE of partition key

On Mon, Sep 18, 2017 at 11:29 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Sep 15, 2017 at 4:55 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 12 September 2017 at 12:39, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 12 September 2017 at 11:57, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Sep 12, 2017 at 11:15 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

In the attached patch, we first call ExecARUpdateTriggers(), and while
doing that, we save the info that a NEW row is already captured
(mtstate->mt_transition_capture->tcs_update_old_table == true). If it
is captured, we pass a NULL transition_capture pointer to
ExecARDeleteTriggers() (and ExecARInsertTriggers()) so that they do not
capture an extra row again.

Modified a testcase in update.sql by including DELETE statement
trigger that uses transition tables.

Ok, this fix looks correct to me, I will review the latest patch.

Please find few more comments.

+ * in which they appear in the PartitionDesc. Also, extract the
+ * partition key columns of the root partitioned table. Those of the
+ * child partitions would be collected during recursive expansion.
*/
+ pull_child_partition_columns(&all_part_cols, oldrelation, oldrelation);
expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
  lockmode, &root->append_rel_list,
+   &all_part_cols,

pcinfo->all_part_cols is only used in case of update, I think we can
call pull_child_partition_columns
only if rte has updateCols?

@@ -2029,6 +2034,7 @@ typedef struct PartitionedChildRelInfo

Index parent_relid;
List *child_rels;
+ Bitmapset *all_part_cols;
} PartitionedChildRelInfo;

I might be missing something, but do we really need to store
all_part_cols inside the
PartitionedChildRelInfo, can't we call pull_child_partition_columns
directly inside
inheritance_planner whenever we realize that RTE has some updateCols
and we want to
check the overlap?

+extern void partition_walker_init(PartitionWalker *walker, Relation rel);
+extern Relation partition_walker_next(PartitionWalker *walker,
+  Relation *parent);
+

I don't see these functions are used anywhere?

+typedef struct PartitionWalker
+{
+ List   *rels_list;
+ ListCell   *cur_cell;
+} PartitionWalker;
+

Same as above

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


#164Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Dilip Kumar (#163)
Re: UPDATE of partition key

On 18 September 2017 at 20:45, Dilip Kumar <dilipbalaut@gmail.com> wrote:

Please find few more comments.

+ * in which they appear in the PartitionDesc. Also, extract the
+ * partition key columns of the root partitioned table. Those of the
+ * child partitions would be collected during recursive expansion.
*/
+ pull_child_partition_columns(&all_part_cols, oldrelation, oldrelation);
expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
lockmode, &root->append_rel_list,
+   &all_part_cols,

pcinfo->all_part_cols is only used in case of update, I think we can
call pull_child_partition_columns
only if rte has updateCols?

@@ -2029,6 +2034,7 @@ typedef struct PartitionedChildRelInfo

Index parent_relid;
List *child_rels;
+ Bitmapset *all_part_cols;
} PartitionedChildRelInfo;

I might be missing something, but do we really need to store
all_part_cols inside the
PartitionedChildRelInfo, can't we call pull_child_partition_columns
directly inside
inheritance_planner whenever we realize that RTE has some updateCols
and we want to
check the overlap?

One extra thing we would have to do is open and close the
partitioned rels again. The idea was to collect the bitmap
*while* we are already expanding through the tree and the rel is open.
Will check if this is feasible.

+extern void partition_walker_init(PartitionWalker *walker, Relation rel);
+extern Relation partition_walker_next(PartitionWalker *walker,
+  Relation *parent);
+

I don't see these functions are used anywhere?

+typedef struct PartitionWalker
+{
+ List   *rels_list;
+ ListCell   *cur_cell;
+} PartitionWalker;
+

Same as above

Yes, this was left out from the earlier implementation. Will have this
removed in the next updated patch.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


#165Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Khandekar (#164)
Re: UPDATE of partition key

On Tue, Sep 19, 2017 at 1:15 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 18 September 2017 at 20:45, Dilip Kumar <dilipbalaut@gmail.com> wrote:

Please find few more comments.

+ * in which they appear in the PartitionDesc. Also, extract the
+ * partition key columns of the root partitioned table. Those of the
+ * child partitions would be collected during recursive expansion.
*/
+ pull_child_partition_columns(&all_part_cols, oldrelation, oldrelation);
expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
lockmode, &root->append_rel_list,
+   &all_part_cols,

pcinfo->all_part_cols is only used in case of update, I think we can
call pull_child_partition_columns
only if rte has updateCols?

@@ -2029,6 +2034,7 @@ typedef struct PartitionedChildRelInfo

Index parent_relid;
List *child_rels;
+ Bitmapset *all_part_cols;
} PartitionedChildRelInfo;

I might be missing something, but do we really need to store
all_part_cols inside the
PartitionedChildRelInfo, can't we call pull_child_partition_columns
directly inside
inheritance_planner whenever we realize that RTE has some updateCols
and we want to
check the overlap?

One thing we will have to do extra is : Open and close the
partitioned rels again. The idea was that we collect the bitmap
*while* we are already expanding through the tree and the rel is open.
Will check if this is feasible.

Oh, I see.

+extern void partition_walker_init(PartitionWalker *walker, Relation rel);
+extern Relation partition_walker_next(PartitionWalker *walker,
+  Relation *parent);
+

I don't see these functions are used anywhere?

+typedef struct PartitionWalker
+{
+ List   *rels_list;
+ ListCell   *cur_cell;
+} PartitionWalker;
+

Same as above

Yes, this was left out from the earlier implementation. Will have this
removed in the next updated patch.

Ok. I will continue my review thanks.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


#166Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#161)
Re: UPDATE of partition key

On Fri, Sep 15, 2017 at 7:25 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

[ new patch ]

This already fails to apply again. In general, I think it would be a
good idea to break this up into a patch series rather than have it as
a single patch. That would allow some bits to be applied earlier.
The main patch will probably still be pretty big, but at least we can
make things a little easier by getting some of the cleanup out of the
way first. Specific suggestions on what to break out below.

If the changes to rewriteManip.c are a marginal efficiency hack and
nothing more, then let's commit this part separately before the main
patch. If they're necessary for correctness, then please add a
comment explaining why they are necessary.

There appears to be no reason why the definitions of
GetInsertedColumns() and GetUpdatedColumns() need to be moved to a
header file as a result of this patch. GetUpdatedColumns() was
previously defined in trigger.c and execMain.c and, post-patch, is
still called from only those files. GetInsertedColumns() was, and
remains, called only from execMain.c. If this were needed I'd suggest
doing it as a preparatory patch before the main patch, but it seems we
don't need it at all.

If I understand correctly, the reason for changing mt_partitions from
ResultRelInfo * to ResultRelInfo ** is that, currently, all of the
ResultRelInfos for a partitioning hierarchy are allocated as a single
chunk, but we can't do that and also reuse the ResultRelInfos created
during InitPlan. I suggest that we do this as a preparatory patch.
Someone could argue that this is going the wrong way and that we ought
to instead make InitPlan() create all of the necessary
ResultRelInfos, but it seems to me that eventually we probably want to
allow setting up ResultRelInfos on the fly for only those partitions
for which we end up needing them. The code already has some provision
for creating ResultRelInfos on the fly - see ExecGetTriggerResultRel.
I don't think it's this patch's job to try to apply that kind of thing
to tuple routing, but it seems like in the long run if we're inserting
1 tuple into a table with 1000 partitions, or performing 1 update that
touches the partition key, it would be best not to create
ResultRelInfos for all 1000 partitions just for fun. But this sort of
thing seems much easier if mt_partitions is ResultRelInfo ** rather
than ResultRelInfo *, so I think what you have is going in the right
direction.

+         * mtstate->resultRelInfo[]. Note: We assume that if the resultRelInfo
+         * does not belong to subplans, then it already matches the root tuple
+         * descriptor; although there is no such known scenario where this
+         * could happen.
+         */
+        if (rootResultRelInfo != resultRelInfo &&
+            mtstate->mt_persubplan_childparent_maps != NULL &&
+            resultRelInfo >= mtstate->resultRelInfo &&
+            resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1)
+        {
+            int         map_index = resultRelInfo - mtstate->resultRelInfo;

I think you should Assert() that it doesn't happen instead of assuming
that it doesn't happen. IOW, remove the last two branches of the
if-condition, and then add an Assert() that map_index is sane.

It is not clear to me why we need both mt_perleaf_childparent_maps and
mt_persubplan_childparent_maps.

+         * Note: if the UPDATE is converted into a DELETE+INSERT as part of
+         * update-partition-key operation, then this function is also called
+         * separately for DELETE and INSERT to capture transition table rows.
+         * In such case, either old tuple or new tuple can be NULL.

That seems pretty strange. I don't quite see how that's going to work
correctly. I'm skeptical about the idea that the old tuple capture
and new tuple capture can safely happen at different times.

I wonder if we should have a reloption controlling whether
update-tuple routing is enabled. I wonder how much more expensive it
is to execute UPDATE root SET a = a + 1 WHERE a = 1 on a table with
1000 subpartitions with this patch than without, assuming the update
succeeds in both cases.

I also wonder how efficient this implementation is in general. For
example, suppose you make a table with 1000 partitions each containing
10,000 tuples and update them all, and consider three scenarios: (1)
partition key not updated but all tuples subject to non-HOT updates
because the updated column is indexed, (2) partition key updated but
no tuple movement required as a result, (3) partition key updated and
all tuples move to a different partition. It would be useful to
compare the times, and also to look at perf profiles and see if there
are any obvious sources of inefficiency that can be squeezed out. It
wouldn't surprise me if tuple movement is a bit slower than the other
scenarios, but it would be nice to know how much slower and whether
the bottlenecks are anything that we can easily fix. I don't feel
that the performance constraints for this patch should be too tight,
because we're talking about being able to do something vs. not being
able to do it at all, but we should try to have it not stink.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#167Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#166)
Re: UPDATE of partition key

On 20 September 2017 at 00:06, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Sep 15, 2017 at 7:25 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

[ new patch ]

This already fails to apply again. In general, I think it would be a
good idea to break this up into a patch series rather than have it as
a single patch. That would allow some bits to be applied earlier.
The main patch will probably still be pretty big, but at least we can
make things a little easier by getting some of the cleanup out of the
way first. Specific suggestions on what to break out below.

If the changes to rewriteManip.c are a marginal efficiency hack and
nothing more, then let's commit this part separately before the main
patch. If they're necessary for correctness, then please add a
comment explaining why they are necessary.

Ok. Yes, I just wanted to avoid two ConvertRowtypeExpr nodes stacked
one over the other. But that was not causing any correctness issue. I
will extract these changes into a separate patch.

There appears to be no reason why the definitions of
GetInsertedColumns() and GetUpdatedColumns() need to be moved to a
header file as a result of this patch. GetUpdatedColumns() was
previously defined in trigger.c and execMain.c and, post-patch, is
still called from only those files. GetInsertedColumns() was, and
remains, called only from execMain.c. If this were needed I'd suggest
doing it as a preparatory patch before the main patch, but it seems we
don't need it at all.

In earlier versions of the patch, these functions were used in
nodeModifyTable.c as well. Now that those calls are no longer there, I
will revert the changes that moved the definitions into the header
file.

If I understand correctly, the reason for changing mt_partitions from
ResultRelInfo * to ResultRelInfo ** is that, currently, all of the
ResultRelInfos for a partitioning hierarchy are allocated as a single
chunk, but we can't do that and also reuse the ResultRelInfos created
during InitPlan. I suggest that we do this as a preparatory patch.

Ok, will prepare a separate patch. Do you mean to include in that
patch the changes I did in ExecSetupPartitionTupleRouting() that
re-use the ResultRelInfo structures of the per-subplan update result
rels?

Someone could argue that this is going the wrong way and that we ought
to instead make InitPlan() create all of the necessary
ResultRelInfos, but it seems to me that eventually we probably want to
allow setting up ResultRelInfos on the fly for only those partitions
for which we end up needing them. The code already has some provision
for creating ResultRelInfos on the fly - see ExecGetTriggerResultRel.
I don't think it's this patch's job to try to apply that kind of thing
to tuple routing, but it seems like in the long run if we're inserting
1 tuple into a table with 1000 partitions, or performing 1 update that
touches the partition key, it would be best not to create
ResultRelInfos for all 1000 partitions just for fun.

Yes makes sense.

But this sort of
thing seems much easier if mt_partitions is ResultRelInfo ** rather
than ResultRelInfo *, so I think what you have is going in the right
direction.

Ok.

+         * mtstate->resultRelInfo[]. Note: We assume that if the resultRelInfo
+         * does not belong to subplans, then it already matches the root tuple
+         * descriptor; although there is no such known scenario where this
+         * could happen.
+         */
+        if (rootResultRelInfo != resultRelInfo &&
+            mtstate->mt_persubplan_childparent_maps != NULL &&
+            resultRelInfo >= mtstate->resultRelInfo &&
+            resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1)
+        {
+            int         map_index = resultRelInfo - mtstate->resultRelInfo;

I think you should Assert() that it doesn't happen instead of assuming
that it doesn't happen. IOW, remove the last two branches of the
if-condition, and then add an Assert() that map_index is sane.

Ok.

It is not clear to me why we need both mt_perleaf_childparent_maps and
mt_persubplan_childparent_maps.

mt_perleaf_childparent_maps:
This is used for converting transition-captured
inserted/modified/deleted tuples from a leaf to the root partition,
because we need to have all the rows in the root partition's attribute
order. This map is used only for tuples that are routed from the root
to a leaf partition during INSERT, or when tuples are routed from one
leaf partition to another during update row movement. For both of
these operations, we need per-leaf maps, because during tuple
conversion, the source relation is among mtstate->mt_partitions.

mt_persubplan_childparent_maps:
This is used in two places:

1. After ExecUpdate() updates a row of a per-subplan update result
rel, we need to capture the tuple, so again we need to convert it to
the root partition. Here, the source table is a per-subplan update
result rel, so we need a per-subplan conversion map array. So after
the UPDATE finishes with one update result rel,
node->mt_transition_capture->tcs_map shifts to the next element in the
mt_persubplan_childparent_maps array:
ExecModifyTable()
{
....
node->mt_transition_capture->tcs_map =
node->mt_persubplan_childparent_maps[node->mt_whichplan];
....
}

2. In ExecInsert(), if it is part of update tuple routing, we need to
convert the tuple from the update result rel to the root partition. So
it re-uses this same conversion map.

Now, instead of these two maps having separate allocations, I have
arranged for the per-leaf map array to re-use the mapping allocations
made for the per-subplan array elements, similar to how we re-use the
ResultRelInfos. But the arrays themselves still need to be separate.
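
For illustration, here is a minimal, self-contained C sketch of the
merge-style walk this re-use implies. ConvMap is a hypothetical
stand-in for TupleConversionMap, and plain oid arrays stand in for the
relcache lookups; it is not the patch code, only the idea that both
arrays are in the same canonical order, so a single cursor over the
per-subplan array suffices:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for TupleConversionMap; illustration only. */
typedef struct ConvMap
{
	int			from_relid;
} ConvMap;

/*
 * Build the per-leaf map array so that it re-uses the allocations already
 * made for the per-subplan array, allocating fresh maps only for leaf
 * partitions that have no matching subplan. Both oid arrays are assumed
 * to be sorted in the same canonical order.
 */
static ConvMap **
build_perleaf_maps(ConvMap **persubplan, const int *subplan_oids,
				   int nsubplans, const int *leaf_oids, int nleaf)
{
	ConvMap   **perleaf = calloc(nleaf, sizeof(ConvMap *));
	int			cur = 0;

	for (int i = 0; i < nleaf; i++)
	{
		if (cur < nsubplans && subplan_oids[cur] == leaf_oids[i])
			perleaf[i] = persubplan[cur++];		/* re-use the allocation */
		else
		{
			perleaf[i] = malloc(sizeof(ConvMap));	/* leaf-only map */
			perleaf[i]->from_relid = leaf_oids[i];
		}
	}
	return perleaf;
}
```

So for subplan oids {10, 30} over leaf oids {10, 20, 30}, only the map
for oid 20 is newly allocated; the other two array slots point at the
existing per-subplan maps.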

+         * Note: if the UPDATE is converted into a DELETE+INSERT as part of
+         * update-partition-key operation, then this function is also called
+         * separately for DELETE and INSERT to capture transition table rows.
+         * In such case, either old tuple or new tuple can be NULL.

That seems pretty strange. I don't quite see how that's going to work
correctly. I'm skeptical about the idea that the old tuple capture
and new tuple capture can safely happen at different times.

Actually, the tuple capture just involves adding the tuple into the
correct tuplestore for a particular event. There is no trigger event
added for tuple capture. Calling ExecARUpdateTriggers() with either
newtuple NULL or tupleid Invalid makes sure that it does not do
anything other than transition capture:

@@ -5306,7 +5322,8 @@ AfterTriggerSaveEvent(EState *estate,
ResultRelInfo *relinfo,
                /* If transition tables are the only reason we're
here, return. */
                if (trigdesc == NULL ||
                        (event == TRIGGER_EVENT_DELETE &&
!trigdesc->trig_delete_after_row) ||
                        (event == TRIGGER_EVENT_INSERT &&
!trigdesc->trig_insert_after_row) ||
-                       (event == TRIGGER_EVENT_UPDATE &&
!trigdesc->trig_update_after_row))
+                       (event == TRIGGER_EVENT_UPDATE &&
!trigdesc->trig_update_after_row) ||
+                       (event == TRIGGER_EVENT_UPDATE && (oldtup ==
NULL || newtup == NULL)))
                        return;

Even if we imagine a single place or a single function that we could
call to do the OLD and NEW row capture, the end result is still going
to be the same: the OLD row would go into
mtstate->mt_transition_capture->tcs_old_tuplestore, and the NEW row
would end up in mtstate->mt_transition_capture->tcs_update_tuplestore.
Note that these are common tuplestores for all the partitions of the
partition tree.

(Actually, I am still rebasing my patch over the recent changes where
tcs_update_tuplestore no longer exists; instead we need to use
transition_capture->tcs_private->new_tuplestore.)

When we access the OLD and NEW tables in an UPDATE trigger, there is
no longer a correlation between which row of the OLD TABLE corresponds
to which row of the NEW TABLE for a given updated row. So exactly when
the OLD row and the NEW row get captured into their respective
tuplestores, and in which order, is not important.

Whereas, for the usual per ROW triggers, it is critical that the
trigger event has both the OLD and NEW row together in the same
trigger event, since they need to be both accessible in the same
trigger function.

Doing the OLD and NEW table row capture separately is essential
because the DELETE and INSERT happen on different tables, so we are
not even sure whether the insert is going to happen at all (due to
triggers on the partitions, if any). If the insert is skipped, we
should not capture that tuple.
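
To make the decoupling concrete, here is a toy C sketch (ToyStore is a
made-up counter, not the real Tuplestorestate, and the function names
are illustrative): each half of the row movement captures only the
side it has, and a skipped insert simply means the NEW-side capture
never runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a shared transition-capture tuplestore. */
typedef struct ToyStore
{
	int			nrows;
} ToyStore;

/*
 * Capture only the side that is present, analogous to calling the
 * AR-update capture separately for the DELETE and INSERT halves.
 */
static void
capture_update_row(ToyStore *old_store, ToyStore *new_store,
				   const int *oldrow, const int *newrow)
{
	if (oldrow != NULL)
		old_store->nrows++;		/* OLD TABLE side, at DELETE time */
	if (newrow != NULL)
		new_store->nrows++;		/* NEW TABLE side, at INSERT time */
}

/*
 * One moved row: the DELETE half always captures the OLD row; the
 * INSERT half captures the NEW row only if a trigger on the
 * destination partition did not skip the insert.
 */
static void
move_row(ToyStore *old_store, ToyStore *new_store, bool insert_skipped)
{
	int			row = 42;

	capture_update_row(old_store, new_store, &row, NULL);	/* DELETE half */
	if (!insert_skipped)
		capture_update_row(old_store, new_store, NULL, &row);	/* INSERT half */
}
```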

I wonder if we should have a reloption controlling whether
update-tuple routing is enabled. I wonder how much more expensive it
is to execute UPDATE root SET a = a + 1 WHERE a = 1 on a table with
1000 subpartitions with this patch than without, assuming the update
succeeds in both cases.

You mean to check how much the patch slows down existing updates that
involve no row movement? And so have the reloption as a way to disable
the logic that causes the slowdown?

I also wonder how efficient this implementation is in general. For
example, suppose you make a table with 1000 partitions each containing
10,000 tuples and update them all, and consider three scenarios: (1)
partition key not updated but all tuples subject to non-HOT updates
because the updated column is indexed, (2) partition key updated but
no tuple movement required as a result, (3) partition key updated and
all tuples move to a different partition. It would be useful to
compare the times, and also to look at perf profiles and see if there
are any obvious sources of inefficiency that can be squeezed out. It
wouldn't surprise me if tuple movement is a bit slower than the other
scenarios, but it would be nice to know how much slower and whether
the bottlenecks are anything that we can easily fix.

Ok yeah that would be helpful to remove any unnecessary slowness that
may have been caused due to the patch; will do.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


#168amul sul
sulamul@gmail.com
In reply to: Amit Khandekar (#167)
Re: UPDATE of partition key

On Wed, Sep 20, 2017 at 9:27 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 20 September 2017 at 00:06, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Sep 15, 2017 at 7:25 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

[ new patch ]

86 - (event == TRIGGER_EVENT_UPDATE &&
!trigdesc->trig_update_after_row))
87 + (event == TRIGGER_EVENT_UPDATE &&
!trigdesc->trig_update_after_row) ||
88 + (event == TRIGGER_EVENT_UPDATE && (oldtup == NULL ||
newtup == NULL)))
89 return;
90 }

Exactly one of oldtup and newtup will be valid at a time. Can we
improve this check accordingly?

For e.g.:
(event == TRIGGER_EVENT_UPDATE &&
 (HeapTupleIsValid(oldtup) ^ ItemPointerIsValid(newtup)))
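
A minimal sketch of the suggested exactly-one-valid test, using a
hypothetical helper on plain pointers rather than the real
HeapTupleIsValid/ItemPointerIsValid macros:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * True when exactly one of the two pointers is set. For row-movement
 * transition capture, exactly one of the old and new tuples should be
 * present, so XOR-ing the two validity tests captures that invariant.
 */
static bool
exactly_one_set(const void *oldtup, const void *newtup)
{
	return (oldtup != NULL) ^ (newtup != NULL);
}
```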

247
248 + /*
249 + * EDB: In case this is part of update tuple routing, put this row
into the
250 + * transition NEW TABLE if we are capturing transition tables. We
need to
251 + * do this separately for DELETE and INSERT because they happen on
252 + * different tables.
253 + */
254 + if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_
capture)
255 + ExecARUpdateTriggers(estate, resultRelInfo, NULL,
256 + NULL,
257 + tuple,
258 + NULL,
259 + mtstate->mt_transition_capture);
260 +
261 list_free(recheckIndexes);

267
268 + /*
269 + * EDB: In case this is part of update tuple routing, put this row
into the
270 + * transition OLD TABLE if we are capturing transition tables. We
need to
271 + * do this separately for DELETE and INSERT because they happen on
272 + * different tables.
273 + */
274 + if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_
capture)
275 + ExecARUpdateTriggers(estate, resultRelInfo, tupleid,
276 + oldtuple,
277 + NULL,
278 + NULL,
279 + mtstate->mt_transition_capture);
280 +

Initially, I wondered why we can't have the above code right after
ExecInsert() & ExecDelete() in ExecUpdate() respectively?

We can do that for ExecDelete(), but not easily in the ExecInsert()
case, because ExecInsert() internally searches for the correct
partition's resultRelInfo for the insert, and before returning to
ExecUpdate() the resultRelInfo is restored to the old one. That's why
the current logic seems reasonable for now. Is there anything that we
can do?
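
That save/restore behaviour can be modelled with a toy C sketch (the
names are illustrative, not the real executor state): the routed leaf
rel is only "current" inside the function, so any capture that needs
it has to happen there, not back in the caller.

```c
#include <assert.h>

static int	current_rel;			/* stand-in for es_result_relation_info */
static int	captured_in_rel = -1;	/* which rel the capture saw */

/*
 * Toy model of ExecInsert(): switch to the routed leaf partition, do the
 * capture while it is current, and restore the old result rel before
 * returning, so the caller never sees the leaf rel.
 */
static void
toy_exec_insert(int routed_leaf_rel)
{
	int			saved = current_rel;

	current_rel = routed_leaf_rel;	/* switch to the selected partition */
	captured_in_rel = current_rel;	/* capture must happen here ... */
	current_rel = saved;			/* ... because the switch is undone */
}
```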

Regards,
Amul

#169Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#167)
3 attachment(s)
Re: UPDATE of partition key

I have extracted a couple of changes into preparatory patches, as
explained below :

On 20 September 2017 at 21:27, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 20 September 2017 at 00:06, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Sep 15, 2017 at 7:25 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

[ new patch ]

This already fails to apply again. In general, I think it would be a
good idea to break this up into a patch series rather than have it as
a single patch. That would allow some bits to be applied earlier.
The main patch will probably still be pretty big, but at least we can
make things a little easier by getting some of the cleanup out of the
way first. Specific suggestions on what to break out below.

If the changes to rewriteManip.c are a marginal efficiency hack and
nothing more, then let's commit this part separately before the main
patch. If they're necessary for correctness, then please add a
comment explaining why they are necessary.

Ok. Yes, I just wanted to avoid two ConvertRowtypeExpr nodes stacked
one over the other. But that was not causing any correctness issue. I
will extract these changes into a separate patch.

The patch for the above change is :
0002-Prevent-a-redundant-ConvertRowtypeExpr-node.patch

There appears to be no reason why the definitions of
GetInsertedColumns() and GetUpdatedColumns() need to be moved to a
header file as a result of this patch. GetUpdatedColumns() was
previously defined in trigger.c and execMain.c and, post-patch, is
still called from only those files. GetInsertedColumns() was, and
remains, called only from execMain.c. If this were needed I'd suggest
doing it as a preparatory patch before the main patch, but it seems we
don't need it at all.

In earlier versions of the patch, these functions were used in
nodeModifyTable.c as well. Now that those calls are no longer there, I
will revert the changes that moved the definitions into the header
file.

Did the above, and included it in the attached revised patch
update-partition-key_v19.patch.

If I understand correctly, the reason for changing mt_partitions from
ResultRelInfo * to ResultRelInfo ** is that, currently, all of the
ResultRelInfos for a partitioning hierarchy are allocated as a single
chunk, but we can't do that and also reuse the ResultRelInfos created
during InitPlan. I suggest that we do this as a preparatory patch.

Ok, will prepare a separate patch. Do you mean to include in that
patch the changes I did in ExecSetupPartitionTupleRouting() that
re-use the ResultRelInfo structures of the per-subplan update result
rels?

The above changes are in the attached
0001-Re-use-UPDATE-result-rels-created-in-InitPlan.patch.

Patches are to be applied in this order :

0001-Re-use-UPDATE-result-rels-created-in-InitPlan.patch
0002-Prevent-a-redundant-ConvertRowtypeExpr-node.patch
update-partition-key_v19.patch

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

0001-Re-use-UPDATE-result-rels-created-in-InitPlan.patch (application/octet-stream)
From: Amit Khandekar <amit.khandekar@enterprisedb.com>
Date: Fri, 22 Sep 2017 09:54:15 +0530
Subject: [PATCH 1/2] Re-use UPDATE result rels created in InitPlan.

For UPDATE tuple routing, we need to have result rels for the leaf
partitions. Since we already have at least a subset of those result
rels in the form of UPDATE per-subplan result rels, arrange for
re-using them instead of creating new ones for all of the leaf
partitions.

For this, the mtstate->mt_partitions needs to be an array of
ResultRelInfo * rather than an array of ResultRelInfo. This way, when
a leaf partition already has a result rel allocated in the
mtstate->resultRelInfo, the mt_partitions array element would point to
this allocated structure.
---
 src/backend/commands/copy.c            |  12 ++--
 src/backend/executor/execMain.c        | 125 ++++++++++++++++++++++++++++-----
 src/backend/executor/nodeModifyTable.c |  75 ++++++++++++--------
 src/include/executor/executor.h        |   4 +-
 src/include/nodes/execnodes.h          |   2 +-
 5 files changed, 163 insertions(+), 55 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index c6fa445..098bc66 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -168,7 +168,7 @@ typedef struct CopyStateData
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;	/* Number of entries in the above array */
 	int			num_partitions; /* Number of members in the following arrays */
-	ResultRelInfo *partitions;	/* Per partition result relation */
+	ResultRelInfo **partitions;	/* Per partition result relation pointers */
 	TupleConversionMap **partition_tupconv_maps;
 	TupleTableSlot *partition_tuple_slot;
 	TransitionCaptureState *transition_capture;
@@ -2451,13 +2451,15 @@ CopyFrom(CopyState cstate)
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo *partitions;
+		ResultRelInfo **partitions;
 		TupleConversionMap **partition_tupconv_maps;
 		TupleTableSlot *partition_tuple_slot;
 		int			num_parted,
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(cstate->rel,
+									   NULL,
+									   0,
 									   1,
 									   estate,
 									   &partition_dispatch_info,
@@ -2487,7 +2489,7 @@ CopyFrom(CopyState cstate)
 			for (i = 0; i < cstate->num_partitions; ++i)
 			{
 				cstate->transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(cstate->partitions[i].ri_RelationDesc),
+					convert_tuples_by_name(RelationGetDescr(cstate->partitions[i]->ri_RelationDesc),
 										   RelationGetDescr(cstate->rel),
 										   gettext_noop("could not convert row type"));
 			}
@@ -2618,7 +2620,7 @@ CopyFrom(CopyState cstate)
 			 * to the selected partition.
 			 */
 			saved_resultRelInfo = resultRelInfo;
-			resultRelInfo = cstate->partitions + leaf_part_index;
+			resultRelInfo = cstate->partitions[leaf_part_index];
 
 			/* We do not yet have a way to insert into a foreign partition */
 			if (resultRelInfo->ri_FdwRoutine)
@@ -2848,7 +2850,7 @@ CopyFrom(CopyState cstate)
 		}
 		for (i = 0; i < cstate->num_partitions; i++)
 		{
-			ResultRelInfo *resultRelInfo = cstate->partitions + i;
+			ResultRelInfo *resultRelInfo = cstate->partitions[i];
 
 			ExecCloseIndices(resultRelInfo);
 			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 62fb05e..b31ab36 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -3243,10 +3243,14 @@ EvalPlanQualEnd(EPQState *epqstate)
  * ExecSetupPartitionTupleRouting - set up information needed during
  * tuple routing for partitioned tables
  *
+ * 'update_rri' has the UPDATE per-subplan result rels.
+ * 'num_update_rri' : number of UPDATE per-subplan result rels. For INSERT,
+ *      this is 0.
+ *
  * Output arguments:
  * 'pd' receives an array of PartitionDispatch objects with one entry for
  *		every partitioned table in the partition tree
- * 'partitions' receives an array of ResultRelInfo objects with one entry for
+ * 'partitions' receives an array of ResultRelInfo* objects with one entry for
  *		every leaf partition in the partition tree
  * 'tup_conv_maps' receives an array of TupleConversionMap objects with one
  *		entry for every leaf partition (required to convert input tuple based
@@ -3266,10 +3270,12 @@ EvalPlanQualEnd(EPQState *epqstate)
  */
 void
 ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionDispatch **pd,
-							   ResultRelInfo **partitions,
+							   ResultRelInfo ***partitions,
 							   TupleConversionMap ***tup_conv_maps,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions)
@@ -3278,7 +3284,9 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
+	ResultRelInfo *leaf_part_arr;
+	ResultRelInfo *cur_update_rri;
+	Oid			cur_reloid = InvalidOid;
 
 	/*
 	 * Get the information about the partition tree after locking all the
@@ -3287,11 +3295,38 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
 	*pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
 	*num_partitions = list_length(leaf_parts);
-	*partitions = (ResultRelInfo *) palloc(*num_partitions *
-										   sizeof(ResultRelInfo));
+	*partitions = (ResultRelInfo **) palloc(*num_partitions *
+											sizeof(ResultRelInfo *));
 	*tup_conv_maps = (TupleConversionMap **) palloc0(*num_partitions *
 													 sizeof(TupleConversionMap *));
 
+	if (num_update_rri != 0)
+	{
+		/*
+		 * For Updates, if the leaf partition is already present in the
+		 * per-subplan result rels, we re-use that rather than initialize a
+		 * new result rel. The per-subplan resultrels and the resultrels of
+		 * the leaf partitions are both in the same canonical order. So while
+		 * going through the leaf partition oids, we need to keep track of the
+		 * next per-subplan result rel to be looked for in the leaf partition
+		 * resultrels. So, set the position of cur_update_rri to the first
+		 * per-subplan result rel, and then shift it as we find them one by
+		 * one while scanning the leaf partition oids.
+		 */
+		cur_update_rri = update_rri;
+		cur_reloid = RelationGetRelid(cur_update_rri->ri_RelationDesc);
+	}
+	else
+	{
+		/*
+		 * For inserts, we need to create all new result rels, so avoid
+		 * repeated pallocs by allocating memory for all the result rels in
+		 * bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(*num_partitions *
+												  sizeof(ResultRelInfo));
+	}
+
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
 	 * given partition's rowtype.  It is attached to the caller-specified node
@@ -3300,19 +3335,70 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	 */
 	*partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = *partitions;
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (num_update_rri != 0)
+		{
+			/* Is this leaf partition present in the update resultrel ? */
+			if (cur_reloid == leaf_oid)
+			{
+				Assert(cur_update_rri <= update_rri + num_update_rri - 1);
+
+				leaf_part_rri = cur_update_rri;
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required when converting tuple as per root
+				 * partition tuple descriptor. When generating the update
+				 * plans, this was not set.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+
+				cur_update_rri++;
+
+				/*
+				 * If this was the last UPDATE resultrel, indicate that by
+				 * invalidating the cur_reloid.
+				 */
+				if (cur_update_rri == update_rri + num_update_rri)
+					cur_reloid = InvalidOid;
+				else
+					cur_reloid = RelationGetRelid(cur_update_rri->ri_RelationDesc);
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = leaf_part_arr + i;
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in *partitions are eventually
-		 * closed by the caller.
+		 * If we didn't open the partition rel, it means we haven't
+		 * initialized the result rel as well.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above including the leaf
+			 * partitions. Note that each of the newly opened relations in
+			 * *partitions are eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri,
+							  partrel,
+							  resultRTindex,
+							  rel,
+							  estate->es_instrument);
+		}
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
@@ -3322,12 +3408,6 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
 													 gettext_noop("could not convert row type"));
 
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  resultRTindex,
-						  rel,
-						  estate->es_instrument);
-
 		/*
 		 * Verify result relation is a valid target for INSERT.
 		 */
@@ -3345,9 +3425,18 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		leaf_part_rri++;
+		(*partitions)[i] = leaf_part_rri;
 		i++;
 	}
+
+	/*
+	 * For UPDATE, we should have found all the per-subplan resultrels in the
+	 * leaf partitions; so cur_update_rri should be positioned just next to
+	 * the last per-subplan resultrel.
+	 */
+	Assert(num_update_rri == 0 ||
+		   (cur_reloid == InvalidOid &&
+			cur_update_rri == update_rri + num_update_rri));
 }
 
 /*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 845c409..a64b477 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -303,7 +303,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * the selected partition.
 		 */
 		saved_resultRelInfo = resultRelInfo;
-		resultRelInfo = mtstate->mt_partitions + leaf_part_index;
+		resultRelInfo = mtstate->mt_partitions[leaf_part_index];
 
 		/* We do not yet have a way to insert into a foreign partition */
 		if (resultRelInfo->ri_FdwRoutine)
@@ -1498,25 +1498,11 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	if (mtstate->mt_transition_capture != NULL ||
 		mtstate->mt_oc_transition_capture != NULL)
 	{
-		ResultRelInfo *resultRelInfos;
 		int			numResultRelInfos;
 
-		/* Find the set of partitions so that we can find their TupleDescs. */
-		if (mtstate->mt_partition_dispatch_info != NULL)
-		{
-			/*
-			 * For INSERT via partitioned table, so we need TupleDescs based
-			 * on the partition routing table.
-			 */
-			resultRelInfos = mtstate->mt_partitions;
-			numResultRelInfos = mtstate->mt_num_partitions;
-		}
-		else
-		{
-			/* Otherwise we need the ResultRelInfo for each subplan. */
-			resultRelInfos = mtstate->resultRelInfo;
-			numResultRelInfos = mtstate->mt_nplans;
-		}
+		numResultRelInfos = (mtstate->mt_partition_tuple_slot != NULL ?
+							 mtstate->mt_num_partitions :
+							 mtstate->mt_nplans);
 
 		/*
 		 * Build array of conversion maps from each child's TupleDesc to the
@@ -1526,12 +1512,36 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 		 */
 		mtstate->mt_transition_tupconv_maps = (TupleConversionMap **)
 			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
-		for (i = 0; i < numResultRelInfos; ++i)
+
+		/* Choose the right set of partitions */
+		if (mtstate->mt_partition_dispatch_info != NULL)
+		{
+			/*
+			 * For tuple routing among partitions, we need TupleDescs based
+			 * on the partition routing table.
+			 */
+			ResultRelInfo **resultRelInfos = mtstate->mt_partitions;
+
+			for (i = 0; i < numResultRelInfos; ++i)
+			{
+				mtstate->mt_transition_tupconv_maps[i] =
+					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
+										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
+										   gettext_noop("could not convert row type"));
+			}
+		}
+		else
 		{
-			mtstate->mt_transition_tupconv_maps[i] =
-				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
-									   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-									   gettext_noop("could not convert row type"));
+			/* Otherwise we need the ResultRelInfo for each subplan. */
+			ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+
+			for (i = 0; i < numResultRelInfos; ++i)
+			{
+				mtstate->mt_transition_tupconv_maps[i] =
+					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
+										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
+										   gettext_noop("could not convert row type"));
+			}
 		}
 
 		/*
@@ -1935,13 +1945,15 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo *partitions;
+		ResultRelInfo **partitions;
 		TupleConversionMap **partition_tupconv_maps;
 		TupleTableSlot *partition_tuple_slot;
 		int			num_parted,
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(rel,
+									   NULL,
+									   0,
 									   node->nominalRelation,
 									   estate,
 									   &partition_dispatch_info,
@@ -2014,14 +2026,16 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			   mtstate->mt_nplans == 1);
 		wcoList = linitial(node->withCheckOptionLists);
 		plan = mtstate->mt_plans[0];
-		resultRelInfo = mtstate->mt_partitions;
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
+			Relation	partrel;
 			List	   *mapped_wcoList;
 			List	   *wcoExprs = NIL;
 			ListCell   *ll;
 
+			resultRelInfo = mtstate->mt_partitions[i];
+			partrel = resultRelInfo->ri_RelationDesc;
+
 			/* varno = node->nominalRelation */
 			mapped_wcoList = map_partition_varattnos(wcoList,
 													 node->nominalRelation,
@@ -2037,7 +2051,6 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 
 			resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
 			resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
-			resultRelInfo++;
 		}
 	}
 
@@ -2088,13 +2101,15 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
 		 * are handled above.
 		 */
-		resultRelInfo = mtstate->mt_partitions;
 		returningList = linitial(node->returningLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
+			Relation	partrel;
 			List	   *rlist;
 
+			resultRelInfo = mtstate->mt_partitions[i];
+			partrel = resultRelInfo->ri_RelationDesc;
+
 			/* varno = node->nominalRelation */
 			rlist = map_partition_varattnos(returningList,
 											node->nominalRelation,
@@ -2376,7 +2391,7 @@ ExecEndModifyTable(ModifyTableState *node)
 	}
 	for (i = 0; i < node->mt_num_partitions; i++)
 	{
-		ResultRelInfo *resultRelInfo = node->mt_partitions + i;
+		ResultRelInfo *resultRelInfo = node->mt_partitions[i];
 
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 7708818..cc1cc2a 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -207,10 +207,12 @@ extern void EvalPlanQualSetTuple(EPQState *epqstate, Index rti,
 					 HeapTuple tuple);
 extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
 extern void ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionDispatch **pd,
-							   ResultRelInfo **partitions,
+							   ResultRelInfo ***partitions,
 							   TupleConversionMap ***tup_conv_maps,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index c6d3021..9187f7a 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -978,7 +978,7 @@ typedef struct ModifyTableState
 	int			mt_num_dispatch;	/* Number of entries in the above array */
 	int			mt_num_partitions;	/* Number of members in the following
 									 * arrays */
-	ResultRelInfo *mt_partitions;	/* Per partition result relation */
+	ResultRelInfo **mt_partitions;	/* Per partition result relation pointers */
 	TupleConversionMap **mt_partition_tupconv_maps;
 	/* Per partition tuple conversion map */
 	TupleTableSlot *mt_partition_tuple_slot;
-- 
2.1.4

Attachment: 0002-Prevent-a-redundant-ConvertRowtypeExpr-node.patch (application/octet-stream)
From: Amit Khandekar <amit.khandekar@enterprisedb.com>
Date: Fri, 22 Sep 2017 10:10:16 +0530
Subject: [PATCH 2/2] Prevent a redundant ConvertRowtypeExpr node.

When the RETURNING clause has a whole-row var, we add a
ConvertRowtypeExpr node on top of that var to map it from the parent
to a child. This node's final result type is the parent composite
type. But when mapping a whole-row var from one child partition to
the other child partitions, the child expression can already have a
ConvertRowtypeExpr. In that case, prevent adding another
ConvertRowtypeExpr on top of it; instead, modify the var contained
in the already-existing ConvertRowtypeExpr.
---
 src/backend/rewrite/rewriteManip.c | 53 ++++++++++++++++++++++++++++++--------
 1 file changed, 42 insertions(+), 11 deletions(-)

diff --git a/src/backend/rewrite/rewriteManip.c b/src/backend/rewrite/rewriteManip.c
index 5c17213..58e98c0 100644
--- a/src/backend/rewrite/rewriteManip.c
+++ b/src/backend/rewrite/rewriteManip.c
@@ -1224,6 +1224,7 @@ typedef struct
 	/* Target type when converting whole-row vars */
 	Oid			to_rowtype;
 	bool	   *found_whole_row;	/* output flag */
+	bool		coerced_var;	/* var is under ConvertRowtypeExpr */
 } map_variable_attnos_context;
 
 static Node *
@@ -1267,22 +1268,29 @@ map_variable_attnos_mutator(Node *node,
 					/* Don't convert unless necessary. */
 					if (context->to_rowtype != var->vartype)
 					{
-						ConvertRowtypeExpr *r;
-
 						/* Var itself is converted to the requested type. */
 						newvar->vartype = context->to_rowtype;
 
 						/*
-						 * And a conversion node on top to convert back to the
-						 * original type.
+						 * If this var is already under a ConvertRowtypeExpr,
+						 * we don't have to add another one.
 						 */
-						r = makeNode(ConvertRowtypeExpr);
-						r->arg = (Expr *) newvar;
-						r->resulttype = var->vartype;
-						r->convertformat = COERCE_IMPLICIT_CAST;
-						r->location = -1;
-
-						return (Node *) r;
+						if (!context->coerced_var)
+						{
+							ConvertRowtypeExpr *r;
+
+							/*
+							 * And a conversion node on top to convert back to
+							 * the original type.
+							 */
+							r = makeNode(ConvertRowtypeExpr);
+							r->arg = (Expr *) newvar;
+							r->resulttype = var->vartype;
+							r->convertformat = COERCE_IMPLICIT_CAST;
+							r->location = -1;
+
+							return (Node *) r;
+						}
 					}
 				}
 			}
@@ -1290,6 +1298,28 @@ map_variable_attnos_mutator(Node *node,
 		}
 		/* otherwise fall through to copy the var normally */
 	}
+	else if (IsA(node, ConvertRowtypeExpr))
+	{
+		ConvertRowtypeExpr *r = (ConvertRowtypeExpr *) node;
+
+		/*
+		 * If this is coercing a var (which is typical), convert only the var,
+		 * rather than adding another ConvertRowtypeExpr on top of it.
+		 */
+		if (IsA(r->arg, Var))
+		{
+			ConvertRowtypeExpr *newnode;
+
+			newnode = (ConvertRowtypeExpr *) palloc(sizeof(ConvertRowtypeExpr));
+			*newnode = *r;
+			context->coerced_var = true;
+			newnode->arg = (Expr *) map_variable_attnos_mutator((Node *) r->arg, context);
+			context->coerced_var = false;
+
+			return (Node *) newnode;
+		}
+		/* Else fall through to the expression tree mutator */
+	}
 	else if (IsA(node, Query))
 	{
 		/* Recurse into RTE subquery or not-yet-planned sublink subquery */
@@ -1321,6 +1351,7 @@ map_variable_attnos(Node *node,
 	context.map_length = map_length;
 	context.to_rowtype = to_rowtype;
 	context.found_whole_row = found_whole_row;
+	context.coerced_var = false;
 
 	*found_whole_row = false;
 
-- 
2.1.4
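
The row movement implemented by the update-partition-key patch below can be illustrated with a minimal SQL sketch (the table and partition names here are made up for illustration, not taken from the patch's regression tests):

```sql
CREATE TABLE tab (col1 text, col2 int) PARTITION BY LIST (col1);
CREATE TABLE tab_a PARTITION OF tab FOR VALUES IN ('AAA');
CREATE TABLE tab_b PARTITION OF tab FOR VALUES IN ('BBB');
INSERT INTO tab VALUES ('BBB', 1);

-- Without the patch, the UPDATE below fails with:
--   ERROR:  new row for relation "tab_b" violates partition constraint
-- With the patch, the row is deleted from tab_b and inserted into tab_a.
UPDATE tab SET col1 = 'AAA' WHERE col2 = 1;
```

As discussed in the thread, the DELETE+INSERT performed behind the scenes fires the BR DELETE and BR INSERT triggers on the source and destination partitions respectively, in addition to the BR UPDATE trigger.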

Attachment: update-partition-key_v19.patch (application/octet-stream)
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index b05a9c2..5a436a1 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -2993,6 +2993,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition, one whose partition constraint the new row satisfies.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3285,9 +3290,20 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</> causes a row to move from one partition to
+       another, there is a chance that a concurrent <command>UPDATE</> or
+       <command>DELETE</> misses this row. Suppose that, during the row
+       movement, the row is still visible to the concurrent session, which is
+       about to <command>UPDATE</> or <command>DELETE</> the same row. That
+       DML operation can silently miss the row if the first session deletes
+       it from the partition as part of its <command>UPDATE</> row movement.
+       In such a case, the concurrent <command>UPDATE</>/<command>DELETE</>,
+       being unaware of the row movement, concludes that the row has just
+       been deleted, so there is nothing to be done for it. In the usual
+       case, where the table is not partitioned or there is no row movement,
+       the second session would have identified the newly updated row and
+       carried out the <command>UPDATE</>/<command>DELETE</> on this new row
+       version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 8a1619f..28cfc1a 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In
+   that case, if there is some other partition in the partition tree for
+   which this row satisfies its partition constraint, the row is moved to
+   that partition. If no such partition exists, an error occurs. The error
+   also occurs when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</> and
+   <command>INSERT</> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</> or <command>DELETE</> on the same row may
+   miss this row. For details, see
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index f5f74af..99b271f 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -154,6 +154,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it will be performed as a
+    <command>DELETE</command> from the original partition followed by
+    <command>INSERT</command> into the new partition. In this case, all
+    row-level <literal>BEFORE</> <command>UPDATE</command> triggers and all
+    row-level <literal>BEFORE</> <command>DELETE</command> triggers are fired
+    on the original partition. Then all row-level <literal>BEFORE</>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when any of
+    these triggers affect the row being moved. As far as <literal>AFTER ROW</>
+    triggers are concerned, <literal>AFTER</> <command>DELETE</command> and
+    <literal>AFTER</> <command>INSERT</command> triggers are applied; but
+    <literal>AFTER</> <command>UPDATE</command> triggers are not applied
+    because the <command>UPDATE</command> has been converted to a
+    <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 1ab6dba..737c9e30 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1105,7 +1105,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
 
 /*
  * map_partition_varattnos - maps varattno of any Vars in expr from the
- * parent attno to partition attno.
+ * attnos of 'from_rel' to the attnos of 'to_rel'. Each rel can be either
+ * a leaf partition or a partitioned table.
  *
  * We must allow for cases where physical attnos of a partition can be
  * different from the parent's.
@@ -1118,8 +1119,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
  * are working on Lists, so it's less messy to do the casts internally.
  */
 List *
-map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row)
 {
 	bool		my_found_whole_row = false;
@@ -1128,14 +1129,14 @@ map_partition_varattnos(List *expr, int target_varno,
 	{
 		AttrNumber *part_attnos;
 
-		part_attnos = convert_tuples_by_name_map(RelationGetDescr(partrel),
-												 RelationGetDescr(parent),
+		part_attnos = convert_tuples_by_name_map(RelationGetDescr(to_rel),
+												 RelationGetDescr(from_rel),
 												 gettext_noop("could not convert row type"));
 		expr = (List *) map_variable_attnos((Node *) expr,
-											target_varno, 0,
+											fromrel_varno, 0,
 											part_attnos,
-											RelationGetDescr(parent)->natts,
-											RelationGetForm(partrel)->reltype,
+											RelationGetDescr(from_rel)->natts,
+											RelationGetForm(to_rel)->reltype,
 											&my_found_whole_row);
 	}
 
@@ -2439,6 +2440,77 @@ error_exit:
 }
 
 /*
+ * For each column of rel which is in the partition key or which appears
+ * in an expression which is in the partition key, translate the attribute
+ * number of that column according to the given parent, and add the resulting
+ * column number to the bitmapset, offset as we frequently do by
+ * FirstLowInvalidHeapAttributeNumber.
+ */
+void
+pull_child_partition_columns(Bitmapset **bitmapset,
+							 Relation rel,
+							 Relation parent)
+{
+	PartitionKey key = RelationGetPartitionKey(rel);
+	int16		partnatts = get_partition_natts(key);
+	List	   *partexprs = get_partition_exprs(key);
+	ListCell   *lc;
+	Bitmapset  *child_keycols = NULL;
+	int			i;
+	AttrNumber *map;
+	int			child_keycol = -1;
+
+	/*
+	 * First, compute the complete set of partition columns for this rel. For
+	 * compatibility with the API exposed by pull_varattnos, we offset the
+	 * column numbers by FirstLowInvalidHeapAttributeNumber.
+	 */
+	for (i = 0; i < partnatts; i++)
+	{
+		AttrNumber	partattno = get_partition_col_attnum(key, i);
+
+		if (partattno != 0)
+			child_keycols =
+				bms_add_member(child_keycols,
+							   partattno - FirstLowInvalidHeapAttributeNumber);
+	}
+	foreach(lc, partexprs)
+	{
+		Node	   *expr = (Node *) lfirst(lc);
+
+		pull_varattnos(expr, 1, &child_keycols);
+	}
+
+	/*
+	 * Next, work out how to convert from the attribute numbers for the child
+	 * to the attribute numbers for the parent.
+	 */
+	map =
+		convert_tuples_by_name_map(RelationGetDescr(parent),
+								   RelationGetDescr(rel),
+								   gettext_noop("could not convert row type"));
+
+	/*
+	 * For each child key column we have identified, translate to the
+	 * corresponding parent key column.  Entry 0 in the map array corresponds
+	 * to attribute number 1, which corresponds to a bitmapset entry for 1 -
+	 * FirstLowInvalidHeapAttributeNumber.
+	 */
+	while ((child_keycol = bms_next_member(child_keycols, child_keycol)) >= 0)
+	{
+		int			kc = child_keycol + FirstLowInvalidHeapAttributeNumber;
+
+		Assert(kc > 0 && kc <= RelationGetNumberOfAttributes(rel));
+		*bitmapset =
+			bms_add_member(*bitmapset,
+						   map[kc - 1] - FirstLowInvalidHeapAttributeNumber);
+	}
+
+	/* Release memory. */
+	pfree(map);
+}
+
+/*
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 098bc66..7881720 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2730,7 +2730,7 @@ CopyFrom(CopyState cstate)
 
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr || check_partition_constr)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index e75a59d..873156b 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2854,8 +2854,13 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 	{
 		HeapTuple	trigtuple;
 
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
+		/*
+		 * Note: if the UPDATE is converted into a DELETE+INSERT as part of an
+		 * update-partition-key operation, then this function is also called
+		 * separately for DELETE and INSERT to capture transition table rows.
+		 * In such a case, either the old tuple or the new tuple can be NULL.
+		 */
+		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
 			trigtuple = GetTupleForTrigger(estate,
 										   NULL,
 										   relinfo,
@@ -5428,7 +5433,12 @@ AfterTriggerPendingOnRel(Oid relid)
  *	triggers actually need to be queued.  It is also called after each row,
  *	even if there are no triggers for that event, if there are any AFTER
  *	STATEMENT triggers for the statement which use transition tables, so that
- *	the transition tuplestores can be built.
+ *	the transition tuplestores can be built. Furthermore, if the transition
+ *  capture is happening for UPDATEd rows being moved to another partition
+ *  due to a partition-key change, then this function is called once when the
+ *  row is deleted (to capture the OLD row), and once when the row is inserted
+ *  into another partition (to capture the NEW row). This is done separately
+ *  because the DELETE and INSERT happen on different tables.
  *
  *	Transition tuplestores are built now, rather than when events are pulled
  *	off of the queue because AFTER ROW triggers are allowed to select from the
@@ -5477,12 +5487,27 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		bool		update_new_table = transition_capture->tcs_update_new_table;
 		bool		insert_new_table = transition_capture->tcs_insert_new_table;;
 
-		if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_old_table))
+		/*
+		 * For capturing transition tuples for UPDATE events fired during
+		 * partition row movement, either oldtup or newtup can be NULL,
+		 * depending on whether the event is for the row being deleted from
+		 * the old partition or for the row being inserted into the new
+		 * partition. But in any case, oldtup should always be non-NULL for
+		 * DELETE events, and newtup should be non-NULL for INSERT events,
+		 * because with transition capture during partition row movement,
+		 * INSERT and DELETE events don't fire; only the UPDATE event fires.
+		 */
+		Assert(!(event == TRIGGER_EVENT_DELETE && delete_old_table &&
+				 oldtup == NULL));
+		Assert(!(event == TRIGGER_EVENT_INSERT && insert_new_table &&
+				 newtup == NULL));
+
+		if (oldtup != NULL &&
+			((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_old_table)))
 		{
 			Tuplestorestate *old_tuplestore;
 
-			Assert(oldtup != NULL);
 			old_tuplestore = transition_capture->tcs_private->old_tuplestore;
 
 			if (map != NULL)
@@ -5495,12 +5520,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			else
 				tuplestore_puttuple(old_tuplestore, oldtup);
 		}
-		if ((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_new_table))
+		if (newtup != NULL &&
+			((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
+			(event == TRIGGER_EVENT_UPDATE && update_new_table)))
 		{
 			Tuplestorestate *new_tuplestore;
 
-			Assert(newtup != NULL);
 			new_tuplestore = transition_capture->tcs_private->new_tuplestore;
 
 			if (original_insert_tuple != NULL)
@@ -5520,7 +5545,8 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		if (trigdesc == NULL ||
 			(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
 			(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+			(event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup == NULL)))
 			return;
 	}
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index b31ab36..d48da8e 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -104,9 +104,6 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
 									 int maxfieldlen);
 static void EvalPlanQualStart(EPQState *epqstate, EState *parentestate,
 				  Plan *planTree);
-static void ExecPartitionCheck(ResultRelInfo *resultRelInfo,
-				   TupleTableSlot *slot, EState *estate);
-
 /*
  * Note that GetUpdatedColumns() also exists in commands/trigger.c.  There does
  * not appear to be any good header to put it into, given the structures that
@@ -1851,15 +1848,10 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
 /*
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  */
-static void
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
-	Relation	rel = resultRelInfo->ri_RelationDesc;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	Bitmapset  *modifiedCols;
-	Bitmapset  *insertedCols;
-	Bitmapset  *updatedCols;
 	ExprContext *econtext;
 
 	/*
@@ -1887,52 +1879,66 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
 	 */
-	if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+	return ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext);
+}
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
+
+	/* See the comments in ExecConstraints. */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
+									 gettext_noop("could not convert row type"));
+		if (map != NULL)
 		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
-										 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecSetSlotDescriptor(slot, tupdesc);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+			tuple = do_convert_tuple(tuple, map);
+			ExecSetSlotDescriptor(slot, tupdesc);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("new row for relation \"%s\" violates partition constraint",
-						RelationGetRelationName(orig_rel)),
-				 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+			 errmsg("new row for relation \"%s\" violates partition constraint",
+					RelationGetRelationName(orig_rel)),
+			 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if requested,
+ * checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1940,7 +1946,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2056,8 +2063,9 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 }
 
 
@@ -3252,18 +3260,18 @@ EvalPlanQualEnd(EPQState *epqstate)
  *		every partitioned table in the partition tree
  * 'partitions' receives an array of ResultRelInfo* objects with one entry for
  *		every leaf partition in the partition tree
- * 'tup_conv_maps' receives an array of TupleConversionMap objects with one
- *		entry for every leaf partition (required to convert input tuple based
- *		on the root table's rowtype to a leaf partition's rowtype after tuple
- *		routing is done)
+ * 'perleaf_parentchild_maps' receives an array of TupleConversionMap objects
+ *		with one entry for every leaf partition (required to convert input tuple
+ *		based on the root table's rowtype to a leaf partition's rowtype after
+ *		tuple routing is done)
  * 'partition_tuple_slot' receives a standalone TupleTableSlot to be used
  *		to manipulate any given leaf partition's rowtype after that partition
  *		is chosen by tuple-routing.
  * 'num_parted' receives the number of partitioned tables in the partition
  *		tree (= the number of entries in the 'pd' output array)
  * 'num_partitions' receives the number of leaf partitions in the partition
- *		tree (= the number of entries in the 'partitions' and 'tup_conv_maps'
- *		output arrays
+ *		tree (= the number of entries in the 'partitions' and
+ *		'perleaf_parentchild_maps' output arrays
  *
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
@@ -3276,7 +3284,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
 							   EState *estate,
 							   PartitionDispatch **pd,
 							   ResultRelInfo ***partitions,
-							   TupleConversionMap ***tup_conv_maps,
+							   TupleConversionMap ***perleaf_parentchild_maps,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions)
 {
@@ -3297,8 +3305,8 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	*num_partitions = list_length(leaf_parts);
 	*partitions = (ResultRelInfo **) palloc(*num_partitions *
 											sizeof(ResultRelInfo *));
-	*tup_conv_maps = (TupleConversionMap **) palloc0(*num_partitions *
-													 sizeof(TupleConversionMap *));
+	*perleaf_parentchild_maps = (TupleConversionMap **) palloc0(*num_partitions *
+																sizeof(TupleConversionMap *));
 
 	if (num_update_rri != 0)
 	{
@@ -3405,11 +3413,13 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		 * Save a tuple conversion map to convert a tuple routed to this
 		 * partition from the parent's type to the partition's.
 		 */
-		(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
-													 gettext_noop("could not convert row type"));
+		(*perleaf_parentchild_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
+																gettext_noop("could not convert row type"));
 
 		/*
-		 * Verify result relation is a valid target for INSERT.
+		 * Verify the result relation is a valid target for an insert
+		 * operation. Even for updates, we are doing this for tuple routing,
+		 * which performs an insert, so we need to check validity for insert.
 		 */
 		CheckValidResultRel(leaf_part_rri, CMD_INSERT);
 
@@ -3462,8 +3472,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 	result = get_partition_for_tuple(pd, slot, estate,
 									 &failed_at, &failed_slot);
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 5a75e02..6b8af46 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -401,7 +401,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -466,7 +466,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index a64b477..03bf01c 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -45,6 +45,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -63,6 +64,11 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 bool canSetTag,
 					 TupleTableSlot **returning);
 
+static HeapTuple ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+										   TupleConversionMap *map,
+										   HeapTuple tuple,
+										   TupleTableSlot *new_slot,
+										   TupleTableSlot **p_old_slot);
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
  * target relation's rowtype
@@ -240,6 +246,38 @@ ExecCheckTIDVisible(EState *estate,
 	ReleaseBuffer(buffer);
 }
 
+/*
+ * ConvertPartitionTupleSlot -- convenience function for converting a tuple
+ * and storing it into the tuple slot provided through 'new_slot', which
+ * typically should be one of the dedicated partition tuple slots. Passes
+ * that slot back through the output param p_old_slot. If no mapping is
+ * present, keeps p_old_slot unchanged.
+ *
+ * Returns the converted tuple.
+ */
+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+						  TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot,
+						  TupleTableSlot **p_old_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor to match the converted tuple.
+	 */
+	*p_old_slot = new_slot;
+	Assert(new_slot != NULL);
+	ExecSetSlotDescriptor(new_slot, map->outdesc);
+	ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
 /* ----------------------------------------------------------------
  *		ExecInsert
  *
@@ -265,6 +303,7 @@ ExecInsert(ModifyTableState *mtstate,
 	Oid			newId;
 	List	   *recheckIndexes = NIL;
 	TupleTableSlot *result = NULL;
+	TransitionCaptureState *transition_capture = mtstate->mt_transition_capture;
 
 	/*
 	 * get the heap tuple out of the tuple table slot, making sure we have a
@@ -281,17 +320,49 @@ ExecInsert(ModifyTableState *mtstate,
 	if (mtstate->mt_partition_dispatch_info)
 	{
 		int			leaf_part_index;
-		TupleConversionMap *map;
+		ResultRelInfo *rootResultRelInfo;
+
+		/*
+		 * If the original operation is UPDATE, the root partition rel needs
+		 * to be fetched from mtstate->rootResultRelInfo.
+		 */
+		rootResultRelInfo = (mtstate->rootResultRelInfo ?
+							 mtstate->rootResultRelInfo : resultRelInfo);
+
+		/*
+		 * If the resultRelInfo is not the root partition (which happens for
+		 * UPDATE), we should convert the tuple to the root partition's
+		 * tuple descriptor, since ExecFindPartition() starts from the root.
+		 * The tuple conversion map list is in the order of
+		 * mtstate->resultRelInfo[], so to retrieve the one for this
+		 * resultRel, we need to know the position of the resultRel in
+		 * mtstate->resultRelInfo[].
+		 */
+		if (rootResultRelInfo != resultRelInfo &&
+			mtstate->mt_persubplan_childparent_maps != NULL)
+		{
+			int			map_index = resultRelInfo - mtstate->resultRelInfo;
+
+			/* resultRelInfo must be one of the per-subplan result rels. */
+			Assert(resultRelInfo >= mtstate->resultRelInfo &&
+				   resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1);
+
+			tuple = ConvertPartitionTupleSlot(mtstate,
+											  mtstate->mt_persubplan_childparent_maps[map_index],
+											  tuple,
+											  mtstate->mt_rootpartition_tuple_slot,
+											  &slot);
+		}
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
 		 * ExecFindPartition() does not return and errors out instead.
 		 * Otherwise, the returned value is to be used as an index into arrays
-		 * mt_partitions[] and mt_partition_tupconv_maps[] that will get us
+		 * mt_partitions[] and mt_perleaf_parentchild_maps[] that will get us
 		 * the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
-		leaf_part_index = ExecFindPartition(resultRelInfo,
+		leaf_part_index = ExecFindPartition(rootResultRelInfo,
 											mtstate->mt_partition_dispatch_info,
 											slot,
 											estate);
@@ -331,7 +402,7 @@ ExecInsert(ModifyTableState *mtstate,
 				 */
 				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
 				mtstate->mt_transition_capture->tcs_map =
-					mtstate->mt_transition_tupconv_maps[leaf_part_index];
+					mtstate->mt_perleaf_childparent_maps[leaf_part_index];
 			}
 			else
 			{
@@ -345,29 +416,17 @@ ExecInsert(ModifyTableState *mtstate,
 		}
 		if (mtstate->mt_oc_transition_capture != NULL)
 			mtstate->mt_oc_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[leaf_part_index];
+				mtstate->mt_perleaf_childparent_maps[leaf_part_index];
 
 		/*
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = mtstate->mt_partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = mtstate->mt_partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(mtstate,
+										  mtstate->mt_perleaf_parentchild_maps[leaf_part_index],
+										  tuple,
+										  mtstate->mt_partition_tuple_slot,
+										  &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -485,7 +544,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 		/* Check the constraints of the tuple */
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -621,9 +680,31 @@ ExecInsert(ModifyTableState *mtstate,
 		setLastTid(&(tuple->t_self));
 	}
 
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition NEW TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_new_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo, NULL,
+							 NULL,
+							 tuple,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * Now that we have captured the NEW TABLE row, the AR INSERT
+		 * trigger fired below must not capture it again.
+		 */
+		transition_capture = NULL;
+	}
+
 	/* AFTER ROW INSERT Triggers */
 	ExecARInsertTriggers(estate, resultRelInfo, tuple, recheckIndexes,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
 	list_free(recheckIndexes);
 
@@ -677,6 +758,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool *delete_skipped,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -684,6 +767,10 @@ ExecDelete(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
+	TransitionCaptureState *transition_capture = mtstate->mt_transition_capture;
+
+	if (delete_skipped)
+		*delete_skipped = true;
 
 	/*
 	 * get information on the (current) result relation
@@ -848,12 +935,39 @@ ldelete:;
 	if (canSetTag)
 		(estate->es_processed)++;
 
+	/* The delete has actually happened, so inform the caller */
+	if (delete_skipped)
+		*delete_skipped = false;
+
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition OLD TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_old_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo,
+							 tupleid,
+							 oldtuple,
+							 NULL,
+							 NULL,
+							 transition_capture);
+
+		/*
+		 * Now that we have captured the OLD TABLE row, the AR DELETE
+		 * trigger fired below must not capture it again.
+		 */
+		transition_capture = NULL;
+	}
+
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -946,6 +1060,8 @@ ExecUpdate(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	TupleConversionMap *saved_tcs_map = NULL;
+
 
 	/*
 	 * abort the operation if not running transactions
@@ -1042,12 +1158,82 @@ lreplace:;
 								 resultRelInfo, slot, estate);
 
 		/*
+		 * If a partition check fails, try to move the row into the right
+		 * partition.
+		 */
+		if (resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate))
+		{
+			bool		delete_skipped;
+			TupleTableSlot *ret_slot;
+
+			/*
+			 * When an UPDATE is run directly on a leaf partition, we would
+			 * not have partition tuple routing set up. In that case, fail
+			 * with a partition constraint violation error.
+			 */
+			if (mtstate->mt_partition_dispatch_info == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for the DELETE; we want to return
+			 * rows from the INSERT instead.
+			 */
+			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
+					   &delete_skipped, false, false);
+
+			/*
+			 * If for some reason the DELETE didn't happen (e.g. a trigger
+			 * prevented it, or it was already deleted by this command, or it
+			 * was concurrently deleted by another transaction), then we
+			 * should skip the INSERT as well; otherwise there would
+			 * effectively be one new row inserted.
+			 *
+			 * For a normal UPDATE, the case where the tuple has been the
+			 * subject of a concurrent UPDATE or DELETE would be handled by
+			 * the EvalPlanQual machinery, but for an UPDATE that we've
+			 * translated into a DELETE from this partition and an INSERT into
+			 * some other partition, that's not available, because CTID chains
+			 * can't span relation boundaries.  We mimic the semantics to a
+			 * limited extent by skipping the INSERT if the DELETE fails to
+			 * find a tuple. This ensures that two concurrent attempts to
+			 * UPDATE the same tuple at the same time can't turn one tuple
+			 * into two, and that an UPDATE of a just-deleted tuple can't
+			 * resurrect it.
+			 */
+			if (delete_skipped)
+				return NULL;
+
+			if (mtstate->mt_transition_capture)
+				saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
+
+			ret_slot = ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+
+			if (mtstate->mt_transition_capture)
+			{
+				/*
+				 * Revert to the transition capture map created for UPDATE;
+				 * otherwise the next UPDATE will incorrectly use the one
+				 * created for INSERT.
+				 */
+				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+				mtstate->mt_transition_capture->tcs_map = saved_tcs_map;
+			}
+			return ret_slot;
+		}
+
+		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We have already checked partition constraints above, so skip them
+		 * below.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1468,6 +1654,45 @@ fireASTriggers(ModifyTableState *node)
 }
 
 /*
+ * Set up per-subplan tuple conversion maps from child partition to root
+ * partitioned table. The map is needed for collecting transition tuples for
+ * AFTER triggers, and for UPDATE row movement.
+ */
+static void
+ExecSetupPerSubplanChildParentMap(ModifyTableState *mtstate)
+{
+	TupleConversionMap **tup_conv_maps;
+	TupleDesc	outdesc;
+	ResultRelInfo *resultRelInfo;
+	ResultRelInfo *rootRelInfo;
+	int			nplans = mtstate->mt_nplans;
+	int			i;
+
+	Assert(mtstate->operation != CMD_INSERT);
+
+	if (mtstate->mt_persubplan_childparent_maps != NULL)
+		return;
+
+	rootRelInfo = getASTriggerResultRelInfo(mtstate);
+
+	mtstate->mt_persubplan_childparent_maps =
+		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) * nplans);
+
+	/* Get tuple descriptor of the root partition. */
+	outdesc = RelationGetDescr(rootRelInfo->ri_RelationDesc);
+
+	resultRelInfo = mtstate->resultRelInfo;
+	tup_conv_maps = mtstate->mt_persubplan_childparent_maps;
+	for (i = 0; i < nplans; i++)
+	{
+		TupleDesc	indesc = RelationGetDescr(resultRelInfo[i].ri_RelationDesc);
+
+		tup_conv_maps[i] = convert_tuples_by_name(indesc, outdesc,
+												  gettext_noop("could not convert row type"));
+	}
+}
+
+/*
  * Set up the state needed for collecting transition tuples for AFTER
  * triggers.
  */
@@ -1475,6 +1700,11 @@ static void
 ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 {
 	ResultRelInfo *targetRelInfo = getASTriggerResultRelInfo(mtstate);
+	ResultRelInfo **resultRelInfos;
+	int			numResultRelInfos;
+	int			update_rri_index = -1;
+	ResultRelInfo *update_rri = mtstate->resultRelInfo;
+	Oid			cur_reloid = InvalidOid;
 	int			i;
 
 	/* Check for transition tables on the directly targeted relation. */
@@ -1489,71 +1719,108 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 									   RelationGetRelid(targetRelInfo->ri_RelationDesc),
 									   CMD_UPDATE);
 
+	if (mtstate->mt_transition_capture == NULL &&
+		mtstate->mt_oc_transition_capture == NULL)
+		return;
+
 	/*
-	 * If we found that we need to collect transition tuples then we may also
+	 * Now that we know that we need to collect transition tuples, we may also
 	 * need tuple conversion maps for any children that have TupleDescs that
 	 * aren't compatible with the tuplestores.  (We can share these maps
 	 * between the regular and ON CONFLICT cases.)
 	 */
-	if (mtstate->mt_transition_capture != NULL ||
-		mtstate->mt_oc_transition_capture != NULL)
+
+	/* Make sure per-subplan mapping is there. */
+	if (mtstate->operation != CMD_INSERT)
+		ExecSetupPerSubplanChildParentMap(mtstate);
+
+	/*
+	 * Install the conversion map for the first plan for UPDATE and DELETE
+	 * operations.  It will be advanced each time we switch to the next
+	 * plan.  (INSERT operations set it every time, so we need not update
+	 * mtstate->mt_oc_transition_capture here.)
+	 */
+	if (mtstate->mt_transition_capture &&
+		mtstate->mt_persubplan_childparent_maps)
 	{
-		int			numResultRelInfos;
+		mtstate->mt_transition_capture->tcs_map =
+			mtstate->mt_persubplan_childparent_maps[0];
+	}
+
+	/* If no tuple routing, return without setting up per-leaf-partition map */
+	if (mtstate->mt_partition_dispatch_info == NULL)
+		return;
 
-		numResultRelInfos = (mtstate->mt_partition_tuple_slot != NULL ?
-							 mtstate->mt_num_partitions :
-							 mtstate->mt_nplans);
+	numResultRelInfos = mtstate->mt_num_partitions;
+	resultRelInfos = mtstate->mt_partitions;
 
-		/*
-		 * Build array of conversion maps from each child's TupleDesc to the
-		 * one used in the tuplestore.  The map pointers may be NULL when no
-		 * conversion is necessary, which is hopefully a common case for
-		 * partitions.
-		 */
-		mtstate->mt_transition_tupconv_maps = (TupleConversionMap **)
-			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+	/*
+	 * Build array of conversion maps from each child's TupleDesc to the one
+	 * used in the tuplestore.  The map pointers may be NULL when no
+	 * conversion is necessary, which is hopefully a common case for
+	 * partitions.
+	 */
+	mtstate->mt_perleaf_childparent_maps = (TupleConversionMap **)
+		palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
 
-		/* Choose the right set of partitions */
-		if (mtstate->mt_partition_dispatch_info != NULL)
+	/* For Inserts, just create all new map elements. */
+	if (mtstate->operation == CMD_INSERT)
+	{
+		for (i = 0; i < numResultRelInfos; ++i)
 		{
+			mtstate->mt_perleaf_childparent_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
+									   RelationGetDescr(targetRelInfo->ri_RelationDesc),
+									   gettext_noop("could not convert row type"));
+		}
+		return;
+	}
+
+	/*
+	 * But for Updates, we can share the per-subplan maps with the per-leaf
+	 * maps.
+	 */
+	update_rri_index = 0;
+	update_rri = mtstate->resultRelInfo;
+	if (mtstate->mt_nplans > 0)
+		cur_reloid = RelationGetRelid(update_rri[0].ri_RelationDesc);
+
+	for (i = 0; i < numResultRelInfos; ++i)
+	{
+		ResultRelInfo *resultRelInfo = mtstate->mt_partitions[i];
+
+		/* Is this leaf partition present among the UPDATE resultrels? */
+		if (cur_reloid == RelationGetRelid(resultRelInfo->ri_RelationDesc))
+		{
+			Assert(update_rri_index < mtstate->mt_nplans);
+
+			mtstate->mt_perleaf_childparent_maps[i] =
+				mtstate->mt_persubplan_childparent_maps[update_rri_index];
+			update_rri_index++;
+
 			/*
-			 * For tuple routing among partitions, we need TupleDescs based
-			 * on the partition routing table.
+			 * If this was the last UPDATE resultrel, indicate that by
+			 * invalidating the cur_reloid.
 			 */
-			ResultRelInfo **resultRelInfos = mtstate->mt_partitions;
-
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
+			if (update_rri_index == mtstate->mt_nplans)
+				cur_reloid = InvalidOid;
+			else
+				cur_reloid = RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc);
 		}
 		else
 		{
-			/* Otherwise we need the ResultRelInfo for each subplan. */
-			ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
-
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
+			mtstate->mt_perleaf_childparent_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfo->ri_RelationDesc),
+									   RelationGetDescr(targetRelInfo->ri_RelationDesc),
+									   gettext_noop("could not convert row type"));
 		}
-
-		/*
-		 * Install the conversion map for the first plan for UPDATE and DELETE
-		 * operations.  It will be advanced each time we switch to the next
-		 * plan.  (INSERT operations set it every time, so we need not update
-		 * mtstate->mt_oc_transition_capture here.)
-		 */
-		if (mtstate->mt_transition_capture)
-			mtstate->mt_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[0];
 	}
+
+	/*
+	 * We should have found all the per-subplan reloids in the leaf
+	 * partitions.
+	 */
+	Assert(update_rri_index == mtstate->mt_nplans);
 }
 
 /* ----------------------------------------------------------------
@@ -1659,15 +1926,15 @@ ExecModifyTable(PlanState *pstate)
 				/* Prepare to convert transition tuples from this child. */
 				if (node->mt_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
+					Assert(node->mt_persubplan_childparent_maps != NULL);
 					node->mt_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						node->mt_persubplan_childparent_maps[node->mt_whichplan];
 				}
 				if (node->mt_oc_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
+					Assert(node->mt_persubplan_childparent_maps != NULL);
 					node->mt_oc_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						node->mt_persubplan_childparent_maps[node->mt_whichplan];
 				}
 				continue;
 			}
@@ -1783,7 +2050,8 @@ ExecModifyTable(PlanState *pstate)
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1828,9 +2096,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ResultRelInfo *resultRelInfo;
 	TupleDesc	tupDesc;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	bool		update_tuple_routing_needed = node->part_cols_updated;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -1903,6 +2174,15 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(resultRelInfo, mtstate->mt_onconflict != ONCONFLICT_NONE);
 
+		/*
+		 * If this is an UPDATE and a BEFORE UPDATE trigger is present, it
+		 * may change the partition key, so we may need update tuple routing.
+		 */
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+			operation == CMD_UPDATE)
+			update_tuple_routing_needed = true;
+
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
 		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
@@ -1940,36 +2220,64 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/* Decide whether we need to perform update tuple routing. */
+	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		update_tuple_routing_needed = false;
+
+	/*
+	 * Build state for tuple routing if it's an INSERT, or if it's an UPDATE
+	 * of the partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || update_tuple_routing_needed))
 	{
 		PartitionDispatch *partition_dispatch_info;
 		ResultRelInfo **partitions;
-		TupleConversionMap **partition_tupconv_maps;
+		TupleConversionMap **perleaf_parentchild_maps;
 		TupleTableSlot *partition_tuple_slot;
 		int			num_parted,
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(rel,
-									   NULL,
-									   0,
+									   (operation == CMD_UPDATE ?
+										mtstate->resultRelInfo : NULL),
+									   (operation == CMD_UPDATE ? nplans : 0),
 									   node->nominalRelation,
 									   estate,
 									   &partition_dispatch_info,
 									   &partitions,
-									   &partition_tupconv_maps,
+									   &perleaf_parentchild_maps,
 									   &partition_tuple_slot,
 									   &num_parted, &num_partitions);
 		mtstate->mt_partition_dispatch_info = partition_dispatch_info;
 		mtstate->mt_num_dispatch = num_parted;
 		mtstate->mt_partitions = partitions;
 		mtstate->mt_num_partitions = num_partitions;
-		mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
+		mtstate->mt_perleaf_parentchild_maps = perleaf_parentchild_maps;
 		mtstate->mt_partition_tuple_slot = partition_tuple_slot;
+		mtstate->mt_rootpartition_tuple_slot = MakeTupleTableSlot();
+
+		/*
+		 * These are required as reference objects for mapping partition
+		 * attnos in expressions such as WCO and RETURNING.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	}
 
 	/*
+	 * Construct a mapping from each of the resultRelInfo attnos to the root
+	 * attnos. This is required when, during UPDATE row movement, the tuple
+	 * descriptor of a source partition does not match the root partition's
+	 * descriptor. In such a case we need to convert tuples to the root
+	 * partition's tuple descriptor, because the search for the destination
+	 * partition starts from the root. Skip this setup if it's not a
+	 * partition-key update.
+	 */
+	if (update_tuple_routing_needed)
+		ExecSetupPerSubplanChildParentMap(mtstate);
+
+	/*
 	 * Build state for collecting transition tuples.  This requires having a
 	 * valid trigger query context, so skip it in explain-only mode.
 	 */
@@ -2006,50 +2314,53 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE row
+	 * movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *firstWco;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition, whereas for UPDATE there are as many WCO lists as there
+		 * are plans. In either case, use the WCO expressions of the first
+		 * resultRelInfo as a reference to calculate attnos for the WCO
+		 * expressions of each of the partitions. We make a copy of the WCO
+		 * qual for each partition. Note that, if there are SubPlans in there,
+		 * they all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
-			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
+		Assert(update_tuple_routing_needed ||
+			   (operation == CMD_INSERT &&
+				list_length(node->withCheckOptionLists) == 1 &&
+				mtstate->mt_nplans == 1));
+
+		firstWco = linitial(node->withCheckOptionLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
 			Relation	partrel;
-			List	   *mapped_wcoList;
+			List	   *mappedWco;
 			List	   *wcoExprs = NIL;
 			ListCell   *ll;
 
 			resultRelInfo = mtstate->mt_partitions[i];
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel, NULL);
-			foreach(ll, mapped_wcoList)
+			mappedWco = map_partition_varattnos(firstWco,
+												firstVarno,
+												partrel, firstResultRel,
+												NULL);
+			foreach(ll, mappedWco)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
 
-			resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
+			resultRelInfo->ri_WithCheckOptions = mappedWco;
 			resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
 		}
 	}
@@ -2061,7 +2372,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -2098,10 +2409,10 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
 			Relation	partrel;
@@ -2110,10 +2421,14 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo = mtstate->mt_partitions[i];
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel, NULL);
+			/*
+			 * Use the RETURNING expressions of the first resultRelInfo as a
+			 * reference to calculate attnos for the RETURNING expressions of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList,
+											firstVarno,
+											partrel, firstResultRel, NULL);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
@@ -2359,6 +2674,7 @@ void
 ExecEndModifyTable(ModifyTableState *node)
 {
 	int			i;
+	CmdType		operation = node->operation;
 
 	/*
 	 * Allow any FDWs to shut down
@@ -2393,11 +2709,23 @@ ExecEndModifyTable(ModifyTableState *node)
 	{
 		ResultRelInfo *resultRelInfo = node->mt_partitions[i];
 
+		/*
+		 * If this result rel is one of the subplan result rels, let
+		 * ExecEndPlan() close it.  For INSERTs, this does not apply because
+		 * all leaf partition result rels are newly allocated anyway.
+		 */
+		if (operation == CMD_UPDATE &&
+			resultRelInfo >= node->resultRelInfo &&
+			resultRelInfo < node->resultRelInfo + node->mt_nplans)
+			continue;
+
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
 
-	/* Release the standalone partition tuple descriptor, if any */
+	/* Release the standalone partition tuple slots, if any */
+	if (node->mt_rootpartition_tuple_slot)
+		ExecDropSingleTupleTableSlot(node->mt_rootpartition_tuple_slot);
 	if (node->mt_partition_tuple_slot)
 		ExecDropSingleTupleTableSlot(node->mt_partition_tuple_slot);
 
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index f1bed14..2d86593 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -204,6 +204,7 @@ _copyModifyTable(const ModifyTable *from)
 	COPY_SCALAR_FIELD(canSetTag);
 	COPY_SCALAR_FIELD(nominalRelation);
 	COPY_NODE_FIELD(partitioned_rels);
+	COPY_SCALAR_FIELD(part_cols_updated);
 	COPY_NODE_FIELD(resultRelations);
 	COPY_SCALAR_FIELD(resultRelIndex);
 	COPY_SCALAR_FIELD(rootResultRelIndex);
@@ -2260,6 +2261,7 @@ _copyPartitionedChildRelInfo(const PartitionedChildRelInfo *from)
 
 	COPY_SCALAR_FIELD(parent_relid);
 	COPY_NODE_FIELD(child_rels);
+	COPY_BITMAPSET_FIELD(all_part_cols);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 8b56b91..9428c2c 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -909,6 +909,7 @@ _equalPartitionedChildRelInfo(const PartitionedChildRelInfo *a, const Partitione
 {
 	COMPARE_SCALAR_FIELD(parent_relid);
 	COMPARE_NODE_FIELD(child_rels);
+	COMPARE_BITMAPSET_FIELD(all_part_cols);
 
 	return true;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index b83d919..2492cb8 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -367,6 +367,7 @@ _outModifyTable(StringInfo str, const ModifyTable *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_INT_FIELD(resultRelIndex);
 	WRITE_INT_FIELD(rootResultRelIndex);
@@ -2096,6 +2097,7 @@ _outModifyTablePath(StringInfo str, const ModifyTablePath *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_NODE_FIELD(subroots);
@@ -2518,6 +2520,7 @@ _outPartitionedChildRelInfo(StringInfo str, const PartitionedChildRelInfo *node)
 
 	WRITE_UINT_FIELD(parent_relid);
 	WRITE_NODE_FIELD(child_rels);
+	WRITE_BITMAPSET_FIELD(all_part_cols);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index fbf8330..0b1c70e 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1562,6 +1562,7 @@ _readModifyTable(void)
 	READ_BOOL_FIELD(canSetTag);
 	READ_UINT_FIELD(nominalRelation);
 	READ_NODE_FIELD(partitioned_rels);
+	READ_BOOL_FIELD(part_cols_updated);
 	READ_NODE_FIELD(resultRelations);
 	READ_INT_FIELD(resultRelIndex);
 	READ_INT_FIELD(rootResultRelIndex);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index a7866a9..946964a 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1310,7 +1310,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		case RTE_RELATION:
 			if (rte->relkind == RELKIND_PARTITIONED_TABLE)
 				partitioned_rels =
-					get_partitioned_child_rels(root, rel->relid);
+					get_partitioned_child_rels(root, rel->relid, NULL);
 			break;
 		case RTE_SUBQUERY:
 			build_partitioned_rels = true;
@@ -1337,7 +1337,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		{
 			List	   *cprels;
 
-			cprels = get_partitioned_child_rels(root, childrel->relid);
+			cprels = get_partitioned_child_rels(root, childrel->relid, NULL);
 			partitioned_rels = list_concat(partitioned_rels,
 										   list_copy(cprels));
 		}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 2821662..85e3126 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -277,6 +277,7 @@ static ProjectSet *make_project_set(List *tlist, Plan *subplan);
 static ModifyTable *make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
@@ -2361,6 +2362,7 @@ create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 							best_path->canSetTag,
 							best_path->nominalRelation,
 							best_path->partitioned_rels,
+							best_path->part_cols_updated,
 							best_path->resultRelations,
 							subplans,
 							best_path->withCheckOptionLists,
@@ -6405,6 +6407,7 @@ static ModifyTable *
 make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam)
@@ -6431,6 +6434,7 @@ make_modifytable(PlannerInfo *root,
 	node->canSetTag = canSetTag;
 	node->nominalRelation = nominalRelation;
 	node->partitioned_rels = partitioned_rels;
+	node->part_cols_updated = part_cols_updated;
 	node->resultRelations = resultRelations;
 	node->resultRelIndex = -1;	/* will be set correctly in setrefs.c */
 	node->rootResultRelIndex = -1;	/* will be set correctly in setrefs.c */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 7f146d6..3aad00b 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1060,6 +1060,7 @@ inheritance_planner(PlannerInfo *root)
 	Query	   *parent_parse;
 	Bitmapset  *parent_relids = bms_make_singleton(top_parentRTindex);
 	PlannerInfo **parent_roots = NULL;
+	bool		part_cols_updated = false;
 
 	Assert(parse->commandType != CMD_INSERT);
 
@@ -1130,10 +1131,16 @@ inheritance_planner(PlannerInfo *root)
 	parent_rte = rt_fetch(top_parentRTindex, root->parse->rtable);
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
+		Bitmapset	*all_part_cols = NULL;
+
 		nominalRelation = top_parentRTindex;
-		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex);
+		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex,
+													  &all_part_cols);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
+
+		if (bms_overlap(all_part_cols, parent_rte->updatedCols))
+			part_cols_updated = true;
 	}
 
 	/*
@@ -1471,6 +1478,7 @@ inheritance_planner(PlannerInfo *root)
 									 parse->canSetTag,
 									 nominalRelation,
 									 partitioned_rels,
+									 part_cols_updated,
 									 resultRelations,
 									 subpaths,
 									 subroots,
@@ -2088,6 +2096,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 										parse->canSetTag,
 										parse->resultRelation,
 										NIL,
+										false,
 										list_make1_int(parse->resultRelation),
 										list_make1(path),
 										list_make1(root),
@@ -6118,11 +6127,16 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
  *		Returns a list of the RT indexes of the partitioned child relations
  *		with rti as the root parent RT index.
  *
+ * If all_part_cols_p is non-NULL, *all_part_cols_p is set to a bitmapset
+ * of all partitioning columns used by the partitioned table or any
+ * descendant.
+ *
  * Note: This function might get called even for range table entries that
  * are not partitioned tables; in such a case, it will simply return NIL.
  */
 List *
-get_partitioned_child_rels(PlannerInfo *root, Index rti)
+get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   Bitmapset **all_part_cols_p)
 {
 	List	   *result = NIL;
 	ListCell   *l;
@@ -6134,6 +6148,8 @@ get_partitioned_child_rels(PlannerInfo *root, Index rti)
 		if (pc->parent_relid == rti)
 		{
 			result = pc->child_rels;
+			if (all_part_cols_p)
+				*all_part_cols_p = pc->all_part_cols;
 			break;
 		}
 	}
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 3e0c3de..f28b381 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -105,7 +105,8 @@ static void expand_partitioned_rtentry(PlannerInfo *root,
 						   RangeTblEntry *parentrte,
 						   Index parentRTindex, Relation parentrel,
 						   PlanRowMark *top_parentrc, LOCKMODE lockmode,
-						   List **appinfos, List **partitioned_child_rels);
+						   List **appinfos, Bitmapset **all_part_cols,
+						   List **partitioned_child_rels);
 static void expand_single_inheritance_child(PlannerInfo *root,
 								RangeTblEntry *parentrte,
 								Index parentRTindex, Relation parentrel,
@@ -1464,15 +1465,20 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	if (RelationGetPartitionDesc(oldrelation) != NULL)
 	{
 		List	   *partitioned_child_rels = NIL;
+		Bitmapset  *all_part_cols = NULL;
 
 		Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
 
 		/*
 		 * If this table has partitions, recursively expand them in the order
-		 * in which they appear in the PartitionDesc.
+		 * in which they appear in the PartitionDesc.  Also, extract the
+		 * partition key columns of the root partitioned table; those of the
+		 * child partitions will be collected during recursive expansion.
 		 */
+		pull_child_partition_columns(&all_part_cols, oldrelation, oldrelation);
 		expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
 								   lockmode, &root->append_rel_list,
+								   &all_part_cols,
 								   &partitioned_child_rels);
 
 		/*
@@ -1490,6 +1496,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 			pcinfo = makeNode(PartitionedChildRelInfo);
 			pcinfo->parent_relid = rti;
 			pcinfo->child_rels = partitioned_child_rels;
+			pcinfo->all_part_cols = all_part_cols;
 			root->pcinfo_list = lappend(root->pcinfo_list, pcinfo);
 		}
 	}
@@ -1566,7 +1573,8 @@ static void
 expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 						   Index parentRTindex, Relation parentrel,
 						   PlanRowMark *top_parentrc, LOCKMODE lockmode,
-						   List **appinfos, List **partitioned_child_rels)
+						   List **appinfos, Bitmapset **all_part_cols,
+						   List **partitioned_child_rels)
 {
 	int			i;
 	RangeTblEntry *childrte;
@@ -1618,9 +1626,15 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 
 		/* If this child is itself partitioned, recurse */
 		if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		{
+			/* Also, collect the partition columns */
+			pull_child_partition_columns(all_part_cols, childrel, parentrel);
+
 			expand_partitioned_rtentry(root, childrte, childRTindex,
 									   childrel, top_parentrc, lockmode,
-									   appinfos, partitioned_child_rels);
+									   appinfos, all_part_cols,
+									   partitioned_child_rels);
+		}
 
 		/* Close child relation, but keep locks */
 		heap_close(childrel, NoLock);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 26567cb..326c858 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3162,6 +3162,8 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partitioned_rels' is an integer list of RT indexes of non-leaf tables in
  *		the partition tree, if this is an UPDATE/DELETE to a partitioned table.
  *		Otherwise NIL.
+ * 'part_cols_updated' is true if any partitioning columns are being updated,
+ *		either in the named relation or a descendant partitioned table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
  * 'subpaths' is a list of Path(s) producing source data (one per rel)
  * 'subroots' is a list of PlannerInfo structs (one per rel)
@@ -3175,6 +3177,7 @@ ModifyTablePath *
 create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
@@ -3242,6 +3245,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->canSetTag = canSetTag;
 	pathnode->nominalRelation = nominalRelation;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
+	pathnode->part_cols_updated = part_cols_updated;
 	pathnode->resultRelations = resultRelations;
 	pathnode->subpaths = subpaths;
 	pathnode->subroots = subroots;
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 454a940..b714bc3 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -80,8 +80,8 @@ extern void check_new_partition_bound(char *relname, Relation parent,
 extern Oid	get_partition_parent(Oid relid);
 extern List *get_qual_from_partbound(Relation rel, Relation parent,
 						PartitionBoundSpec *spec);
-extern List *map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+extern List *map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
@@ -99,6 +99,9 @@ extern int get_partition_for_tuple(PartitionDispatch *pd,
 						EState *estate,
 						PartitionDispatchData **failed_at,
 						TupleTableSlot **failed_slot);
+extern void pull_child_partition_columns(Bitmapset **bitmapset,
+							 Relation rel,
+							 Relation parent);
 extern Oid	get_default_oid_from_partdesc(PartitionDesc partdesc);
 extern Oid	get_default_partition_oid(Oid parentId);
 extern void update_default_partition_oid(Oid parentId, Oid defaultPartId);
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index cc1cc2a..8e2bf5f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,7 +187,10 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
@@ -220,6 +223,8 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot, EState *estate);
 
 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
 extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 9187f7a..9ba1976 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -979,15 +979,32 @@ typedef struct ModifyTableState
 	int			mt_num_partitions;	/* Number of members in the following
 									 * arrays */
 	ResultRelInfo **mt_partitions;	/* Per partition result relation pointers */
-	TupleConversionMap **mt_partition_tupconv_maps;
-	/* Per partition tuple conversion map */
+
+	/*
+	 * Per partition conversion map to convert tuples from root to leaf
+	 * partition
+	 */
+	TupleConversionMap **mt_perleaf_parentchild_maps;
+
+	/*
+	 * Per partition conversion map to convert tuples from leaf partition to
+	 * root
+	 */
+	TupleConversionMap **mt_perleaf_childparent_maps;
+
+	/*
+	 * Per subplan conversion map to convert tuples from leaf partition to
+	 * root partitioned table
+	 */
+	TupleConversionMap **mt_persubplan_childparent_maps;
+
 	TupleTableSlot *mt_partition_tuple_slot;
+	TupleTableSlot *mt_rootpartition_tuple_slot;
+
 	struct TransitionCaptureState *mt_transition_capture;
 	/* controls transition table population for specified operation */
 	struct TransitionCaptureState *mt_oc_transition_capture;
 	/* controls transition table population for INSERT...ON CONFLICT UPDATE */
-	TupleConversionMap **mt_transition_tupconv_maps;
-	/* Per plan/partition tuple conversion */
 } ModifyTableState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index a382331..6981f58 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -219,6 +219,7 @@ typedef struct ModifyTable
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	int			resultRelIndex; /* index of first resultRel in plan's list */
 	int			rootResultRelIndex; /* index of the partitioned table root */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 48e6012..5e7d07c 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1633,6 +1633,7 @@ typedef struct ModifyTablePath
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	List	   *subpaths;		/* Path(s) producing source data */
 	List	   *subroots;		/* per-target-table PlannerInfos */
@@ -2075,6 +2076,10 @@ typedef struct AppendRelInfo
  * The child_rels list must contain at least one element, because the parent
  * partitioned table is itself counted as a child.
  *
+ * all_part_cols contains all attribute numbers from the parent that are
+ * used as partitioning columns by the parent or some descendant that is
+ * itself partitioned.
+ *
  * These structs are kept in the PlannerInfo node's pcinfo_list.
  */
 typedef struct PartitionedChildRelInfo
@@ -2083,6 +2088,7 @@ typedef struct PartitionedChildRelInfo
 
 	Index		parent_relid;
 	List	   *child_rels;
+	Bitmapset  *all_part_cols;
 } PartitionedChildRelInfo;
 
 /*
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e372f88..b38f2f1 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -238,6 +238,7 @@ extern ModifyTablePath *create_modifytable_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index 2a4cf71..c6c15c5 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -57,6 +57,7 @@ extern Expr *preprocess_phv_expression(PlannerInfo *root, Expr *expr);
 
 extern bool plan_cluster_use_sort(Oid tableOid, Oid indexOid);
 
-extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti);
+extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti,
+										Bitmapset **all_part_cols_p);
 
 #endif							/* PLANNER_H */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index cef70b1..82c63f9 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,27 +198,345 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+---------------------------
+-- UPDATE with row movement
+---------------------------
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If the partition key is updated, the row should be moved to the appropriate
+-- partition. Updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c numeric
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+-- Create partitions intentionally in descending bound order, to check that
+-- the subplans are ordered in ascending bound order rather than by OID.
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (c numeric, a text, b int);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+create table part_c_1_100 (b int, c numeric, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, NULL), (''a'', 10, 200), (''b'', 12, 96), (''b'', 13, 97), (''b'', 15, 105), (''b'', 17, 105)'
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 105
+ part_c_100_200 | b | 17 | 105
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+(6 rows)
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+             QUERY PLAN              
+-------------------------------------
+ Update on range_parted
+   Update on part_a_1_a_10
+   Update on part_a_10_a_20
+   Update on part_b_1_b_10
+   Update on part_c_1_100
+   Update on part_c_100_200
+   ->  Seq Scan on part_a_1_a_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_a_10_a_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_1_b_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_1_100
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_100_200
+         Filter: (c > '97'::numeric)
+(16 rows)
+
+-- fail (row movement happens only within the partition subtree):
+update part_c_1_100 set c = c + 20 where c = 96;
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (12, 116, b).
+-- No row found:
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+select a, b, c from part_c_1_100 order by 1, 2, 3;
+ a | b | c 
+---+---+---
+(0 rows)
+
+select a, b, c from part_c_100_200 order by 1, 2, 3;
+ a | b  |  c  
+---+----+-----
+ b | 12 | 116
+ b | 13 | 117
+ b | 15 | 125
+ b | 17 | 125
+(4 rows)
+
+-- fail (row movement happens only within the partition subtree):
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (117, b, 7).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_1_a_10  | a |  1 |    
+ part_a_1_a_10  | a |  4 | 200
+ part_b_1_b_10  | b |  7 | 117
+ part_b_1_b_10  | b |  9 | 125
+ part_c_100_200 | b | 11 | 125
+ part_c_100_200 | b | 12 | 116
+(6 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120).
+-- succeeds, row movement, check option passes
+update upview set a = 'b', b = 15 where b = 4;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_1_a_10  | a |  1 |    
+ part_b_1_b_10  | b |  7 | 117
+ part_b_1_b_10  | b |  9 | 125
+ part_c_100_200 | b | 11 | 125
+ part_c_100_200 | b | 12 | 116
+ part_c_100_200 | b | 15 | 199
+(6 rows)
+
+-- cleanup
+drop view upview;
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+ range_parted | a | b  | c  
+--------------+---+----+----
+ (b,15,95)    | b | 15 | 95
+ (b,17,95)    | b | 17 | 95
+(2 rows)
+
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+ part_c_1_100   | b | 15 |  95
+ part_c_1_100   | b | 17 |  95
+(6 rows)
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 105
+ part_c_100_200 | b | 17 | 105
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+(6 rows)
+
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96), (b,13,97), (b,15,105), (b,17,105), new table = (b,12,110), (b,13,98), (b,15,106), (b,17,106)
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 12 | 110
+ part_c_100_200 | b | 15 | 106
+ part_c_100_200 | b | 17 | 106
+ part_c_1_100   | b | 13 |  98
+(6 rows)
+
+:init_range_parted;
+-- Enabling OLD TABLE capture for both DELETE and UPDATE statement triggers
+-- should not cause deleted rows to be captured twice; similarly for INSERT
+-- triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96), (b,13,97), (b,15,105), (b,17,105), new table = (b,12,146), (b,13,147), (b,15,155), (b,17,155)
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 12 | 146
+ part_c_100_200 | b | 13 | 147
+ part_c_100_200 | b | 15 | 155
+ part_c_100_200 | b | 17 | 155
+(6 rows)
+
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_c100_200 before update or insert on part_c_100_200
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 110
+ part_c_100_200 | b | 17 | 106
+ part_c_100_200 | b | 19 | 106
+ part_c_1_100   | b | 15 |  98
+(6 rows)
+
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 146
+ part_c_100_200 | b | 16 | 147
+ part_c_100_200 | b | 17 | 155
+ part_c_100_200 | b | 19 | 155
+(6 rows)
+
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_c100_200 ON part_c_100_200;
+drop function func_parted_mod_b();
+-- statement triggers with update row movement
+---------------------------------------------------
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 105
+ part_c_100_200 | b | 17 | 105
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+(6 rows)
+
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_100_200
+create trigger c100_delete_trig
+  after delete on part_c_100_200 for each statement execute procedure trigfunc();
+create trigger c100_update_trig
+  after update on part_c_100_200 for each statement execute procedure trigfunc();
+create trigger c100_insert_trig
+  after insert on part_c_100_200 for each statement execute procedure trigfunc();
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+NOTICE:  trigger = parent_update_trig fired on table range_parted during UPDATE
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 150
+ part_a_1_a_10  | a |  1 |    
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+ part_c_1_100   | b | 15 |  55
+ part_c_1_100   | b | 17 |  55
+(6 rows)
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger c100_delete_trig ON part_c_100_200;
+drop trigger c100_update_trig ON part_c_100_200;
+drop trigger c100_insert_trig ON part_c_100_200;
+drop table mintab;
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
                                   Table "public.part_def"
@@ -226,6 +544,7 @@ create table part_def partition of range_parted default;
 --------+---------+-----------+----------+---------+----------+--------------+-------------
  a      | text    |           |          |         | extended |              | 
  b      | integer |           |          |         | plain    |              | 
+ c      | numeric |           |          |         | main     |              | 
 Partition of: range_parted DEFAULT
 Partition constraint: (NOT (((a = 'a'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'a'::text) AND (b >= 10) AND (b < 20)) OR ((a = 'b'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'b'::text) AND (b >= 10) AND (b < 20))))
 
@@ -235,7 +554,55 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 ERROR:  new row for relation "part_def" violates partition constraint
-DETAIL:  Failing row contains (a, 9).
+DETAIL:  Failing row contains (a, 9, null).
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 105
+ part_c_100_200 | b | 17 | 105
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+ part_def       | d |  9 |    
+(7 rows)
+
+-- Update row movement from non-default to default partition.
+-- Fail, default partition is not under part_a_10_a_20;
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+ERROR:  new row for relation "part_a_10_a_20" violates partition constraint
+DETAIL:  Failing row contains (ad, 10, 200).
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+ partname | a  | b  |  c  
+----------+----+----+-----
+ part_def | ad |  1 |    
+ part_def | ad | 10 | 200
+ part_def | bd | 12 |  96
+ part_def | bd | 13 |  97
+ part_def | bd | 15 | 105
+ part_def | bd | 17 | 105
+ part_def | d  |  9 |    
+(7 rows)
+
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  
+----------------+---+----+-----
+ part_a_10_a_20 | a | 10 | 200
+ part_a_1_a_10  | a |  1 |    
+ part_c_100_200 | b | 15 | 105
+ part_c_100_200 | b | 17 | 105
+ part_c_1_100   | b | 12 |  96
+ part_c_1_100   | b | 13 |  97
+ part_def       | d |  9 |    
+(7 rows)
+
 create table list_parted (
 	a text,
 	b int
@@ -250,6 +617,110 @@ ERROR:  new row for relation "list_default" violates partition constraint
 DETAIL:  Failing row contains (a, 10).
 -- ok
 update list_default set a = 'x' where a = 'd';
--- cleanup
+drop table list_parted;
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+ERROR:  new row for relation "sub_part2" violates partition constraint
+DETAIL:  Failing row contains (2, 10, 2).
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Triggers can cause UPDATE row movement if it modified partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+drop trigger parted_mod_b ON sub_part1 ;
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+drop trigger trig_skip_delete ON sub_part1 ;
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 |  1 | 70
+ list_part1 | 2 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop table non_parted;
+drop function func_parted_mod_b ( ) ;
 drop table range_parted;
 drop table list_parted;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 66d1fec..02d4c5e 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,25 +107,203 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+
+---------------------------
+-- UPDATE with row movement
+---------------------------
+
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b int,
+	c numeric
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+
+-- Create partitions intentionally in descending bound order, so as to test
+-- that the sub plans are getting ordered in ascending bound order rather than ordered by the oid values.
+create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (c numeric, a text, b int);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+create table part_c_1_100 (b int, c numeric, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, NULL), (''a'', 10, 200), (''b'', 12, 96), (''b'', 13, 97), (''b'', 15, 105), (''b'', 17, 105)'
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_1_100 set c = c + 20 where c = 96;
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+select a, b, c from part_c_1_100 order by 1, 2, 3;
+select a, b, c from part_c_100_200 order by 1, 2, 3;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+-- cleanup
+drop view upview;
+
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+:init_range_parted;
+
+-- Enabling OLD TABLE capture for both DELETE as well as UPDATE stmt triggers
+-- should not cause DELETEd rows to be captured twice. Similar thing for
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_c100_200 before update or insert on part_c_100_200
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_c100_200 ON part_c_100_200;
+drop function func_parted_mod_b();
+
+
+-- statement triggers with update row movement
+---------------------------------------------------
+
+:init_range_parted;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_100_200
+create trigger c100_delete_trig
+  after delete on part_c_100_200 for each statement execute procedure trigfunc();
+create trigger c100_update_trig
+  after update on part_c_100_200 for each statement execute procedure trigfunc();
+create trigger c100_insert_trig
+  after insert on part_c_100_200 for each statement execute procedure trigfunc();
+
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger c100_delete_trig ON part_c_100_200;
+drop trigger c100_update_trig ON part_c_100_200;
+drop trigger c100_insert_trig ON part_c_100_200;
+
+drop table mintab;
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
 
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
 insert into range_parted values ('c', 9);
@@ -134,6 +312,21 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+-- Update row movement from non-default to default partition.
+-- Fail, default partition is not under part_a_10_a_20;
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
 create table list_parted (
 	a text,
 	b int
@@ -148,6 +341,82 @@ update list_default set a = 'a' where a = 'd';
 -- ok
 update list_default set a = 'x' where a = 'd';
 
--- cleanup
+drop table list_parted;
+
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+
+
+-----------
+-- Triggers can cause UPDATE row movement if it modified partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger parted_mod_b ON sub_part1 ;
+
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger trig_skip_delete ON sub_part1 ;
+
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+drop table non_parted;
+
+drop function func_parted_mod_b ( ) ;
 drop table range_parted;
 drop table list_parted;
#170Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: amul sul (#168)
Re: UPDATE of partition key

On 21 September 2017 at 19:52, amul sul <sulamul@gmail.com> wrote:

On Wed, Sep 20, 2017 at 9:27 PM, Amit Khandekar <amitdkhan.pg@gmail.com>
wrote:

On 20 September 2017 at 00:06, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Sep 15, 2017 at 7:25 AM, Amit Khandekar <amitdkhan.pg@gmail.com>
wrote:

[ new patch ]

86 - (event == TRIGGER_EVENT_UPDATE &&
!trigdesc->trig_update_after_row))
87 + (event == TRIGGER_EVENT_UPDATE &&
!trigdesc->trig_update_after_row) ||
88 + (event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup
== NULL)))
89 return;
90 }

Only one of oldtup and newtup will be valid at a time. Can we improve
this check accordingly?

For e.g.:
(event == TRIGGER_EVENT_UPDATE && )(HeapTupleIsValid(oldtup) ^
ItemPointerIsValid(newtup)))))

Ok, I will be doing this as below :
-  (event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup == NULL)))
+ (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))

At other places in the function, oldtup and newtup are checked for
NULL, so to be consistent, I haven't used HeapTupleIsValid.

Actually, it won't happen that both oldtup and newtup are NULL in any
of delete, insert, or update, but I haven't added an Assert for this,
because that has been true even on HEAD.

Will include the above minor change in the next patch when more changes come in.

247
248 + /*
249 + * EDB: In case this is part of update tuple routing, put this row
into the
250 + * transition NEW TABLE if we are capturing transition tables. We
need to
251 + * do this separately for DELETE and INSERT because they happen on
252 + * different tables.
253 + */
254 + if (mtstate->operation == CMD_UPDATE &&
mtstate->mt_transition_capture)
255 + ExecARUpdateTriggers(estate, resultRelInfo, NULL,
256 + NULL,
257 + tuple,
258 + NULL,
259 + mtstate->mt_transition_capture);
260 +
261 list_free(recheckIndexes);

267
268 + /*
269 + * EDB: In case this is part of update tuple routing, put this row
into the
270 + * transition OLD TABLE if we are capturing transition tables. We
need to
271 + * do this separately for DELETE and INSERT because they happen on
272 + * different tables.
273 + */
274 + if (mtstate->operation == CMD_UPDATE &&
mtstate->mt_transition_capture)
275 + ExecARUpdateTriggers(estate, resultRelInfo, tupleid,
276 + oldtuple,
277 + NULL,
278 + NULL,
279 + mtstate->mt_transition_capture);
280 +

Initially, I wondered why we can't have the above code right after
ExecInsert() & ExecDelete() in ExecUpdate() respectively?

We can do that for ExecDelete(), but not easily in the ExecInsert()
case, because ExecInsert() internally looks up the correct partition's
resultRelInfo for the insert, and resultRelInfo is restored to the old
one before returning to ExecUpdate(). That's why the current logic
seems reasonable for now. Is there anything else we can do?

Yes, resultRelInfo is different when we return from ExecInsert().
Also, I think the trigger and transition capture should be done immediately
after the rows are inserted. This is true for existing code also.
Furthermore, there is a dependency of ExecARUpdateTriggers() on
ExecARInsertTriggers(). transition_capture is passed NULL if we
already captured the tuple in ExecARUpdateTriggers(). It looks simpler
to do all this at a single place.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#171Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#166)
2 attachment(s)
Re: UPDATE of partition key

Below are some performance figures. Overall, there does not appear to
be a noticeable difference between partition-key updates with and
without row movement (which is surprising), nor between
non-partition-key updates with and without the patch.

All the values are in milliseconds.

Configuration :

shared_buffers = 8GB
maintenance_work_mem = 4GB
synchronous_commit = off
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
log_line_prefix = '%t [%p] '
max_wal_size = 5GB
max_connections = 200

The attached files were used to create a partition tree made up of 16
partitioned tables, each containing 125 partitions. The first half of
the 2000 partitions is filled with 10 million rows; update row
movement moves the data to the other half of the partitions.

gen.sql : Creates the partitions.
insert.data : This data file is uploaded at [1]. Used "COPY ptab
from '$PWD/insert.data' "
index.sql : Optionally, Create index on column d.

The schema looks like this :

CREATE TABLE ptab (a date, b int, c int, d int) PARTITION BY RANGE (a, b);

CREATE TABLE ptab_1_1 PARTITION OF ptab
for values from ('1900-01-01', 1) to ('1900-01-01', 7501)
PARTITION BY range (c);
CREATE TABLE ptab_1_1_1 PARTITION OF ptab_1_1
for values from (1) to (81);
CREATE TABLE ptab_1_1_2 PARTITION OF ptab_1_1
for values from (81) to (161);
..........
..........
CREATE TABLE ptab_1_2 PARTITION OF ptab
for values from ('1900-01-01', 7501) to ('1900-01-01', 15001)
PARTITION BY range (c);
..........
..........

On 20 September 2017 at 00:06, Robert Haas <robertmhaas@gmail.com> wrote:

I wonder how much more expensive it
is to execute UPDATE root SET a = a + 1 WHERE a = 1 on a table with
1000 subpartitions with this patch than without, assuming the update
succeeds in both cases.

UPDATE query used : UPDATE ptab set d = d + 1 where d = 1; -- where d
is not a partition key of any of the partitions.
This query updates 8 rows out of 10 million rows.
With HEAD : 2953.691 , 2862.298 , 2855.286 , 2835.879 (avg : 2876)
With Patch : 2933.719 , 2832.463 , 2749.979 , 2820.416 (avg : 2834)
(All the values are in milliseconds.)

suppose you make a table with 1000 partitions each containing
10,000 tuples and update them all, and consider three scenarios: (1)
partition key not updated but all tuples subject to non-HOT updates
because the updated column is indexed, (2) partition key updated but
no tuple movement required as a result, (3) partition key updated and
all tuples move to a different partition.

Note that the following figures do not represent a consistent set;
they keep varying between runs. For example, even though the
partition-key update without row movement appears to have taken a bit
more time with the patch than with HEAD, a new set of test runs might
even end up the other way round.

NPK : 42089 (patch)
NPKI : 81593 (patch)
PK : 45250 (patch) , 44944 (HEAD)
PKR : 46701 (patch)

The above figures are in milliseconds. The explanations of the above
short-forms :

NPK :
Update of column that is not a partition-key.
UPDATE query used : UPDATE ptab set d = d + 1 ; This updates *all* rows.

NPKI :
Update of column that is not a partition-key. And this column is
indexed (Used attached file index.sql).
UPDATE query used : UPDATE ptab set d = d + 1 ; This updates *all* rows.

PK :
Update of partition key, but row movement does not occur. There are no
indexed columns.
UPDATE query used : UPDATE ptab set a = a + '1 hour'::interval ;

PKR :
Update of partition key, with all rows moved to other partitions.
There are no indexed columns.
UPDATE query used : UPDATE ptab set a = a + '2 years'::interval ;

[1]: https://drive.google.com/open?id=0B_YJCqIAxKjeN3hMXzdDejlNYmlpWVJpaU9mWUhFRVhXTG5Z

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

gen.tar.gz (application/x-gzip)
index.tar.gz (application/x-gzip)
#172amul sul
sulamul@gmail.com
In reply to: amul sul (#160)
Re: UPDATE of partition key

On Wed, Sep 13, 2017 at 4:24 PM, amul sul <sulamul@gmail.com> wrote:

On Sun, Sep 10, 2017 at 8:47 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Fri, Sep 8, 2017 at 4:51 PM, amul sul <sulamul@gmail.com> wrote:

On Thu, May 18, 2017 at 9:13 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Wed, May 17, 2017 at 5:17 PM, Robert Haas <robertmhaas@gmail.com>
wrote:

On Wed, May 17, 2017 at 6:29 AM, Amit Kapila
<amit.kapila16@gmail.com>
wrote:

I think we can do this even without using an additional infomask
bit.
As suggested by Greg up thread, we can set InvalidBlockId in ctid to
indicate such an update.

Hmm. How would that work?

We can pass a flag say row_moved (or require_row_movement) to
heap_delete which will in turn set InvalidBlockId in ctid instead of
setting it to self. Then the ExecUpdate needs to check for the same
and return an error when heap_update is not successful (result !=
HeapTupleMayBeUpdated). Can you explain what difficulty are you
envisioning?

Attaching WIP patch incorporates the above logic, although I am yet to
check
all the code for places which might be using ip_blkid. I have got a
small
query here,
do we need an error on HeapTupleSelfUpdated case as well?

No, because that case is anyway a no-op (or an error, depending on
whether the tuple is updated/deleted by the same command or a later
command). Basically, even if the row hadn't been moved to another
partition, we would not have allowed the command to proceed with the
update. This handling makes the command fail rather than be a no-op in
cases where otherwise (when the tuple is not moved to another
partition) the command would have succeeded.

Thank you.

I've rebased the patch against Amit Khandekar's latest patch (v17_rebased_2).
Also, added an ip_blkid validation check in the heap_get_latest_tid() and
rewrite_heap_tuple() functions, because an ItemPointerEquals() check alone
is no longer sufficient after this patch.

FYI, I have posted this patch in a separate thread :
/messages/by-id/CAAJ_b95PkwojoYfz0bzXU8OokcTVGzN6vYGCNVUukeUDrnF3dw@mail.gmail.com

Regards,
Amul

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#173Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#169)
Re: UPDATE of partition key

On Fri, Sep 22, 2017 at 1:57 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

The patch for the above change is :
0002-Prevent-a-redundant-ConvertRowtypeExpr-node.patch

Thinking about this a little more, I'm wondering about how this case
arises. I think that for this patch to avoid multiple conversions,
we'd have to be calling map_variable_attnos on an expression and then
calling map_variable_attnos on that expression again.

If I understand correctly, the reason for changing mt_partitions from
ResultRelInfo * to ResultRelInfo ** is that, currently, all of the
ResultRelInfos for a partitioning hierarchy are allocated as a single
chunk, but we can't do that and also reuse the ResultRelInfos created
during InitPlan. I suggest that we do this as a preparatory patch.

Ok, will prepare a separate patch. Do you mean to include in that
patch the changes I did in ExecSetupPartitionTupleRouting() that
re-use the ResultRelInfo structures of per-subplan update result rels?

Above changes are in attached
0001-Re-use-UPDATE-result-rels-created-in-InitPlan.patch.

No, not all of those changes. Just the adjustments to make
ModifyTableState's mt_partitions be of type ResultRelInfo ** rather
than ResultRelInfo *, and anything closely related to that. Not, for
example, the num_update_rri stuff.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#174Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#173)
Re: UPDATE of partition key

On 30 September 2017 at 01:26, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Sep 29, 2017 at 3:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Sep 22, 2017 at 1:57 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

The patch for the above change is :
0002-Prevent-a-redundant-ConvertRowtypeExpr-node.patch

Thinking about this a little more, I'm wondering about how this case
arises. I think that for this patch to avoid multiple conversions,
we'd have to be calling map_variable_attnos on an expression and then
calling map_variable_attnos on that expression again.

We are not calling map_variable_attnos() twice. The first time it is
called, the ConvertRowtypeExpr node is already present if the
expression is a whole-row var; that node was already added by
adjust_appendrel_attrs(). So the conversion is done by two different
functions.

For ConvertRowtypeExpr, map_variable_attnos_mutator() recursively
calls map_variable_attnos_mutator() for ConvertRowtypeExpr->arg with
coerced_var=true.

I guess I didn't quite finish this thought, sorry. Maybe it's
obvious, but the point I was going for is: why would we do that, vs.
just converting once?

The first time a ConvertRowtypeExpr node gets added to the expression
is when adjust_appendrel_attrs() is called for each of the child
tables. Here, for each child table, when the parent parse tree is
converted into the child parse tree, the whole-row var (in the
RETURNING or WITH CHECK OPTIONS expr) is wrapped with a
ConvertRowtypeExpr, so the child parse tree (or the child WCO expr)
has this ConvertRowtypeExpr node.

The second time this node is added is during update-tuple-routing in
ExecInitModifyTable(), when map_partition_varattnos() is called for
each of the partitions to convert from the first per-subplan
RETURNING/WCO expression to the RETURNING/WCO expression belonging to
the leaf partition. This second conversion happens for the leaf
partitions which are not already present in per-subplan UPDATE result
rels.

So the first conversion is from parent to child while building
per-subplan plans, and the second is from first per-subplan child to
another child for building expressions of the leaf partitions.

So suppose the root partitioned table RETURNING expression is a whole
row var wr(r) where r is its composite type representing the root
table type.
Then, one of its UPDATE child tables will have its RETURNING
expression converted like this :
wr(r) ===> CRE(r) -> wr(c1)
where CRE(r) represents ConvertRowtypeExpr of result type r, which has
its arg pointing to wr(c1) which is a whole row var of composite type
c1 for the child table c1. So this node converts from composite type
of child table to composite type of root table.

Now, when the second conversion occurs for the leaf partition (i.e.
during update-tuple-routing), the conversion looks like this :
CRE(r) -> wr(c1) ===> CRE(r) -> wr(c2)
But without the 0002-Prevent-a-redundant-ConvertRowtypeExpr-node.patch,
the conversion would have looked like this :
CRE(r) -> wr(c1) ===> CRE(r) -> CRE(c1) -> wr(c2)
In short, we omit the intermediate CRE(c1) node.
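To make the above concrete, here is a hypothetical illustration (table and column names invented, not taken from the thread's test scripts; assumes standard declarative partitioning syntax, and row movement on UPDATE only works with the patch applied). A whole-row var in RETURNING must be converted to the root's rowtype whenever a child's attribute ordering differs from the root's, which is exactly where the CRE nodes above come from:

```sql
-- Root type r; child c1 attached with a different column order.
CREATE TABLE tab_r (a int, b text) PARTITION BY RANGE (a);
CREATE TABLE tab_c1 (b text, a int);
ALTER TABLE tab_r ATTACH PARTITION tab_c1 FOR VALUES FROM (1) TO (10);
CREATE TABLE tab_c2 PARTITION OF tab_r FOR VALUES FROM (10) TO (20);

-- RETURNING tab_r is a whole-row var wr(r) in the parent's parse tree.
-- For the subplan on tab_c1 it becomes CRE(r) -> wr(c1); when a tuple
-- is routed to tab_c2 during UPDATE row movement, the leaf expression
-- must instead produce CRE(r) -> wr(c2).
UPDATE tab_r SET a = a + 9 RETURNING tab_r;
```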

While writing this down, I observed that after multi-level partition
tree expansion was introduced, the child table expressions are not
converted directly from the root. Instead, they are converted from
their immediate parent. So there is a chain of conversions : to leaf
from its parent, to that parent from its parent, and so on from the
root. Effectively, during the first conversion, there are that many
ConvertRowtypeExpr nodes one above the other already present in the
UPDATE result rel expressions. But my patch handles the optimization
only for the leaf partition conversions.

If the expression already has a CRE : CRE(rr) -> wr(r)
Parent-to-child conversion : CRE(p) -> wr(r) ===> CRE(rr) -> CRE(r) -> wr(c1)
With patch : CRE(rr) -> CRE(r) -> wr(c1) ===> CRE(rr) -> CRE(r) -> wr(c2)


#175Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#174)
Re: UPDATE of partition key

On Tue, Oct 3, 2017 at 8:16 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

While writing this down, I observed that after multi-level partition
tree expansion was introduced, the child table expressions are not
converted directly from the root. Instead, they are converted from
their immediate parent. So there is a chain of conversions : to leaf
from its parent, to that parent from its parent, and so on from the
root. Effectively, during the first conversion, there are that many
ConvertRowtypeExpr nodes one above the other already present in the
UPDATE result rel expressions. But my patch handles the optimization
only for the leaf partition conversions.

If the expression already has a CRE : CRE(rr) -> wr(r)
Parent-to-child conversion : CRE(p) -> wr(r) ===> CRE(rr) -> CRE(r) -> wr(c1)
With patch : CRE(rr) -> CRE(r) -> wr(c1) ===> CRE(rr) -> CRE(r) -> wr(c2)

Maybe adjust_appendrel_attrs() should have a similar provision for
avoiding extra ConvertRowTypeExpr nodes?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#176Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#173)
3 attachment(s)
Re: UPDATE of partition key

On 30 September 2017 at 01:23, Robert Haas <robertmhaas@gmail.com> wrote:

If I understand correctly, the reason for changing mt_partitions from
ResultRelInfo * to ResultRelInfo ** is that, currently, all of the
ResultRelInfos for a partitioning hierarchy are allocated as a single
chunk, but we can't do that and also reuse the ResultRelInfos created
during InitPlan. I suggest that we do this as a preparatory patch.

Ok, will prepare a separate patch. Do you mean to include in that
patch the changes I did in ExecSetupPartitionTupleRouting() that
re-use the ResultRelInfo structures of per-subplan update result rels?

Above changes are in attached
0001-Re-use-UPDATE-result-rels-created-in-InitPlan.patch.

No, not all of those changes. Just the adjustments to make
ModifyTableState's mt_partitions be of type ResultRelInfo ** rather
than ResultRelInfo *, and anything closely related to that. Not, for
example, the num_update_rri stuff.

Ok. Attached is the patch modified to have changes only to handle
array of ResultRelInfo * instead of array of ResultRelInfo.

-------

On 4 October 2017 at 01:08, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Oct 3, 2017 at 8:16 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

While writing this down, I observed that after multi-level partition
tree expansion was introduced, the child table expressions are not
converted directly from the root. Instead, they are converted from
their immediate parent. So there is a chain of conversions : to leaf
from its parent, to that parent from its parent, and so on from the
root. Effectively, during the first conversion, there are that many
ConvertRowtypeExpr nodes one above the other already present in the
UPDATE result rel expressions. But my patch handles the optimization
only for the leaf partition conversions.

Maybe adjust_appendrel_attrs() should have a similar provision for
avoiding extra ConvertRowTypeExpr nodes?

Yeah, I think we should be able to do that. Will check.

------

On 19 September 2017 at 13:15, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 18 September 2017 at 20:45, Dilip Kumar <dilipbalaut@gmail.com> wrote:

Please find few more comments.

+ * in which they appear in the PartitionDesc. Also, extract the
+ * partition key columns of the root partitioned table. Those of the
+ * child partitions would be collected during recursive expansion.
*/
+ pull_child_partition_columns(&all_part_cols, oldrelation, oldrelation);
expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
lockmode, &root->append_rel_list,
+   &all_part_cols,

pcinfo->all_part_cols is only used in case of update, I think we can
call pull_child_partition_columns
only if rte has updateCols?

@@ -2029,6 +2034,7 @@ typedef struct PartitionedChildRelInfo

Index parent_relid;
List *child_rels;
+ Bitmapset *all_part_cols;
} PartitionedChildRelInfo;

I might be missing something, but do we really need to store
all_part_cols inside the
PartitionedChildRelInfo, can't we call pull_child_partition_columns
directly inside
inheritance_planner whenever we realize that RTE has some updateCols
and we want to
check the overlap?

One extra thing we will have to do is open and close the partitioned
rels again. The idea was that we collect the bitmap *while* we are
already expanding the tree and the rel is open. Will check if this is
feasible.

While giving this suggestion of Dilip's more thought, I found that
pull_child_partition_columns() is called with child_rel and its
immediate parent. That means it maps the child rel attributes to its
immediate parent. If that immediate parent is not the root partrel,
the conversion is not sufficient: we need to map child rel attnos to
root partrel attnos. So for a partition tree with three or more
levels, where the bottom partitioned rel has a different attribute
ordering than the root, this will not work.

Before the commit that enabled recursive multi-level partition tree
expansion, pull_child_partition_columns() was always called with
child_rel and the root rel. So this issue crept in when I rebased over
that commit, overlooking the fact that the parent rel is now the
immediate parent, not the root.

Anyway, I think Dilip's suggestion makes sense: we can do the
finding-all-part-cols work separately in inheritance_planner() using
the partitioned_rels handle. Re-opening the partitioned tables should
be cheap, because they have already been opened earlier, so they are
available in the relcache. So I did this as suggested, using a new
function get_all_partition_cols(). While doing that, I have ensured
that we use the root rel to map all the child rel attnos, so the above
issue is now fixed.

Also added test scenarios for the above issue. Namely, made the
partition tree three levels deep, and added some specific scenarios
where the code used to wrongly error out without trying to move the
tuple, because it determined that the partition key was not updated.
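As a sketch of the kind of scenario involved (names invented, assuming standard declarative partitioning syntax): a three-level tree where an intermediate partitioned table has a different attribute ordering than the root, so attnos must be mapped all the way to the root to decide whether a partition key is updated:

```sql
CREATE TABLE root3 (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE mid3 (b int, a int) PARTITION BY RANGE (b);  -- b is attno 1 here, attno 2 in root3
ALTER TABLE root3 ATTACH PARTITION mid3 FOR VALUES FROM (1) TO (100);
CREATE TABLE leaf3 PARTITION OF mid3 FOR VALUES FROM (1) TO (100);

-- "b" is a partition key of the mid3 subtree.  Mapping leaf3's attnos
-- only to its immediate parent mid3 would yield the wrong root attno,
-- so the "is a partition-key column updated?" check must map to root3.
UPDATE root3 SET b = b + 1;
```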

---------

Though we re-use the update result rels, the WCO and Returning
expressions were not getting re-used from those update result rels.
This check was missing :
@@ -2059,7 +2380,7 @@ ExecInitModifyTable(ModifyTable *node, EState
*estate, int eflags)
for (i = 0; i < mtstate->mt_num_partitions; i++)
{
Relation partrel;
List *rlist;

   resultRelInfo = mtstate->mt_partitions[i];
+
+ /*
+ * If we are referring to a resultRelInfo from one of the update
+ * result rels, that result rel would already have a returningList
+ * built.
+ */
+ if (resultRelInfo->ri_projectReturning)
+    continue;
+
  partrel = resultRelInfo->ri_RelationDesc;

Added this check in the patch.

----------

On 22 September 2017 at 16:13, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 21 September 2017 at 19:52, amul sul <sulamul@gmail.com> wrote:

86 - (event == TRIGGER_EVENT_UPDATE &&
!trigdesc->trig_update_after_row))
87 + (event == TRIGGER_EVENT_UPDATE &&
!trigdesc->trig_update_after_row) ||
88 + (event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup
== NULL)))
89 return;
90 }

Exactly one of oldtup and newtup will be valid at a time. Can we
improve this check accordingly?

For e.g.:
(event == TRIGGER_EVENT_UPDATE && (HeapTupleIsValid(oldtup) ^
ItemPointerIsValid(newtup)))

Ok, I will be doing this as below :
-  (event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup == NULL)))
+ (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))

Have done this in the attached patch.

--------

Attached are these patches :

Preparatory patches :
0001-Prepare-for-re-using-UPDATE-result-rels-during-tuple.patch
0002-Prevent-a-redundant-ConvertRowtypeExpr-node.patch
Main patch :
update-partition-key_v20.patch

Thanks
-Amit Khandekar

Attachments:

0001-Prepare-for-re-using-UPDATE-result-rels-during-tuple.patch (application/octet-stream)
From d93fd47b830c523aed5d5f8f8e1d7980ae9f03d4 Mon Sep 17 00:00:00 2001
From: Amit Khandekar <amit.khandekar@enterprisedb.com>
Date: Wed, 4 Oct 2017 18:11:48 +0530
Subject: [PATCH 1/2] Prepare for re-using UPDATE result rels during tuple
 routing.

For UPDATE tuple routing, we need to have result rels for the leaf
partitions. Since we already have at least a subset of those result
rels in the form of UPDATE per-subplan result rels, arrange for
re-using them instead of creating new ones for all of the leaf
partitions.

This commit does not actually re-use the UPDATE result rels. It just
prepares the infrastructure for the same. This involves making the
mtstate->mt_partitions an array of ResultRelInfo * rather than an
array of ResultRelInfo. This way, in the future for e.g. during
UPDATE tuple routing, when a leaf partition already has a result rel
allocated in the mtstate->resultRelInfo, the mt_partitions array
element would point to this allocated structure rather than
allocating a new structure.
---
 src/backend/commands/copy.c            | 10 ++---
 src/backend/executor/execMain.c        | 25 +++++++++---
 src/backend/executor/nodeModifyTable.c | 73 ++++++++++++++++++++--------------
 src/include/executor/executor.h        |  2 +-
 src/include/nodes/execnodes.h          |  2 +-
 5 files changed, 69 insertions(+), 43 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index e875880..ebaccfb 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -167,7 +167,7 @@ typedef struct CopyStateData
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;	/* Number of entries in the above array */
 	int			num_partitions; /* Number of members in the following arrays */
-	ResultRelInfo *partitions;	/* Per partition result relation */
+	ResultRelInfo **partitions;	/* Per partition result relation pointers */
 	TupleConversionMap **partition_tupconv_maps;
 	TupleTableSlot *partition_tuple_slot;
 	TransitionCaptureState *transition_capture;
@@ -2459,7 +2459,7 @@ CopyFrom(CopyState cstate)
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo *partitions;
+		ResultRelInfo **partitions;
 		TupleConversionMap **partition_tupconv_maps;
 		TupleTableSlot *partition_tuple_slot;
 		int			num_parted,
@@ -2495,7 +2495,7 @@ CopyFrom(CopyState cstate)
 			for (i = 0; i < cstate->num_partitions; ++i)
 			{
 				cstate->transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(cstate->partitions[i].ri_RelationDesc),
+					convert_tuples_by_name(RelationGetDescr(cstate->partitions[i]->ri_RelationDesc),
 										   RelationGetDescr(cstate->rel),
 										   gettext_noop("could not convert row type"));
 			}
@@ -2626,7 +2626,7 @@ CopyFrom(CopyState cstate)
 			 * to the selected partition.
 			 */
 			saved_resultRelInfo = resultRelInfo;
-			resultRelInfo = cstate->partitions + leaf_part_index;
+			resultRelInfo = cstate->partitions[leaf_part_index];
 
 			/* We do not yet have a way to insert into a foreign partition */
 			if (resultRelInfo->ri_FdwRoutine)
@@ -2856,7 +2856,7 @@ CopyFrom(CopyState cstate)
 		}
 		for (i = 0; i < cstate->num_partitions; i++)
 		{
-			ResultRelInfo *resultRelInfo = cstate->partitions + i;
+			ResultRelInfo *resultRelInfo = cstate->partitions[i];
 
 			ExecCloseIndices(resultRelInfo);
 			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 62fb05e..995b580 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -3246,7 +3246,7 @@ EvalPlanQualEnd(EPQState *epqstate)
  * Output arguments:
  * 'pd' receives an array of PartitionDispatch objects with one entry for
  *		every partitioned table in the partition tree
- * 'partitions' receives an array of ResultRelInfo objects with one entry for
+ * 'partitions' receives an array of ResultRelInfo* objects with one entry for
  *		every leaf partition in the partition tree
  * 'tup_conv_maps' receives an array of TupleConversionMap objects with one
  *		entry for every leaf partition (required to convert input tuple based
@@ -3269,7 +3269,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionDispatch **pd,
-							   ResultRelInfo **partitions,
+							   ResultRelInfo ***partitions,
 							   TupleConversionMap ***tup_conv_maps,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions)
@@ -3279,6 +3279,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	ListCell   *cell;
 	int			i;
 	ResultRelInfo *leaf_part_rri;
+	ResultRelInfo *leaf_part_arr;
 
 	/*
 	 * Get the information about the partition tree after locking all the
@@ -3287,12 +3288,24 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
 	*pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
 	*num_partitions = list_length(leaf_parts);
-	*partitions = (ResultRelInfo *) palloc(*num_partitions *
-										   sizeof(ResultRelInfo));
+	*partitions = (ResultRelInfo **) palloc(*num_partitions *
+											sizeof(ResultRelInfo *));
 	*tup_conv_maps = (TupleConversionMap **) palloc0(*num_partitions *
 													 sizeof(TupleConversionMap *));
 
 	/*
+	 * For inserts, we need to create all new result rels, so avoid
+	 * repeated pallocs by allocating memory for all the result rels in
+	 * bulk.
+	 *
+	 * XXX: In the future when we support update tuple routing, we will be
+	 * re-using the per-plan update result rels and thus avoid a new result
+	 * rel.
+	 */
+	leaf_part_arr = (ResultRelInfo *) palloc0(*num_partitions *
+											  sizeof(ResultRelInfo));
+
+	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
 	 * given partition's rowtype.  It is attached to the caller-specified node
 	 * (such as ModifyTableState) and released when the node finishes
@@ -3300,7 +3313,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	 */
 	*partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = *partitions;
+	leaf_part_rri = leaf_part_arr;
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
@@ -3345,7 +3358,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		leaf_part_rri++;
+		(*partitions)[i] = leaf_part_rri++;
 		i++;
 	}
 }
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 845c409..96c464e 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -303,7 +303,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * the selected partition.
 		 */
 		saved_resultRelInfo = resultRelInfo;
-		resultRelInfo = mtstate->mt_partitions + leaf_part_index;
+		resultRelInfo = mtstate->mt_partitions[leaf_part_index];
 
 		/* We do not yet have a way to insert into a foreign partition */
 		if (resultRelInfo->ri_FdwRoutine)
@@ -1498,25 +1498,11 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	if (mtstate->mt_transition_capture != NULL ||
 		mtstate->mt_oc_transition_capture != NULL)
 	{
-		ResultRelInfo *resultRelInfos;
 		int			numResultRelInfos;
 
-		/* Find the set of partitions so that we can find their TupleDescs. */
-		if (mtstate->mt_partition_dispatch_info != NULL)
-		{
-			/*
-			 * For INSERT via partitioned table, so we need TupleDescs based
-			 * on the partition routing table.
-			 */
-			resultRelInfos = mtstate->mt_partitions;
-			numResultRelInfos = mtstate->mt_num_partitions;
-		}
-		else
-		{
-			/* Otherwise we need the ResultRelInfo for each subplan. */
-			resultRelInfos = mtstate->resultRelInfo;
-			numResultRelInfos = mtstate->mt_nplans;
-		}
+		numResultRelInfos = (mtstate->mt_partition_tuple_slot != NULL ?
+							 mtstate->mt_num_partitions :
+							 mtstate->mt_nplans);
 
 		/*
 		 * Build array of conversion maps from each child's TupleDesc to the
@@ -1526,12 +1512,36 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 		 */
 		mtstate->mt_transition_tupconv_maps = (TupleConversionMap **)
 			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
-		for (i = 0; i < numResultRelInfos; ++i)
+
+		/* Choose the right set of partitions */
+		if (mtstate->mt_partition_dispatch_info != NULL)
+		{
+			/*
+			 * For tuple routing among partitions, we need TupleDescs based
+			 * on the partition routing table.
+			 */
+			ResultRelInfo **resultRelInfos = mtstate->mt_partitions;
+
+			for (i = 0; i < numResultRelInfos; ++i)
+			{
+				mtstate->mt_transition_tupconv_maps[i] =
+					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
+										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
+										   gettext_noop("could not convert row type"));
+			}
+		}
+		else
 		{
-			mtstate->mt_transition_tupconv_maps[i] =
-				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
-									   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-									   gettext_noop("could not convert row type"));
+			/* Otherwise we need the ResultRelInfo for each subplan. */
+			ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+
+			for (i = 0; i < numResultRelInfos; ++i)
+			{
+				mtstate->mt_transition_tupconv_maps[i] =
+					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
+										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
+										   gettext_noop("could not convert row type"));
+			}
 		}
 
 		/*
@@ -1935,7 +1945,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo *partitions;
+		ResultRelInfo **partitions;
 		TupleConversionMap **partition_tupconv_maps;
 		TupleTableSlot *partition_tuple_slot;
 		int			num_parted,
@@ -2014,14 +2024,16 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			   mtstate->mt_nplans == 1);
 		wcoList = linitial(node->withCheckOptionLists);
 		plan = mtstate->mt_plans[0];
-		resultRelInfo = mtstate->mt_partitions;
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
+			Relation	partrel;
 			List	   *mapped_wcoList;
 			List	   *wcoExprs = NIL;
 			ListCell   *ll;
 
+			resultRelInfo = mtstate->mt_partitions[i];
+			partrel = resultRelInfo->ri_RelationDesc;
+
 			/* varno = node->nominalRelation */
 			mapped_wcoList = map_partition_varattnos(wcoList,
 													 node->nominalRelation,
@@ -2037,7 +2049,6 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 
 			resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
 			resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
-			resultRelInfo++;
 		}
 	}
 
@@ -2088,13 +2099,15 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
 		 * are handled above.
 		 */
-		resultRelInfo = mtstate->mt_partitions;
 		returningList = linitial(node->returningLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
+			Relation	partrel;
 			List	   *rlist;
 
+			resultRelInfo = mtstate->mt_partitions[i];
+			partrel = resultRelInfo->ri_RelationDesc;
+
 			/* varno = node->nominalRelation */
 			rlist = map_partition_varattnos(returningList,
 											node->nominalRelation,
@@ -2376,7 +2389,7 @@ ExecEndModifyTable(ModifyTableState *node)
 	}
 	for (i = 0; i < node->mt_num_partitions; i++)
 	{
-		ResultRelInfo *resultRelInfo = node->mt_partitions + i;
+		ResultRelInfo *resultRelInfo = node->mt_partitions[i];
 
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 7708818..2f54031 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -210,7 +210,7 @@ extern void ExecSetupPartitionTupleRouting(Relation rel,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionDispatch **pd,
-							   ResultRelInfo **partitions,
+							   ResultRelInfo ***partitions,
 							   TupleConversionMap ***tup_conv_maps,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index c6d3021..9187f7a 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -978,7 +978,7 @@ typedef struct ModifyTableState
 	int			mt_num_dispatch;	/* Number of entries in the above array */
 	int			mt_num_partitions;	/* Number of members in the following
 									 * arrays */
-	ResultRelInfo *mt_partitions;	/* Per partition result relation */
+	ResultRelInfo **mt_partitions;	/* Per partition result relation pointers */
 	TupleConversionMap **mt_partition_tupconv_maps;
 	/* Per partition tuple conversion map */
 	TupleTableSlot *mt_partition_tuple_slot;
-- 
2.1.4

0002-Prevent-a-redundant-ConvertRowtypeExpr-node.patch (application/octet-stream)
From 4541795a758ab729b5015d1f878abd2f1d1eae6a Mon Sep 17 00:00:00 2001
From: Amit Khandekar <amit.khandekar@enterprisedb.com>
Date: Wed, 4 Oct 2017 18:20:49 +0530
Subject: [PATCH 2/2] Prevent a redundant ConvertRowtypeExpr node.

In case RETURNING clause has a whole row var, for mapping this whole
row var from parent to child, we add a ConvertRowtypeExpr node on top
of the whole row var. This node's final result type is the parent
composite type. But for mapping a whole row var from one child
partition to the other child partitions, the child expression can
already have a ConvertRowtypeExpr. In such case, prevent another
ConvertRowtypeExpr expression on top of it. Instead, modify the
containing var of the already-existing ConvertRowtypeExpr.
---
 src/backend/rewrite/rewriteManip.c | 53 ++++++++++++++++++++++++++++++--------
 1 file changed, 42 insertions(+), 11 deletions(-)

diff --git a/src/backend/rewrite/rewriteManip.c b/src/backend/rewrite/rewriteManip.c
index c5773ef..9290c7f 100644
--- a/src/backend/rewrite/rewriteManip.c
+++ b/src/backend/rewrite/rewriteManip.c
@@ -1224,6 +1224,7 @@ typedef struct
 	/* Target type when converting whole-row vars */
 	Oid			to_rowtype;
 	bool	   *found_whole_row;	/* output flag */
+	bool		coerced_var;	/* var is under ConvertRowTypeExpr */
 } map_variable_attnos_context;
 
 static Node *
@@ -1267,22 +1268,29 @@ map_variable_attnos_mutator(Node *node,
 					/* Don't convert unless necessary. */
 					if (context->to_rowtype != var->vartype)
 					{
-						ConvertRowtypeExpr *r;
-
 						/* Var itself is converted to the requested type. */
 						newvar->vartype = context->to_rowtype;
 
 						/*
-						 * And a conversion node on top to convert back to the
-						 * original type.
+						 * If this var is already under a ConvertRowtypeExpr,
+						 * we don't have to add another one.
 						 */
-						r = makeNode(ConvertRowtypeExpr);
-						r->arg = (Expr *) newvar;
-						r->resulttype = var->vartype;
-						r->convertformat = COERCE_IMPLICIT_CAST;
-						r->location = -1;
-
-						return (Node *) r;
+						if (!context->coerced_var)
+						{
+							ConvertRowtypeExpr *r;
+
+							/*
+							 * And a conversion node on top to convert back to
+							 * the original type.
+							 */
+							r = makeNode(ConvertRowtypeExpr);
+							r->arg = (Expr *) newvar;
+							r->resulttype = var->vartype;
+							r->convertformat = COERCE_IMPLICIT_CAST;
+							r->location = -1;
+
+							return (Node *) r;
+						}
 					}
 				}
 			}
@@ -1290,6 +1298,28 @@ map_variable_attnos_mutator(Node *node,
 		}
 		/* otherwise fall through to copy the var normally */
 	}
+	else if (IsA(node, ConvertRowtypeExpr))
+	{
+		ConvertRowtypeExpr *r = (ConvertRowtypeExpr *) node;
+
+		/*
+		 * If this is coercing a var (which is typical), convert only the var,
+		 * rather than adding another ConvertRowtypeExpr over it.
+		 */
+		if (IsA(r->arg, Var))
+		{
+			ConvertRowtypeExpr *newnode;
+
+			newnode = (ConvertRowtypeExpr *) palloc(sizeof(ConvertRowtypeExpr));
+			*newnode = *r;
+			context->coerced_var = true;
+			newnode->arg = (Expr *) map_variable_attnos_mutator((Node *) r->arg, context);
+			context->coerced_var = false;
+
+			return (Node *) newnode;
+		}
+		/* Else fall through to the expression tree mutator */
+	}
 	else if (IsA(node, Query))
 	{
 		/* Recurse into RTE subquery or not-yet-planned sublink subquery */
@@ -1321,6 +1351,7 @@ map_variable_attnos(Node *node,
 	context.map_length = map_length;
 	context.to_rowtype = to_rowtype;
 	context.found_whole_row = found_whole_row;
+	context.coerced_var = false;
 
 	*found_whole_row = false;
 
-- 
2.1.4
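For readers trying the patch, here is a minimal illustration of the behavior it enables (a sketch only; the table and partition names are hypothetical, and the bounds follow the built-in range partitioning syntax):

```sql
-- A range-partitioned table with two partitions.
CREATE TABLE tab (a int, b text) PARTITION BY RANGE (a);
CREATE TABLE tab_p1 PARTITION OF tab FOR VALUES FROM (0) TO (10);
CREATE TABLE tab_p2 PARTITION OF tab FOR VALUES FROM (10) TO (20);

INSERT INTO tab VALUES (5, 'x');

-- Without the patch this errors out; with it, the row is moved from
-- tab_p1 to tab_p2 (internally a DELETE from tab_p1 plus an INSERT
-- into tab_p2):
UPDATE tab SET a = 15 WHERE a = 5;

-- If no partition accepts the new key value, an error is still raised:
-- UPDATE tab SET a = 25 WHERE a = 15;
```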

update-partition-key_v20.patch
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index b05a9c2..5a436a1 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -2993,6 +2993,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved to a
+    different partition whose partition constraint the new row satisfies.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3285,9 +3290,20 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</> causes a row to move from one partition to
+       another, there is a chance that a concurrent <command>UPDATE</> or
+       <command>DELETE</> misses this row. Suppose that, during the row
+       movement, the row is still visible to the concurrent session, which is
+       about to perform an <command>UPDATE</> or <command>DELETE</> on the
+       same row. That operation can silently miss the row if the first
+       session deletes it from the partition as part of its
+       <command>UPDATE</> row movement. In such a case, the concurrent
+       <command>UPDATE</>/<command>DELETE</>, being unaware of the row
+       movement, concludes that the row has just been deleted, so there is
+       nothing to be done for it. By contrast, in the usual case where the
+       table is not partitioned, or where there is no row movement, the second
+       session would have identified the newly updated row and carried out
+       the <command>UPDATE</>/<command>DELETE</> on this new row version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 8a1619f..28cfc1a 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In
+   that case, if there is some other partition in the partition tree whose
+   partition constraint this row satisfies, then the row is moved to that
+   partition. If there is no such partition, an error occurs. The error
+   also occurs when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</> and
+   <command>INSERT</> operation. However, a
+   concurrent <command>UPDATE</> or <command>DELETE</> on the same row may
+   miss this row. For details see
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index f5f74af..99b271f 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -154,6 +154,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it will be performed as a
+    <command>DELETE</command> from the original partition followed by an
+    <command>INSERT</command> into the new partition. In this case, all
+    row-level <literal>BEFORE</> <command>UPDATE</command> triggers and all
+    row-level <literal>BEFORE</> <command>DELETE</command> triggers are fired
+    on the original partition. Then all row-level <literal>BEFORE</>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</>
+    triggers are concerned, <literal>AFTER</> <command>DELETE</command> and
+    <literal>AFTER</> <command>INSERT</command> triggers are applied; but
+    <literal>AFTER</> <command>UPDATE</command> triggers are not applied
+    because the <command>UPDATE</command> has been converted to a
+    <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 1ab6dba..737c9e30 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1105,7 +1105,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
 
 /*
  * map_partition_varattnos - maps varattno of any Vars in expr from the
- * parent attno to partition attno.
+ * attnos of the 'from_rel' partition to the attnos of the 'to_rel' partition.
+ * Each rel can be either a leaf partition or a partitioned table.
  *
  * We must allow for cases where physical attnos of a partition can be
  * different from the parent's.
@@ -1118,8 +1119,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
  * are working on Lists, so it's less messy to do the casts internally.
  */
 List *
-map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row)
 {
 	bool		my_found_whole_row = false;
@@ -1128,14 +1129,14 @@ map_partition_varattnos(List *expr, int target_varno,
 	{
 		AttrNumber *part_attnos;
 
-		part_attnos = convert_tuples_by_name_map(RelationGetDescr(partrel),
-												 RelationGetDescr(parent),
+		part_attnos = convert_tuples_by_name_map(RelationGetDescr(to_rel),
+												 RelationGetDescr(from_rel),
 												 gettext_noop("could not convert row type"));
 		expr = (List *) map_variable_attnos((Node *) expr,
-											target_varno, 0,
+											fromrel_varno, 0,
 											part_attnos,
-											RelationGetDescr(parent)->natts,
-											RelationGetForm(partrel)->reltype,
+											RelationGetDescr(from_rel)->natts,
+											RelationGetForm(to_rel)->reltype,
 											&my_found_whole_row);
 	}
 
@@ -2439,6 +2440,77 @@ error_exit:
 }
 
 /*
+ * For each column of rel which is in the partition key or which appears
+ * in an expression which is in the partition key, translate the attribute
+ * number of that column according to the given parent, and add the resulting
+ * column number to the bitmapset, offset as we frequently do by
+ * FirstLowInvalidHeapAttributeNumber.
+ */
+void
+pull_child_partition_columns(Bitmapset **bitmapset,
+							 Relation rel,
+							 Relation parent)
+{
+	PartitionKey key = RelationGetPartitionKey(rel);
+	int16		partnatts = get_partition_natts(key);
+	List	   *partexprs = get_partition_exprs(key);
+	ListCell   *lc;
+	Bitmapset  *child_keycols = NULL;
+	int			i;
+	AttrNumber *map;
+	int			child_keycol = -1;
+
+	/*
+	 * First, compute the complete set of partition columns for this rel. For
+	 * compatibility with the API exposed by pull_varattnos, we offset the
+	 * column numbers by FirstLowInvalidHeapAttributeNumber.
+	 */
+	for (i = 0; i < partnatts; i++)
+	{
+		AttrNumber	partattno = get_partition_col_attnum(key, i);
+
+		if (partattno != 0)
+			child_keycols =
+				bms_add_member(child_keycols,
+							   partattno - FirstLowInvalidHeapAttributeNumber);
+	}
+	foreach(lc, partexprs)
+	{
+		Node	   *expr = (Node *) lfirst(lc);
+
+		pull_varattnos(expr, 1, &child_keycols);
+	}
+
+	/*
+	 * Next, work out how to convert from the attribute numbers for the child
+	 * to the attribute numbers for the parent.
+	 */
+	map =
+		convert_tuples_by_name_map(RelationGetDescr(parent),
+								   RelationGetDescr(rel),
+								   gettext_noop("could not convert row type"));
+
+	/*
+	 * For each child key column we have identified, translate to the
+	 * corresponding parent key column.  Entry 0 in the map array corresponds
+	 * to attribute number 1, which corresponds to a bitmapset entry for 1 -
+	 * FirstLowInvalidHeapAttributeNumber.
+	 */
+	while ((child_keycol = bms_next_member(child_keycols, child_keycol)) >= 0)
+	{
+		int			kc = child_keycol + FirstLowInvalidHeapAttributeNumber;
+
+		Assert(kc > 0 && kc <= RelationGetNumberOfAttributes(rel));
+		*bitmapset =
+			bms_add_member(*bitmapset,
+						   map[kc - 1] - FirstLowInvalidHeapAttributeNumber);
+	}
+
+	/* Release memory. */
+	pfree(map);
+}
+
+/*
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index ebaccfb..6adac80 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2466,6 +2466,8 @@ CopyFrom(CopyState cstate)
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(cstate->rel,
+									   NULL,
+									   0,
 									   1,
 									   estate,
 									   &partition_dispatch_info,
@@ -2736,7 +2738,7 @@ CopyFrom(CopyState cstate)
 
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr || check_partition_constr)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index e75a59d..31de746 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2854,8 +2854,13 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 	{
 		HeapTuple	trigtuple;
 
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
+		/*
+		 * Note: if the UPDATE is converted into a DELETE+INSERT as part of an
+		 * update-partition-key operation, then this function is also called
+		 * separately for DELETE and INSERT to capture transition table rows.
+		 * In such a case, either the old tuple or the new tuple can be NULL.
+		 */
+		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
 			trigtuple = GetTupleForTrigger(estate,
 										   NULL,
 										   relinfo,
@@ -5428,7 +5433,12 @@ AfterTriggerPendingOnRel(Oid relid)
  *	triggers actually need to be queued.  It is also called after each row,
  *	even if there are no triggers for that event, if there are any AFTER
  *	STATEMENT triggers for the statement which use transition tables, so that
- *	the transition tuplestores can be built.
+ *	the transition tuplestores can be built. Furthermore, if transition
+ *	capture is happening for UPDATEd rows being moved to another partition
+ *	due to a partition-key change, then this function is called once when
+ *	the row is deleted (to capture the OLD row), and once when the row is
+ *	inserted into another partition (to capture the NEW row). This is done
+ *	separately because the DELETE and INSERT happen on different tables.
  *
  *	Transition tuplestores are built now, rather than when events are pulled
  *	off of the queue because AFTER ROW triggers are allowed to select from the
@@ -5477,12 +5487,27 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		bool		update_new_table = transition_capture->tcs_update_new_table;
 		bool		insert_new_table = transition_capture->tcs_insert_new_table;;
 
-		if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_old_table))
+		/*
+		 * For capturing transition tuples for UPDATE events fired during
+		 * partition row movement, either oldtup or newtup can be NULL,
+		 * depending on whether the event is for the row being deleted from the
+		 * old partition or for the row being inserted into the new partition.
+		 * But in any case, oldtup should always be non-NULL for DELETE events,
+		 * and newtup should be non-NULL for INSERT events, because for
+		 * transition capture with partition row movement, INSERT and DELETE
+		 * events don't fire; only the UPDATE event is fired.
+		 */
+		Assert(!(event == TRIGGER_EVENT_DELETE && delete_old_table &&
+				 oldtup == NULL));
+		Assert(!(event == TRIGGER_EVENT_INSERT && insert_new_table &&
+				 newtup == NULL));
+
+		if (oldtup != NULL &&
+			((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_old_table)))
 		{
 			Tuplestorestate *old_tuplestore;
 
-			Assert(oldtup != NULL);
 			old_tuplestore = transition_capture->tcs_private->old_tuplestore;
 
 			if (map != NULL)
@@ -5495,12 +5520,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			else
 				tuplestore_puttuple(old_tuplestore, oldtup);
 		}
-		if ((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_new_table))
+		if (newtup != NULL &&
+			((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
+			(event == TRIGGER_EVENT_UPDATE && update_new_table)))
 		{
 			Tuplestorestate *new_tuplestore;
 
-			Assert(newtup != NULL);
 			new_tuplestore = transition_capture->tcs_private->new_tuplestore;
 
 			if (original_insert_tuple != NULL)
@@ -5520,7 +5545,8 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		if (trigdesc == NULL ||
 			(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
 			(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+			(event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
 			return;
 	}
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 995b580..d48da8e 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -104,9 +104,6 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
 									 int maxfieldlen);
 static void EvalPlanQualStart(EPQState *epqstate, EState *parentestate,
 				  Plan *planTree);
-static void ExecPartitionCheck(ResultRelInfo *resultRelInfo,
-				   TupleTableSlot *slot, EState *estate);
-
 /*
  * Note that GetUpdatedColumns() also exists in commands/trigger.c.  There does
  * not appear to be any good header to put it into, given the structures that
@@ -1851,15 +1848,10 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
 /*
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  */
-static void
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
-	Relation	rel = resultRelInfo->ri_RelationDesc;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	Bitmapset  *modifiedCols;
-	Bitmapset  *insertedCols;
-	Bitmapset  *updatedCols;
 	ExprContext *econtext;
 
 	/*
@@ -1887,52 +1879,66 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
 	 */
-	if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+	return ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext);
+}
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
+
+	/* See the comments in ExecConstraints. */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
+									 gettext_noop("could not convert row type"));
+		if (map != NULL)
 		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
-										 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecSetSlotDescriptor(slot, tupdesc);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+			tuple = do_convert_tuple(tuple, map);
+			ExecSetSlotDescriptor(slot, tupdesc);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("new row for relation \"%s\" violates partition constraint",
-						RelationGetRelationName(orig_rel)),
-				 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+			 errmsg("new row for relation \"%s\" violates partition constraint",
+					RelationGetRelationName(orig_rel)),
+			 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if requested,
+ * checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1940,7 +1946,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2056,8 +2063,9 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 }
 
 
@@ -3243,34 +3251,40 @@ EvalPlanQualEnd(EPQState *epqstate)
  * ExecSetupPartitionTupleRouting - set up information needed during
  * tuple routing for partitioned tables
  *
+ * 'update_rri' contains the UPDATE per-subplan result rels.
+ * 'num_update_rri' is the number of UPDATE per-subplan result rels; for
+ *      INSERT, this is 0.
+ *
  * Output arguments:
  * 'pd' receives an array of PartitionDispatch objects with one entry for
  *		every partitioned table in the partition tree
  * 'partitions' receives an array of ResultRelInfo* objects with one entry for
  *		every leaf partition in the partition tree
- * 'tup_conv_maps' receives an array of TupleConversionMap objects with one
- *		entry for every leaf partition (required to convert input tuple based
- *		on the root table's rowtype to a leaf partition's rowtype after tuple
- *		routing is done)
+ * 'perleaf_parentchild_maps' receives an array of TupleConversionMap objects
+ *		with one entry for every leaf partition (required to convert an input
+ *		tuple based on the root table's rowtype to a leaf partition's rowtype
+ *		after tuple routing is done)
  * 'partition_tuple_slot' receives a standalone TupleTableSlot to be used
  *		to manipulate any given leaf partition's rowtype after that partition
  *		is chosen by tuple-routing.
  * 'num_parted' receives the number of partitioned tables in the partition
  *		tree (= the number of entries in the 'pd' output array)
  * 'num_partitions' receives the number of leaf partitions in the partition
- *		tree (= the number of entries in the 'partitions' and 'tup_conv_maps'
- *		output arrays
+ *		tree (= the number of entries in the 'partitions' and
+ *		'perleaf_parentchild_maps' output arrays
  *
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  */
 void
 ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionDispatch **pd,
 							   ResultRelInfo ***partitions,
-							   TupleConversionMap ***tup_conv_maps,
+							   TupleConversionMap ***perleaf_parentchild_maps,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions)
 {
@@ -3278,8 +3292,9 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
 	ResultRelInfo *leaf_part_arr;
+	ResultRelInfo *cur_update_rri;
+	Oid			cur_reloid = InvalidOid;
 
 	/*
 	 * Get the information about the partition tree after locking all the
@@ -3290,20 +3305,35 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	*num_partitions = list_length(leaf_parts);
 	*partitions = (ResultRelInfo **) palloc(*num_partitions *
 											sizeof(ResultRelInfo *));
-	*tup_conv_maps = (TupleConversionMap **) palloc0(*num_partitions *
-													 sizeof(TupleConversionMap *));
+	*perleaf_parentchild_maps = (TupleConversionMap **) palloc0(*num_partitions *
+																sizeof(TupleConversionMap *));
 
-	/*
-	 * For inserts, we need to create all new result rels, so avoid
-	 * repeated pallocs by allocating memory for all the result rels in
-	 * bulk.
-	 *
-	 * XXX: In the future when we support update tuple routing, we will be
-	 * re-using the per-plan update result rels and thus avoid a new result
-	 * rel.
-	 */
-	leaf_part_arr = (ResultRelInfo *) palloc0(*num_partitions *
-											  sizeof(ResultRelInfo));
+	if (num_update_rri != 0)
+	{
+		/*
+		 * For Updates, if the leaf partition is already present in the
+		 * per-subplan result rels, we re-use that rather than initialize a
+		 * new result rel. The per-subplan resultrels and the resultrels of
+		 * the leaf partitions are both in the same canonical order. So while
+		 * going through the leaf partition oids, we need to keep track of the
+		 * next per-subplan result rel to be looked for in the leaf partition
+		 * resultrels. So, set the position of cur_update_rri to the first
+		 * per-subplan result rel, and then shift it as we find them one by
+		 * one while scanning the leaf partition oids.
+		 */
+		cur_update_rri = update_rri;
+		cur_reloid = RelationGetRelid(cur_update_rri->ri_RelationDesc);
+	}
+	else
+	{
+		/*
+		 * For inserts, we need to create all new result rels, so avoid
+		 * repeated pallocs by allocating memory for all the result rels in
+		 * bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(*num_partitions *
+												  sizeof(ResultRelInfo));
+	}
 
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
@@ -3313,36 +3343,83 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	 */
 	*partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = leaf_part_arr;
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (num_update_rri != 0)
+		{
+			/* Is this leaf partition present in the UPDATE resultrels? */
+			if (cur_reloid == leaf_oid)
+			{
+				Assert(cur_update_rri <= update_rri + num_update_rri - 1);
+
+				leaf_part_rri = cur_update_rri;
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required when converting a tuple as per the root
+				 * partition's tuple descriptor. It was not set when
+				 * generating the update plans.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+
+				cur_update_rri++;
+
+				/*
+				 * If this was the last UPDATE resultrel, indicate that by
+				 * invalidating the cur_reloid.
+				 */
+				if (cur_update_rri == update_rri + num_update_rri)
+					cur_reloid = InvalidOid;
+				else
+					cur_reloid = RelationGetRelid(cur_update_rri->ri_RelationDesc);
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = leaf_part_arr + i;
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in *partitions are eventually
-		 * closed by the caller.
+		 * If we didn't open the partition rel, it means we haven't
+		 * initialized the result rel either.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above including the leaf
+			 * partitions. Note that each of the newly opened relations in
+			 * *partitions are eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri,
+							  partrel,
+							  resultRTindex,
+							  rel,
+							  estate->es_instrument);
+		}
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
 		 * Save a tuple conversion map to convert a tuple routed to this
 		 * partition from the parent's type to the partition's.
 		 */
-		(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
-													 gettext_noop("could not convert row type"));
-
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  resultRTindex,
-						  rel,
-						  estate->es_instrument);
+		(*perleaf_parentchild_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
+																gettext_noop("could not convert row type"));
 
 		/*
-		 * Verify result relation is a valid target for INSERT.
+		 * Verify that the result relation is a valid target for an INSERT.
+		 * Even for UPDATEs, tuple routing ends in an insert into the chosen
+		 * partition, so we must check validity for INSERT here as well.
 		 */
 		CheckValidResultRel(leaf_part_rri, CMD_INSERT);
 
@@ -3358,9 +3435,18 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		(*partitions)[i] = leaf_part_rri++;
+		(*partitions)[i] = leaf_part_rri;
 		i++;
 	}
+
+	/*
+	 * For UPDATE, we should have found all the per-subplan resultrels among
+	 * the leaf partitions; so cur_update_rri should now be positioned just
+	 * past the last per-subplan resultrel.
+	 */
+	Assert(num_update_rri == 0 ||
+		   (cur_reloid == InvalidOid &&
+			cur_update_rri == update_rri + num_update_rri));
 }
 
 /*
@@ -3386,8 +3472,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 	result = get_partition_for_tuple(pd, slot, estate,
 									 &failed_at, &failed_slot);
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 5a75e02..6b8af46 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -401,7 +401,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -466,7 +466,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 96c464e..efb8bfd 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -45,6 +45,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -63,6 +64,11 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 bool canSetTag,
 					 TupleTableSlot **returning);
 
+static HeapTuple ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+										   TupleConversionMap *map,
+										   HeapTuple tuple,
+										   TupleTableSlot *new_slot,
+										   TupleTableSlot **p_old_slot);
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
  * target relation's rowtype
@@ -240,6 +246,38 @@ ExecCheckTIDVisible(EState *estate,
 	ReleaseBuffer(buffer);
 }
 
+/*
+ * ConvertPartitionTupleSlot -- convenience function for converting the given
+ * tuple and storing it into the tuple slot provided through 'new_slot', which
+ * typically should be one of the dedicated partition tuple slots. Passes the
+ * partition tuple slot back through the output param p_old_slot. If no mapping
+ * is present, p_old_slot is left unchanged.
+ *
+ * Returns the converted tuple.
+ */
+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+						  TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot,
+						  TupleTableSlot **p_old_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor to match the converted tuple.
+	 */
+	*p_old_slot = new_slot;
+	Assert(new_slot != NULL);
+	ExecSetSlotDescriptor(new_slot, map->outdesc);
+	ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
 /* ----------------------------------------------------------------
  *		ExecInsert
  *
@@ -265,6 +303,7 @@ ExecInsert(ModifyTableState *mtstate,
 	Oid			newId;
 	List	   *recheckIndexes = NIL;
 	TupleTableSlot *result = NULL;
+	TransitionCaptureState *transition_capture = mtstate->mt_transition_capture;
 
 	/*
 	 * get the heap tuple out of the tuple table slot, making sure we have a
@@ -281,17 +320,49 @@ ExecInsert(ModifyTableState *mtstate,
 	if (mtstate->mt_partition_dispatch_info)
 	{
 		int			leaf_part_index;
-		TupleConversionMap *map;
+		ResultRelInfo *rootResultRelInfo;
+
+		/*
+		 * If the original operation is UPDATE, the root partition rel needs
+		 * to be fetched from mtstate->rootResultRelInfo.
+		 */
+		rootResultRelInfo = (mtstate->rootResultRelInfo ?
+							 mtstate->rootResultRelInfo : resultRelInfo);
+
+		/*
+		 * If the resultRelInfo is not the root partition (which happens for
+		 * UPDATE), we should convert the tuple into the root partition's
+		 * tuple descriptor, since ExecFindPartition() starts from the root.
+		 * The tuple conversion map list is in the order of
+		 * mtstate->resultRelInfo[], so to retrieve the one for this
+		 * resultRel, we need to know the position of the resultRel in
+		 * mtstate->resultRelInfo[].
+		 */
+		if (rootResultRelInfo != resultRelInfo &&
+			mtstate->mt_persubplan_childparent_maps != NULL)
+		{
+			int			map_index = resultRelInfo - mtstate->resultRelInfo;
+
+			/* resultRelInfo must be one of the per-subplan result rels. */
+			Assert(resultRelInfo >= mtstate->resultRelInfo &&
+				   resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1);
+
+			tuple = ConvertPartitionTupleSlot(mtstate,
+											  mtstate->mt_persubplan_childparent_maps[map_index],
+											  tuple,
+											  mtstate->mt_rootpartition_tuple_slot,
+											  &slot);
+		}
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
 		 * ExecFindPartition() does not return and errors out instead.
 		 * Otherwise, the returned value is to be used as an index into arrays
-		 * mt_partitions[] and mt_partition_tupconv_maps[] that will get us
+		 * mt_partitions[] and mt_perleaf_parentchild_maps[] that will get us
 		 * the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
-		leaf_part_index = ExecFindPartition(resultRelInfo,
+		leaf_part_index = ExecFindPartition(rootResultRelInfo,
 											mtstate->mt_partition_dispatch_info,
 											slot,
 											estate);
@@ -331,7 +402,7 @@ ExecInsert(ModifyTableState *mtstate,
 				 */
 				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
 				mtstate->mt_transition_capture->tcs_map =
-					mtstate->mt_transition_tupconv_maps[leaf_part_index];
+					mtstate->mt_perleaf_childparent_maps[leaf_part_index];
 			}
 			else
 			{
@@ -345,29 +416,17 @@ ExecInsert(ModifyTableState *mtstate,
 		}
 		if (mtstate->mt_oc_transition_capture != NULL)
 			mtstate->mt_oc_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[leaf_part_index];
+				mtstate->mt_perleaf_childparent_maps[leaf_part_index];
 
 		/*
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = mtstate->mt_partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = mtstate->mt_partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(mtstate,
+										  mtstate->mt_perleaf_parentchild_maps[leaf_part_index],
+										  tuple,
+										  mtstate->mt_partition_tuple_slot,
+										  &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -485,7 +544,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 		/* Check the constraints of the tuple */
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -621,9 +680,31 @@ ExecInsert(ModifyTableState *mtstate,
 		setLastTid(&(tuple->t_self));
 	}
 
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition NEW TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_new_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo, NULL,
+							 NULL,
+							 tuple,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * Now that we have already captured the NEW TABLE row, the AR INSERT
+		 * trigger fired below must not capture it again.
+		 */
+		transition_capture = NULL;
+	}
+
 	/* AFTER ROW INSERT Triggers */
 	ExecARInsertTriggers(estate, resultRelInfo, tuple, recheckIndexes,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
 	list_free(recheckIndexes);
 
@@ -677,6 +758,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool *delete_skipped,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -684,6 +767,10 @@ ExecDelete(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
+	TransitionCaptureState *transition_capture = mtstate->mt_transition_capture;
+
+	if (delete_skipped)
+		*delete_skipped = true;
 
 	/*
 	 * get information on the (current) result relation
@@ -848,12 +935,39 @@ ldelete:;
 	if (canSetTag)
 		(estate->es_processed)++;
 
+	/* The delete has actually happened, so inform the caller */
+	if (delete_skipped)
+		*delete_skipped = false;
+
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition OLD TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_old_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo,
+							 tupleid,
+							 oldtuple,
+							 NULL,
+							 NULL,
+							 transition_capture);
+
+		/*
+		 * Now that we have already captured the OLD TABLE row, the AR DELETE
+		 * trigger fired below must not capture it again.
+		 */
+		transition_capture = NULL;
+	}
+
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -946,6 +1060,7 @@ ExecUpdate(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	TupleConversionMap *saved_tcs_map = NULL;
 
 	/*
 	 * abort the operation if not running transactions
@@ -1042,12 +1157,82 @@ lreplace:;
 								 resultRelInfo, slot, estate);
 
 		/*
+		 * If a partition check fails, try to move the row into the right
+		 * partition.
+		 */
+		if (resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate))
+		{
+			bool		delete_skipped;
+			TupleTableSlot *ret_slot;
+
+			/*
+			 * When an UPDATE is run directly on a leaf partition, we will not
+			 * have partition tuple routing set up. In that case, fail with a
+			 * partition constraint violation error.
+			 */
+			if (mtstate->mt_partition_dispatch_info == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for the DELETE; we want rows to be
+			 * returned from the INSERT instead.
+			 */
+			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
+					   &delete_skipped, false, false);
+
+			/*
+			 * If for some reason the DELETE didn't happen (e.g. a trigger
+			 * prevented it, or the row was already deleted by this command,
+			 * or it was concurrently deleted by another transaction), then
+			 * we should skip the INSERT as well; otherwise there would
+			 * effectively be one new row inserted.
+			 *
+			 * For a normal UPDATE, the case where the tuple has been the
+			 * subject of a concurrent UPDATE or DELETE would be handled by
+			 * the EvalPlanQual machinery, but for an UPDATE that we've
+			 * translated into a DELETE from this partition and an INSERT into
+			 * some other partition, that's not available, because CTID chains
+			 * can't span relation boundaries.  We mimic the semantics to a
+			 * limited extent by skipping the INSERT if the DELETE fails to
+			 * find a tuple. This ensures that two concurrent attempts to
+			 * UPDATE the same tuple at the same time can't turn one tuple
+			 * into two, and that an UPDATE of a just-deleted tuple can't
+			 * resurrect it.
+			 */
+			if (delete_skipped)
+				return NULL;
+
+			if (mtstate->mt_transition_capture)
+				saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
+
+			ret_slot = ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+
+			if (mtstate->mt_transition_capture)
+			{
+				/*
+				 * Revert to the transition capture map created for UPDATE;
+				 * otherwise the next UPDATE will incorrectly use the one
+				 * created for INSERT.
+				 */
+				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+				mtstate->mt_transition_capture->tcs_map = saved_tcs_map;
+			}
+			return ret_slot;
+		}
+
+		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We have already checked partition constraints above, so skip them
+		 * below.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1468,6 +1653,45 @@ fireASTriggers(ModifyTableState *node)
 }
 
 /*
+ * Set up per-subplan tuple conversion maps from child partitions to the root
+ * partitioned table. The maps are needed for collecting transition tuples for
+ * AFTER triggers, and for UPDATE row movement.
+ */
+static void
+ExecSetupPerSubplanChildParentMap(ModifyTableState *mtstate)
+{
+	TupleConversionMap **tup_conv_maps;
+	TupleDesc	outdesc;
+	ResultRelInfo *resultRelInfo;
+	ResultRelInfo *rootRelInfo;
+	int			nplans = mtstate->mt_nplans;
+	int			i;
+
+	Assert(mtstate->operation != CMD_INSERT);
+
+	if (mtstate->mt_persubplan_childparent_maps != NULL)
+		return;
+
+	rootRelInfo = getASTriggerResultRelInfo(mtstate);
+
+	mtstate->mt_persubplan_childparent_maps =
+		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) * nplans);
+
+	/* Get tuple descriptor of the root partition. */
+	outdesc = RelationGetDescr(rootRelInfo->ri_RelationDesc);
+
+	resultRelInfo = mtstate->resultRelInfo;
+	tup_conv_maps = mtstate->mt_persubplan_childparent_maps;
+	for (i = 0; i < nplans; i++)
+	{
+		TupleDesc	indesc = RelationGetDescr(resultRelInfo[i].ri_RelationDesc);
+
+		tup_conv_maps[i] = convert_tuples_by_name(indesc, outdesc,
+												  gettext_noop("could not convert row type"));
+	}
+}
+
+/*
  * Set up the state needed for collecting transition tuples for AFTER
  * triggers.
  */
@@ -1475,6 +1699,11 @@ static void
 ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 {
 	ResultRelInfo *targetRelInfo = getASTriggerResultRelInfo(mtstate);
+	ResultRelInfo **resultRelInfos;
+	int			numResultRelInfos;
+	int			update_rri_index = -1;
+	ResultRelInfo *update_rri = mtstate->resultRelInfo;
+	Oid			cur_reloid = InvalidOid;
 	int			i;
 
 	/* Check for transition tables on the directly targeted relation. */
@@ -1489,71 +1718,108 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 									   RelationGetRelid(targetRelInfo->ri_RelationDesc),
 									   CMD_UPDATE);
 
+	if (mtstate->mt_transition_capture == NULL &&
+		mtstate->mt_oc_transition_capture == NULL)
+		return;
+
 	/*
-	 * If we found that we need to collect transition tuples then we may also
+	 * Now that we know that we need to collect transition tuples, we may also
 	 * need tuple conversion maps for any children that have TupleDescs that
 	 * aren't compatible with the tuplestores.  (We can share these maps
 	 * between the regular and ON CONFLICT cases.)
 	 */
-	if (mtstate->mt_transition_capture != NULL ||
-		mtstate->mt_oc_transition_capture != NULL)
+
+	/* Make sure per-subplan mapping is there. */
+	if (mtstate->operation != CMD_INSERT)
+		ExecSetupPerSubplanChildParentMap(mtstate);
+
+	/*
+	 * Install the conversion map for the first plan for UPDATE and DELETE
+	 * operations.  It will be advanced each time we switch to the next
+	 * plan.  (INSERT operations set it every time, so we need not update
+	 * mtstate->mt_oc_transition_capture here.)
+	 */
+	if (mtstate->mt_transition_capture &&
+		mtstate->mt_persubplan_childparent_maps)
 	{
-		int			numResultRelInfos;
+		mtstate->mt_transition_capture->tcs_map =
+			mtstate->mt_persubplan_childparent_maps[0];
+	}
 
-		numResultRelInfos = (mtstate->mt_partition_tuple_slot != NULL ?
-							 mtstate->mt_num_partitions :
-							 mtstate->mt_nplans);
+	/* If no tuple routing, return without setting up per-leaf-partition map */
+	if (mtstate->mt_partition_dispatch_info == NULL)
+		return;
 
-		/*
-		 * Build array of conversion maps from each child's TupleDesc to the
-		 * one used in the tuplestore.  The map pointers may be NULL when no
-		 * conversion is necessary, which is hopefully a common case for
-		 * partitions.
-		 */
-		mtstate->mt_transition_tupconv_maps = (TupleConversionMap **)
-			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+	numResultRelInfos = mtstate->mt_num_partitions;
+	resultRelInfos = mtstate->mt_partitions;
 
-		/* Choose the right set of partitions */
-		if (mtstate->mt_partition_dispatch_info != NULL)
+	/*
+	 * Build array of conversion maps from each child's TupleDesc to the one
+	 * used in the tuplestore.  The map pointers may be NULL when no
+	 * conversion is necessary, which is hopefully a common case for
+	 * partitions.
+	 */
+	mtstate->mt_perleaf_childparent_maps = (TupleConversionMap **)
+		palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+
+	/* For INSERT, just create all the map elements anew. */
+	if (mtstate->operation == CMD_INSERT)
+	{
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_perleaf_childparent_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
+									   RelationGetDescr(targetRelInfo->ri_RelationDesc),
+									   gettext_noop("could not convert row type"));
+		}
+		return;
+	}
+
+	/*
+	 * For UPDATE, however, we can share the per-subplan maps with the
+	 * per-leaf maps.
+	 */
+	update_rri_index = 0;
+	update_rri = mtstate->resultRelInfo;
+	if (mtstate->mt_nplans > 0)
+		cur_reloid = RelationGetRelid(update_rri[0].ri_RelationDesc);
+
+	for (i = 0; i < numResultRelInfos; ++i)
+	{
+		ResultRelInfo *resultRelInfo = mtstate->mt_partitions[i];
+
+		/* Is this leaf partition present among the UPDATE result rels? */
+		if (cur_reloid == RelationGetRelid(resultRelInfo->ri_RelationDesc))
 		{
+			Assert(update_rri_index < mtstate->mt_nplans);
+
+			mtstate->mt_perleaf_childparent_maps[i] =
+				mtstate->mt_persubplan_childparent_maps[update_rri_index];
+			update_rri_index++;
+
 			/*
-			 * For tuple routing among partitions, we need TupleDescs based
-			 * on the partition routing table.
+			 * If this was the last UPDATE resultrel, indicate that by
+			 * invalidating the cur_reloid.
 			 */
-			ResultRelInfo **resultRelInfos = mtstate->mt_partitions;
-
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
+			if (update_rri_index == mtstate->mt_nplans)
+				cur_reloid = InvalidOid;
+			else
+				cur_reloid = RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc);
 		}
 		else
 		{
-			/* Otherwise we need the ResultRelInfo for each subplan. */
-			ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
-
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
+			mtstate->mt_perleaf_childparent_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfo->ri_RelationDesc),
+									   RelationGetDescr(targetRelInfo->ri_RelationDesc),
+									   gettext_noop("could not convert row type"));
 		}
-
-		/*
-		 * Install the conversion map for the first plan for UPDATE and DELETE
-		 * operations.  It will be advanced each time we switch to the next
-		 * plan.  (INSERT operations set it every time, so we need not update
-		 * mtstate->mt_oc_transition_capture here.)
-		 */
-		if (mtstate->mt_transition_capture)
-			mtstate->mt_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[0];
 	}
+
+	/*
+	 * We should have found all the per-subplan reloids in the leaf
+	 * partitions.
+	 */
+	Assert(update_rri_index == mtstate->mt_nplans);
 }
 
 /* ----------------------------------------------------------------
@@ -1659,15 +1925,15 @@ ExecModifyTable(PlanState *pstate)
 				/* Prepare to convert transition tuples from this child. */
 				if (node->mt_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
+					Assert(node->mt_persubplan_childparent_maps != NULL);
 					node->mt_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						node->mt_persubplan_childparent_maps[node->mt_whichplan];
 				}
 				if (node->mt_oc_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
+					Assert(node->mt_persubplan_childparent_maps != NULL);
 					node->mt_oc_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						node->mt_persubplan_childparent_maps[node->mt_whichplan];
 				}
 				continue;
 			}
@@ -1783,7 +2049,8 @@ ExecModifyTable(PlanState *pstate)
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1828,9 +2095,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ResultRelInfo *resultRelInfo;
 	TupleDesc	tupDesc;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	bool		update_tuple_routing_needed = node->part_cols_updated;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -1903,6 +2173,15 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(resultRelInfo, mtstate->mt_onconflict != ONCONFLICT_NONE);
 
+		/*
+		 * If this is an UPDATE with a BEFORE UPDATE row trigger, the trigger
+		 * may change the partition key, so we may need update tuple routing.
+		 */
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+			operation == CMD_UPDATE)
+			update_tuple_routing_needed = true;
+
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
 		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
@@ -1940,34 +2219,64 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/* Decide whether we need to perform update tuple routing. */
+	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		update_tuple_routing_needed = false;
+
+	/*
+	 * Build state for tuple routing if it's an INSERT or if it's an UPDATE of
+	 * partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || update_tuple_routing_needed))
 	{
 		PartitionDispatch *partition_dispatch_info;
 		ResultRelInfo **partitions;
-		TupleConversionMap **partition_tupconv_maps;
+		TupleConversionMap **perleaf_parentchild_maps;
 		TupleTableSlot *partition_tuple_slot;
 		int			num_parted,
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(rel,
+									   (operation == CMD_UPDATE ?
+										mtstate->resultRelInfo : NULL),
+									   (operation == CMD_UPDATE ? nplans : 0),
 									   node->nominalRelation,
 									   estate,
 									   &partition_dispatch_info,
 									   &partitions,
-									   &partition_tupconv_maps,
+									   &perleaf_parentchild_maps,
 									   &partition_tuple_slot,
 									   &num_parted, &num_partitions);
 		mtstate->mt_partition_dispatch_info = partition_dispatch_info;
 		mtstate->mt_num_dispatch = num_parted;
 		mtstate->mt_partitions = partitions;
 		mtstate->mt_num_partitions = num_partitions;
-		mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
+		mtstate->mt_perleaf_parentchild_maps = perleaf_parentchild_maps;
 		mtstate->mt_partition_tuple_slot = partition_tuple_slot;
+		mtstate->mt_rootpartition_tuple_slot = MakeTupleTableSlot();
+
+		/*
+		 * Below are required as reference objects for mapping partition
+		 * attno's in expressions such as WCO and RETURNING.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	}
 
 	/*
+	 * Construct a mapping from each of the per-subplan resultRelInfo attnos
+	 * to the root attnos. This is required during UPDATE row movement when
+	 * the tuple descriptor of a source partition does not match the root
+	 * partition's descriptor; in that case we must convert tuples to the
+	 * root tuple descriptor, because the search for the destination
+	 * partition starts from the root. Skip this setup if update tuple
+	 * routing is not needed.
+	 */
+	if (update_tuple_routing_needed)
+		ExecSetupPerSubplanChildParentMap(mtstate);
+
+	/*
 	 * Build state for collecting transition tuples.  This requires having a
 	 * valid trigger query context, so skip it in explain-only mode.
 	 */
@@ -2004,50 +2313,62 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE row
+	 * movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *firstWco;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition. Whereas for UPDATE, there are as many WCOs as there are
+		 * plans. So in either case, use the WCO expression of the first
+		 * resultRelInfo as a reference to calculate attno's for the WCO
+		 * expression of each of the partitions. We make a copy of the WCO
+		 * qual for each partition. Note that, if there are SubPlans in there,
+		 * they all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
-			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
+		Assert(update_tuple_routing_needed ||
+			   (operation == CMD_INSERT &&
+				list_length(node->withCheckOptionLists) == 1 &&
+				mtstate->mt_nplans == 1));
+
+		firstWco = linitial(node->withCheckOptionLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
 			Relation	partrel;
-			List	   *mapped_wcoList;
+			List	   *mappedWco;
 			List	   *wcoExprs = NIL;
 			ListCell   *ll;
 
 			resultRelInfo = mtstate->mt_partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the update
+			 * result rels, that result rel would already have WithCheckOptions
+			 * initialized.
+			 */
+			if (resultRelInfo->ri_WithCheckOptions)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel, NULL);
-			foreach(ll, mapped_wcoList)
+			mappedWco = map_partition_varattnos(firstWco,
+												firstVarno,
+												partrel, firstResultRel,
+												NULL);
+			foreach(ll, mappedWco)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
 
-			resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
+			resultRelInfo->ri_WithCheckOptions = mappedWco;
 			resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
 		}
 	}
@@ -2059,7 +2380,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -2096,26 +2417,38 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *rlist;
 
 			resultRelInfo = mtstate->mt_partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the update
+			 * result rels, that result rel would already have a returningList
+			 * built.
+			 */
+			if (resultRelInfo->ri_projectReturning)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel, NULL);
+			/*
+			 * Use the returning expression of the first resultRelInfo as a
+			 * reference to calculate attno's for the returning expression of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList,
+											firstVarno,
+											partrel, firstResultRel, NULL);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
-			resultRelInfo++;
 		}
 	}
 	else
@@ -2357,6 +2690,7 @@ void
 ExecEndModifyTable(ModifyTableState *node)
 {
 	int			i;
+	CmdType		operation = node->operation;
 
 	/*
 	 * Allow any FDWs to shut down
@@ -2391,11 +2725,23 @@ ExecEndModifyTable(ModifyTableState *node)
 	{
 		ResultRelInfo *resultRelInfo = node->mt_partitions[i];
 
+		/*
+		 * If this result rel is one of the subplan result rels, let
+		 * ExecEndPlan() close it. For INSERT, this does not apply, because
+		 * all leaf partition result rels are newly allocated anyway.
+		 */
+		if (operation == CMD_UPDATE &&
+			resultRelInfo >= node->resultRelInfo &&
+			resultRelInfo < node->resultRelInfo + node->mt_nplans)
+			continue;
+
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
 
-	/* Release the standalone partition tuple descriptor, if any */
+	/* Release the standalone partition tuple descriptors, if any */
+	if (node->mt_rootpartition_tuple_slot)
+		ExecDropSingleTupleTableSlot(node->mt_rootpartition_tuple_slot);
 	if (node->mt_partition_tuple_slot)
 		ExecDropSingleTupleTableSlot(node->mt_partition_tuple_slot);
 
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index c1a83ca..d8caa5ac 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -204,6 +204,7 @@ _copyModifyTable(const ModifyTable *from)
 	COPY_SCALAR_FIELD(canSetTag);
 	COPY_SCALAR_FIELD(nominalRelation);
 	COPY_NODE_FIELD(partitioned_rels);
+	COPY_SCALAR_FIELD(part_cols_updated);
 	COPY_NODE_FIELD(resultRelations);
 	COPY_SCALAR_FIELD(resultRelIndex);
 	COPY_SCALAR_FIELD(rootResultRelIndex);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 2532edc..77066e2 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -367,6 +367,7 @@ _outModifyTable(StringInfo str, const ModifyTable *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_INT_FIELD(resultRelIndex);
 	WRITE_INT_FIELD(rootResultRelIndex);
@@ -2095,6 +2096,7 @@ _outModifyTablePath(StringInfo str, const ModifyTablePath *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_NODE_FIELD(subroots);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 07ba691..4eec17f 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1561,6 +1561,7 @@ _readModifyTable(void)
 	READ_BOOL_FIELD(canSetTag);
 	READ_UINT_FIELD(nominalRelation);
 	READ_NODE_FIELD(partitioned_rels);
+	READ_BOOL_FIELD(part_cols_updated);
 	READ_NODE_FIELD(resultRelations);
 	READ_INT_FIELD(resultRelIndex);
 	READ_INT_FIELD(rootResultRelIndex);
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 2821662..85e3126 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -277,6 +277,7 @@ static ProjectSet *make_project_set(List *tlist, Plan *subplan);
 static ModifyTable *make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
@@ -2361,6 +2362,7 @@ create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 							best_path->canSetTag,
 							best_path->nominalRelation,
 							best_path->partitioned_rels,
+							best_path->part_cols_updated,
 							best_path->resultRelations,
 							subplans,
 							best_path->withCheckOptionLists,
@@ -6405,6 +6407,7 @@ static ModifyTable *
 make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam)
@@ -6431,6 +6434,7 @@ make_modifytable(PlannerInfo *root,
 	node->canSetTag = canSetTag;
 	node->nominalRelation = nominalRelation;
 	node->partitioned_rels = partitioned_rels;
+	node->part_cols_updated = part_cols_updated;
 	node->resultRelations = resultRelations;
 	node->resultRelIndex = -1;	/* will be set correctly in setrefs.c */
 	node->rootResultRelIndex = -1;	/* will be set correctly in setrefs.c */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 7f146d6..fece5df 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -111,6 +111,10 @@ typedef struct
 /* Local functions */
 static Node *preprocess_expression(PlannerInfo *root, Node *expr, int kind);
 static void preprocess_qual_conditions(PlannerInfo *root, Node *jtnode);
+static void get_all_partition_cols(List *rtables,
+					   Index root_rti,
+					   List *partitioned_rels,
+					   Bitmapset **all_part_cols);
 static void inheritance_planner(PlannerInfo *root);
 static void grouping_planner(PlannerInfo *root, bool inheritance_update,
 				 double tuple_fraction);
@@ -1016,6 +1020,40 @@ preprocess_phv_expression(PlannerInfo *root, Expr *expr)
 }
 
 /*
+ * get_all_partition_cols
+ *	  Get the attribute numbers of all partition key columns of all the
+ *	  partitioned tables.
+ *
+ * The child partitions' attribute numbers are all converted to the
+ * corresponding attribute numbers of the root partitioned table.
+ */
+static void
+get_all_partition_cols(List *rtables,
+					   Index root_rti,
+					   List *partitioned_rels,
+					   Bitmapset **all_part_cols)
+{
+	ListCell   *lc;
+	Oid			root_relid = getrelid(root_rti, rtables);
+	Relation	root_rel;
+
+	/* The caller must have already locked all the partitioned tables. */
+	root_rel = heap_open(root_relid, NoLock);
+	*all_part_cols = NULL;
+	foreach(lc, partitioned_rels)
+	{
+		Index		rti = lfirst_int(lc);
+		Oid			relid = getrelid(rti, rtables);
+		Relation	part_rel = heap_open(relid, NoLock);
+
+		pull_child_partition_columns(all_part_cols, part_rel, root_rel);
+		heap_close(part_rel, NoLock);
+	}
+
+	heap_close(root_rel, NoLock);
+}
+
+/*
  * inheritance_planner
  *	  Generate Paths in the case where the result relation is an
  *	  inheritance set.
@@ -1060,6 +1098,7 @@ inheritance_planner(PlannerInfo *root)
 	Query	   *parent_parse;
 	Bitmapset  *parent_relids = bms_make_singleton(top_parentRTindex);
 	PlannerInfo **parent_roots = NULL;
+	bool		part_cols_updated = false;
 
 	Assert(parse->commandType != CMD_INSERT);
 
@@ -1130,10 +1169,23 @@ inheritance_planner(PlannerInfo *root)
 	parent_rte = rt_fetch(top_parentRTindex, root->parse->rtable);
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
+		Bitmapset	*all_part_cols = NULL;
+
 		nominalRelation = top_parentRTindex;
 		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
+
+		/*
+		 * Retrieve the partition key columns of all the partitioned tables,
+		 * so that we can check whether any of the columns being updated is
+		 * a partition key column of any of those tables.
+		 */
+		get_all_partition_cols(root->parse->rtable, top_parentRTindex,
+							 partitioned_rels, &all_part_cols);
+
+		if (bms_overlap(all_part_cols, parent_rte->updatedCols))
+			part_cols_updated = true;
 	}
 
 	/*
@@ -1471,6 +1523,7 @@ inheritance_planner(PlannerInfo *root)
 									 parse->canSetTag,
 									 nominalRelation,
 									 partitioned_rels,
+									 part_cols_updated,
 									 resultRelations,
 									 subpaths,
 									 subroots,
@@ -2088,6 +2141,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 										parse->canSetTag,
 										parse->resultRelation,
 										NIL,
+										false,
 										list_make1_int(parse->resultRelation),
 										list_make1(path),
 										list_make1(root),
@@ -6118,6 +6172,10 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
  *		Returns a list of the RT indexes of the partitioned child relations
  *		with rti as the root parent RT index.
  *
+ * If all_part_cols_p is non-NULL, *all_part_cols_p is set to a bitmapset
+ * of all partitioning columns used by the partitioned table or any
+ * descendant.
+ *
  * Note: This function might get called even for range table entries that
  * are not partitioned tables; in such a case, it will simply return NIL.
  */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 26567cb..326c858 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3162,6 +3162,8 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partitioned_rels' is an integer list of RT indexes of non-leaf tables in
  *		the partition tree, if this is an UPDATE/DELETE to a partitioned table.
  *		Otherwise NIL.
+ * 'part_cols_updated' is true if any partitioning columns are being updated,
+ *		either in the named relation or in a descendant partitioned table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
  * 'subpaths' is a list of Path(s) producing source data (one per rel)
  * 'subroots' is a list of PlannerInfo structs (one per rel)
@@ -3175,6 +3177,7 @@ ModifyTablePath *
 create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
@@ -3242,6 +3245,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->canSetTag = canSetTag;
 	pathnode->nominalRelation = nominalRelation;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
+	pathnode->part_cols_updated = part_cols_updated;
 	pathnode->resultRelations = resultRelations;
 	pathnode->subpaths = subpaths;
 	pathnode->subroots = subroots;
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 454a940..b714bc3 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -80,8 +80,8 @@ extern void check_new_partition_bound(char *relname, Relation parent,
 extern Oid	get_partition_parent(Oid relid);
 extern List *get_qual_from_partbound(Relation rel, Relation parent,
 						PartitionBoundSpec *spec);
-extern List *map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+extern List *map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
@@ -99,6 +99,9 @@ extern int get_partition_for_tuple(PartitionDispatch *pd,
 						EState *estate,
 						PartitionDispatchData **failed_at,
 						TupleTableSlot **failed_slot);
+extern void pull_child_partition_columns(Bitmapset **bitmapset,
+							 Relation rel,
+							 Relation parent);
 extern Oid	get_default_oid_from_partdesc(PartitionDesc partdesc);
 extern Oid	get_default_partition_oid(Oid parentId);
 extern void update_default_partition_oid(Oid parentId, Oid defaultPartId);
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 2f54031..8e2bf5f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,7 +187,10 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
@@ -207,6 +210,8 @@ extern void EvalPlanQualSetTuple(EPQState *epqstate, Index rti,
 					 HeapTuple tuple);
 extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
 extern void ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionDispatch **pd,
@@ -218,6 +223,8 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot, EState *estate);
 
 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
 extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 9187f7a..9ba1976 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -979,15 +979,32 @@ typedef struct ModifyTableState
 	int			mt_num_partitions;	/* Number of members in the following
 									 * arrays */
 	ResultRelInfo **mt_partitions;	/* Per partition result relation pointers */
-	TupleConversionMap **mt_partition_tupconv_maps;
-	/* Per partition tuple conversion map */
+
+	/*
+	 * Per partition conversion map to convert tuples from root to leaf
+	 * partition
+	 */
+	TupleConversionMap **mt_perleaf_parentchild_maps;
+
+	/*
+	 * Per partition conversion map to convert tuples from leaf partition to
+	 * root
+	 */
+	TupleConversionMap **mt_perleaf_childparent_maps;
+
+	/*
+	 * Per subplan conversion map to convert tuples from leaf partition to
+	 * root partitioned table
+	 */
+	TupleConversionMap **mt_persubplan_childparent_maps;
+
 	TupleTableSlot *mt_partition_tuple_slot;
+	TupleTableSlot *mt_rootpartition_tuple_slot;
+
 	struct TransitionCaptureState *mt_transition_capture;
 	/* controls transition table population for specified operation */
 	struct TransitionCaptureState *mt_oc_transition_capture;
 	/* controls transition table population for INSERT...ON CONFLICT UPDATE */
-	TupleConversionMap **mt_transition_tupconv_maps;
-	/* Per plan/partition tuple conversion */
 } ModifyTableState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index a382331..6981f58 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -219,6 +219,7 @@ typedef struct ModifyTable
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	int			resultRelIndex; /* index of first resultRel in plan's list */
 	int			rootResultRelIndex; /* index of the partitioned table root */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 48e6012..432f17e 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1633,6 +1633,7 @@ typedef struct ModifyTablePath
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	List	   *subpaths;		/* Path(s) producing source data */
 	List	   *subroots;		/* per-target-table PlannerInfos */
@@ -2075,6 +2076,10 @@ typedef struct AppendRelInfo
  * The child_rels list must contain at least one element, because the parent
  * partitioned table is itself counted as a child.
  *
+ * all_part_cols contains all attribute numbers from the parent that are
+ * used as partitioning columns by the parent or some descendant that is
+ * itself partitioned.
+ *
  * These structs are kept in the PlannerInfo node's pcinfo_list.
  */
 typedef struct PartitionedChildRelInfo
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e372f88..b38f2f1 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -238,6 +238,7 @@ extern ModifyTablePath *create_modifytable_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index cef70b1..a49980b 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,36 +198,367 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+---------------------------
+-- UPDATE with row movement
+---------------------------
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+create table part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+alter table range_parted attach partition part_b_20_b_30 for values from ('b', 20) to ('b', 30);
+create table part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+alter table range_parted attach partition part_b_10_b_20 for values from ('b', 10) to ('b', 20);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) partition by range (d);
+create table part_d_1_15 partition of part_c_100_200 for values from (1) to (15);
+create table part_d_15_20 partition of part_c_100_200 for values from (15) to (20);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+(6 rows)
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+             QUERY PLAN              
+-------------------------------------
+ Update on range_parted
+   Update on part_a_1_a_10
+   Update on part_a_10_a_20
+   Update on part_b_1_b_10
+   Update on part_c_1_100
+   Update on part_d_1_15
+   Update on part_d_15_20
+   Update on part_b_20_b_30
+   ->  Seq Scan on part_a_1_a_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_a_10_a_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_1_b_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_1_100
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_1_15
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_15_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_20_b_30
+         Filter: (c > '97'::numeric)
+(22 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_100_200 set c = c - 20, d = c where c = 105;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (null, 85, b, 15, 105).
+-- fail (no partition key update, so no attempt to move tuple, but "a = 'a'" violates partition constraint enforced by root partition)
+update part_b_10_b_20 set a = 'a';
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (null, 1, 96, 12, a).
+-- success; partition key update, no constraint violation
+update range_parted set d = d - 10 where d > 10;
+-- success; no partition key update, no constraint violation
+update range_parted set e = d;
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d | e 
+----------------+---+----+-----+---+---
+ part_a_10_a_20 | a | 10 | 200 | 1 | 1
+ part_a_1_a_10  | a |  1 |   1 | 1 | 1
+ part_d_1_15    | b | 12 | 116 | 1 | 1
+ part_d_1_15    | b | 13 | 117 | 2 | 2
+ part_d_1_15    | b | 15 | 125 | 6 | 6
+ part_d_1_15    | b | 17 | 125 | 9 | 9
+(6 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_d_1_15" violates partition constraint
+DETAIL:  Failing row contains (2, 117, b, 7, 2).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_a_1_a_10 | a |  4 | 200 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+(6 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120, 1, 1).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120, 1, 1).
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+ part_d_1_15   | b | 15 | 199 | 1 | 1
+(6 rows)
+
+-- cleanup
+drop view upview;
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+ range_parted  | a | b  | c  | d  | e 
+---------------+---+----+----+----+---
+ (b,15,95,16,) | b | 15 | 95 | 16 | 
+ (b,17,95,19,) | b | 17 | 95 | 19 | 
+(2 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  95 | 16 | 
+ part_c_1_100   | b | 17 |  95 | 19 | 
+(6 rows)
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,110,1,), (b,13,98,2,), (b,15,106,16,), (b,17,106,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 13 |  98 |  2 | 
+ part_d_1_15    | b | 12 | 110 |  1 | 
+ part_d_15_20   | b | 15 | 106 | 16 | 
+ part_d_15_20   | b | 17 | 106 | 19 | 
+(6 rows)
+
+:init_range_parted;
+-- Enabling OLD TABLE capture for both DELETE as well as UPDATE stmt triggers
+-- should not cause DELETEd rows to be captured twice. Similar thing for
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,146,1,), (b,13,147,2,), (b,15,155,16,), (b,17,155,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_1_15    | b | 12 | 146 |  1 | 
+ part_d_1_15    | b | 13 | 147 |  2 | 
+ part_d_15_20   | b | 15 | 155 | 16 | 
+ part_d_15_20   | b | 17 | 155 | 19 | 
+(6 rows)
+
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d1_15 before update or insert on part_d_1_15
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d15_20 before update or insert on part_d_15_20
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 15 |  98 |  2 | 
+ part_d_1_15    | b | 15 | 110 |  1 | 
+ part_d_15_20   | b | 17 | 106 | 16 | 
+ part_d_15_20   | b | 19 | 106 | 19 | 
+(6 rows)
+
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_1_15    | b | 15 | 146 |  1 | 
+ part_d_1_15    | b | 16 | 147 |  2 | 
+ part_d_15_20   | b | 17 | 155 | 16 | 
+ part_d_15_20   | b | 19 | 155 | 19 | 
+(6 rows)
+
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_d1_15 ON part_d_1_15;
+drop trigger trig_d15_20 ON part_d_15_20;
+drop function func_parted_mod_b();
+-- statement triggers with update row movement
+---------------------------------------------------
+:init_range_parted;
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_1_15
+create trigger d1_delete_trig
+  after delete on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_update_trig
+  after update on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_insert_trig
+  after insert on part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+create trigger d15_delete_trig
+  after delete on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_update_trig
+  after update on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_insert_trig
+  after insert on part_d_15_20 for each statement execute procedure trigfunc();
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+NOTICE:  trigger = parent_update_trig fired on table range_parted during UPDATE
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 150 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  55 | 16 | 
+ part_c_1_100   | b | 17 |  55 | 19 | 
+(6 rows)
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger d1_delete_trig ON part_d_1_15;
+drop trigger d1_update_trig ON part_d_1_15;
+drop trigger d1_insert_trig ON part_d_1_15;
+drop trigger d15_delete_trig ON part_d_15_20;
+drop trigger d15_update_trig ON part_d_15_20;
+drop trigger d15_insert_trig ON part_d_15_20;
+drop table mintab;
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
-                                  Table "public.part_def"
- Column |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description 
---------+---------+-----------+----------+---------+----------+--------------+-------------
- a      | text    |           |          |         | extended |              | 
- b      | integer |           |          |         | plain    |              | 
+                                       Table "public.part_def"
+ Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description 
+--------+-------------------+-----------+----------+---------+----------+--------------+-------------
+ a      | text              |           |          |         | extended |              | 
+ b      | bigint            |           |          |         | plain    |              | 
+ c      | numeric           |           |          |         | main     |              | 
+ d      | integer           |           |          |         | plain    |              | 
+ e      | character varying |           |          |         | extended |              | 
 Partition of: range_parted DEFAULT
-Partition constraint: (NOT (((a = 'a'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'a'::text) AND (b >= 10) AND (b < 20)) OR ((a = 'b'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'b'::text) AND (b >= 10) AND (b < 20))))
+Partition constraint: (NOT (((a = 'a'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'a'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'b'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '20'::bigint) AND (b < '30'::bigint))))
 
 insert into range_parted values ('c', 9);
 -- ok
@@ -235,7 +566,55 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 ERROR:  new row for relation "part_def" violates partition constraint
-DETAIL:  Failing row contains (a, 9).
+DETAIL:  Failing row contains (a, 9, null, null, null).
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from non-default to default partition.
+-- Fail, default partition is not under part_a_10_a_20;
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+ERROR:  new row for relation "part_a_10_a_20" violates partition constraint
+DETAIL:  Failing row contains (ad, 10, 200, 1, null).
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+ partname | a  | b  |  c  | d  | e 
+----------+----+----+-----+----+---
+ part_def | ad |  1 |   1 |  1 | 
+ part_def | ad | 10 | 200 |  1 | 
+ part_def | bd | 12 |  96 |  1 | 
+ part_def | bd | 13 |  97 |  2 | 
+ part_def | bd | 15 | 105 | 16 | 
+ part_def | bd | 17 | 105 | 19 | 
+ part_def | d  |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
 create table list_parted (
 	a text,
 	b int
@@ -250,6 +629,110 @@ ERROR:  new row for relation "list_default" violates partition constraint
 DETAIL:  Failing row contains (a, 10).
 -- ok
 update list_default set a = 'x' where a = 'd';
--- cleanup
+drop table list_parted;
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+ERROR:  new row for relation "sub_part2" violates partition constraint
+DETAIL:  Failing row contains (2, 10, 2).
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Triggers can cause UPDATE row movement if it modified partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+drop trigger parted_mod_b ON sub_part1 ;
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+drop trigger trig_skip_delete ON sub_part1 ;
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 |  1 | 70
+ list_part1 | 2 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop table non_parted;
+drop function func_parted_mod_b();
 drop table range_parted;
 drop table list_parted;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 66d1fec..0ec5bb2 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,25 +107,229 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+
+---------------------------
+-- UPDATE with row movement
+---------------------------
+
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+create table part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+alter table range_parted attach partition part_b_20_b_30 for values from ('b', 20) to ('b', 30);
+create table part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+alter table range_parted attach partition part_b_10_b_20 for values from ('b', 10) to ('b', 20);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) partition by range (d);
+create table part_d_1_15 partition of part_c_100_200 for values from (1) to (15);
+create table part_d_15_20 partition of part_c_100_200 for values from (15) to (20);
+
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+
+create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_100_200 set c = c - 20, d = c where c = 105;
+-- fail (no partition key update, so no attempt to move tuple, but "a = 'a'" violates partition constraint enforced by root partition)
+update part_b_10_b_20 set a = 'a';
+-- success; partition key update, no constraint violation
+update range_parted set d = d - 10 where d > 10;
+-- success; no partition key update, no constraint violation
+update range_parted set e = d;
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+:show_data;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+:show_data;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+
+:show_data;
+
+-- cleanup
+drop view upview;
+
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+:show_data;
+
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+
+-- Enabling OLD TABLE capture for both DELETE as well as UPDATE stmt triggers
+-- should not cause DELETEd rows to be captured twice. Similar thing for
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d1_15 before update or insert on part_d_1_15
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d15_20 before update or insert on part_d_15_20
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_d1_15 ON part_d_1_15;
+drop trigger trig_d15_20 ON part_d_15_20;
+drop function func_parted_mod_b();
+
+
+-- statement triggers with update row movement
+---------------------------------------------------
+
+:init_range_parted;
+
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_d_1_15
+create trigger d1_delete_trig
+  after delete on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_update_trig
+  after update on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_insert_trig
+  after insert on part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+create trigger d15_delete_trig
+  after delete on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_update_trig
+  after update on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_insert_trig
+  after insert on part_d_15_20 for each statement execute procedure trigfunc();
+
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger d1_delete_trig ON part_d_1_15;
+drop trigger d1_update_trig ON part_d_1_15;
+drop trigger d1_insert_trig ON part_d_1_15;
+drop trigger d15_delete_trig ON part_d_15_20;
+drop trigger d15_update_trig ON part_d_15_20;
+drop trigger d15_insert_trig ON part_d_15_20;
+
+drop table mintab;
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
 
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
 insert into range_parted values ('c', 9);
@@ -134,6 +338,21 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+-- Update row movement from non-default to default partition.
+-- Fail, default partition is not under part_a_10_a_20;
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
 create table list_parted (
 	a text,
 	b int
@@ -148,6 +367,82 @@ update list_default set a = 'a' where a = 'd';
 -- ok
 update list_default set a = 'x' where a = 'd';
 
--- cleanup
+drop table list_parted;
+
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+
+
+-----------
+-- Triggers can cause UPDATE row movement if it modified partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger parted_mod_b ON sub_part1 ;
+
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger trig_skip_delete ON sub_part1 ;
+
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+drop table non_parted;
+
+drop function func_parted_mod_b();
 drop table range_parted;
 drop table list_parted;
#177Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#176)
Re: UPDATE of partition key

On Wed, Oct 4, 2017 at 9:51 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Preparatory patches :
0001-Prepare-for-re-using-UPDATE-result-rels-during-tuple.patch
0002-Prevent-a-redundant-ConvertRowtypeExpr-node.patch
Main patch :
update-partition-key_v20.patch

Committed 0001 with a few tweaks and 0002 unchanged. Please check
whether everything looks OK.

Is anybody still reviewing the main patch here? (It would be good if
the answer is "yes".)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#178Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Robert Haas (#177)
Re: UPDATE of partition key

On 2017/10/13 6:18, Robert Haas wrote:

Is anybody still reviewing the main patch here? (It would be good if
the answer is "yes".)

I am going to try to look at the latest version over the weekend and early
next week.

Thanks,
Amit


#179Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Khandekar (#176)
Re: UPDATE of partition key

Hi Amit.

On 2017/10/04 22:51, Amit Khandekar wrote:

Main patch :
update-partition-key_v20.patch

Guess you're already working on it, but the patch needs a rebase. A couple
of hunks in the patch to execMain.c and nodeModifyTable.c fail to apply.

Meanwhile a few comments:

+void
+pull_child_partition_columns(Bitmapset **bitmapset,
+                            Relation rel,
+                            Relation parent)

Nitpick: don't we normally list the output argument(s) at the end? Also,
"bitmapset" could be renamed to something that conveys what it contains?

+       if (partattno != 0)
+           child_keycols =
+               bms_add_member(child_keycols,
+                              partattno -
FirstLowInvalidHeapAttributeNumber);
+   }
+   foreach(lc, partexprs)
+   {

Elsewhere (in quite a few places), we don't iterate over partexprs
separately like this. I'm not saying it is bad, just noting that it's
different from other places.

+ * the transition tuplestores can be built. Furthermore, if the transition
+ *  capture is happening for UPDATEd rows being moved to another
partition due
+ *  partition-key change, then this function is called once when the row is
+ *  deleted (to capture OLD row), and once when the row is inserted to
another
+ *  partition (to capture NEW row). This is done separately because
DELETE and
+ *  INSERT happen on different tables.

Extra space at the beginning from the 2nd line onwards.

+ (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup
== NULL))))

Is there some reason why a bitwise operator is used here?

+ * 'update_rri' has the UPDATE per-subplan result rels.

Could you explain why they are being received as input here?

+ * 'perleaf_parentchild_maps' receives an array of TupleConversionMap objects
+ *     with on entry for every leaf partition (required to convert input
tuple
+ *     based on the root table's rowtype to a leaf partition's rowtype after
+ *     tuple routing is done)

Could this be named leaf_tupconv_maps, maybe? It perhaps makes clear that
they are maps needed for "tuple conversion". And the other field holding
the reverse map as leaf_rev_tupconv_maps. Either that or use underscores
to separate words, but then it gets too long I guess.

+       tuple = ConvertPartitionTupleSlot(mtstate,
+
mtstate->mt_perleaf_parentchild_maps[leaf_part_index],
+

The 2nd line here seems to have gone over 80 characters.

ISTM, ConvertPartitionTupleSlot has a very specific job and a bit complex
interface. I guess it could simply have the following interface:

static HeapTuple ConvertPartitionTuple(ModifyTableState *mtstate,
HeapTuple tuple, bool is_update);

And figure out, based on the value of is_update, which map to use and
which slot to set *p_new_slot to (what is now "new_slot" argument).
You're getting mtstate here anyway, which contains all the information you
need. It seems better to make that choice (which map and which slot to
use) part of the function's implementation if we're having this function
at all, imho. Maybe I'm missing some details there, but my point still
remains that we should try to put more logic into that function instead of
having it just do the mechanical tuple conversion.

+         * We have already checked partition constraints above, so skip them
+         * below.

How about: ", so skip checking here."?

ISTM, the newly introduced logic in ExecSetupTransitionCaptureState() to
try to reuse the per-subplan child-to-parent map as per-leaf
child-to-parent map could be simplified a bit. I mean the following code:

+    /*
+     * But for Updates, we can share the per-subplan maps with the per-leaf
+     * maps.
+     */
+    update_rri_index = 0;
+    update_rri = mtstate->resultRelInfo;
+    if (mtstate->mt_nplans > 0)
+        cur_reloid = RelationGetRelid(update_rri[0].ri_RelationDesc);
-        /* Choose the right set of partitions */
-        if (mtstate->mt_partition_dispatch_info != NULL)
+    for (i = 0; i < numResultRelInfos; ++i)
+    {
<snip>

How about (pseudo-code):

j = 0;
for (i = 0; i < n_leaf_parts; i++)
{
if (j < n_subplans && leaf_rri[i]->oid == subplan_rri[j]->oid)
{
leaf_childparent_map[i] = subplan_childparent_map[j];
j++;
}
else
{
leaf_childparent_map[i] = new map
}
}

I think the above would also be useful in ExecSetupPartitionTupleRouting()
where you've added similar code to try to reuse per-subplan ResultRelInfos.

In ExecInitModifyTable(), can we try to minimize the number of places
where update_tuple_routing_needed is being set? Currently, it's being set
in 3 places:

+ bool update_tuple_routing_needed = node->part_cols_updated;

&

+        /*
+         * If this is an UPDATE and a BEFORE UPDATE trigger is present,
we may
+         * need to do update tuple routing.
+         */
+        if (resultRelInfo->ri_TrigDesc &&
+            resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+            operation == CMD_UPDATE)
+            update_tuple_routing_needed = true;

&

+    /* Decide whether we need to perform update tuple routing. */
+    if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+        update_tuple_routing_needed = false;

In the following:

         ExecSetupPartitionTupleRouting(rel,
+                                       (operation == CMD_UPDATE ?
+                                        mtstate->resultRelInfo : NULL),
+                                       (operation == CMD_UPDATE ? nplans
: 0),

Can the second parameter be made to not span two lines? It was a bit hard
for me to see that there are two new parameters.

+ * Construct mapping from each of the resultRelInfo attnos to the root

Maybe it's odd to say "resultRelInfo attno", because it's really the
underlying partition whose attnos we're talking about as being possibly
different from the root table's attnos.

+ * descriptor. In such case we need to convert tuples to the root

s/In such case/In such a case,/

By the way, I've seen in a number of places that the patch calls the root
table a "partition". It's not just in comments; a variable also appears to
be given a name which contains rootpartition. I can see only one instance
where root is called a partition in the existing source code, but it seems
to have been introduced only recently:

allpaths.c:1333: * A root partition will already have a

+         * qual for each partition. Note that, if there are SubPlans in
there,
+         * they all end up attached to the one parent Plan node.

The sentence starting with "Note that, " is a bit unclear.

+        Assert(update_tuple_routing_needed ||
+               (operation == CMD_INSERT &&
+                list_length(node->withCheckOptionLists) == 1 &&
+                mtstate->mt_nplans == 1));

The comment I complained about above is perhaps about this Assert.

-            List       *mapped_wcoList;
+            List       *mappedWco;

Not sure why this rename. After it, the name is now inconsistent with the
code above which handles the non-partitioned case, which still calls it
wcoList. Maybe it's because you introduced firstWco and then this line:

+ firstWco = linitial(node->withCheckOptionLists);

but note that each member of node->withCheckOptionLists is itself a list,
hence the original naming. Also, further below, you're assigning mappedWco
to a List * field.

+ resultRelInfo->ri_WithCheckOptions = mappedWco;

Comments on the optimizer changes:

+get_all_partition_cols(List *rtables,

Did you mean rtable?

get_all_partition_cols() seems to go over the rtable as many times as
there are partitioned tables in the tree. Is there a way to do this work
somewhere else? Maybe when the partitioned_rels list is built in the
first place. But that would require us to make changes to extract
partition columns in some place (prepunion.c) where it's hard to justify
why it's being done there at all.

+        get_all_partition_cols(root->parse->rtable, top_parentRTindex,
+                             partitioned_rels, &all_part_cols);

Two more spaces needed on the 2nd line.

+ * If all_part_cols_p is non-NULL, *all_part_cols_p is set to a bitmapset
+ * of all partitioning columns used by the partitioned table or any
+ * descendent.
+ *

Dead comment? Aha, so here's where all_part_cols was being set before...

+ TupleTableSlot *mt_rootpartition_tuple_slot;

I guess I was complaining about this field where you call root a
partition. Maybe, mt_root_tuple_slot would suffice.

Thanks again for working on this.

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#180Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Langote (#179)
1 attachment(s)
Re: UPDATE of partition key

On 16 October 2017 at 08:28, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

Hi Amit.

On 2017/10/04 22:51, Amit Khandekar wrote:

Main patch :
update-partition-key_v20.patch

Guess you're already working on it but the patch needs a rebase. A couple
of hunks in the patch to execMain.c and nodeModifyTable.c fail.

Thanks for taking up this review Amit. Attached is the rebased
version. Will get back on your review comments and updated patch soon.
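
For anyone trying the rebased patch, here is a minimal sketch of the behavior
it enables (table names are made up; without the patch, the UPDATE below
fails with a partition constraint violation):

```sql
CREATE TABLE measurement (city_id int, logdate date)
    PARTITION BY RANGE (logdate);
CREATE TABLE measurement_y2017 PARTITION OF measurement
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');
CREATE TABLE measurement_y2018 PARTITION OF measurement
    FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');

INSERT INTO measurement VALUES (1, '2017-06-01');

-- With the patch applied, this row moves from measurement_y2017 to
-- measurement_y2018; internally it is a DELETE plus an INSERT.
UPDATE measurement SET logdate = '2018-06-01' WHERE city_id = 1;
```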

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

update-partition-key_v20_rebased.patch (application/octet-stream)
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index b05a9c2..5a436a1 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -2993,6 +2993,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition, one whose partition constraint this row satisfies.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3285,9 +3290,20 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</> causes a row to move from one partition to
+       another, there is a chance that another concurrent <command>UPDATE</> or
+       <command>DELETE</> misses this row. Suppose that, during the row
+       movement, the row is still visible to the concurrent session, which is
+       about to do an <command>UPDATE</> or <command>DELETE</> operation on
+       the same row. This DML operation can silently miss this row if the row
+       now gets deleted from the partition by the first session as part of
+       its <command>UPDATE</> row movement. In such a case, the concurrent
+       <command>UPDATE</>/<command>DELETE</>, being unaware of the row
+       movement, interprets that the row has just been deleted, so there is
+       nothing to be done for this row. Whereas, in the usual case where the
+       table is not partitioned, or where there is no row movement, the
+       second session would have identified the newly updated row and carried
+       out the <command>UPDATE</>/<command>DELETE</> on this new row version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 9dcbbd0..1b61bde 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</> and
+   <command>INSERT</> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</> or <command>DELETE</> on the same row may miss
+   this row. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index f5f74af..99b271f 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -154,6 +154,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it will be performed as a
+    <command>DELETE</command> from the original partition followed by
+    <command>INSERT</command> into the new partition. In this case, all
+    row-level <literal>BEFORE</> <command>UPDATE</command> triggers and all
+    row-level <literal>BEFORE</> <command>DELETE</command> triggers are fired
+    on the original partition. Then all row-level <literal>BEFORE</>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</>
+    triggers are concerned, <literal>AFTER</> <command>DELETE</command> and
+    <literal>AFTER</> <command>INSERT</command> triggers are applied; but
+    <literal>AFTER</> <command>UPDATE</command> triggers are not applied
+    because the <command>UPDATE</command> has been converted to a
+    <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 07fdf66..f5dec3c 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1191,7 +1191,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
 
 /*
  * map_partition_varattnos - maps varattno of any Vars in expr from the
- * parent attno to partition attno.
+ * attnos of 'from_rel' to the attnos of 'to_rel'. Each of the rels can be
+ * either a leaf partition or a partitioned table.
  *
  * We must allow for cases where physical attnos of a partition can be
  * different from the parent's.
@@ -1204,8 +1205,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
  * are working on Lists, so it's less messy to do the casts internally.
  */
 List *
-map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row)
 {
 	bool		my_found_whole_row = false;
@@ -1214,14 +1215,14 @@ map_partition_varattnos(List *expr, int target_varno,
 	{
 		AttrNumber *part_attnos;
 
-		part_attnos = convert_tuples_by_name_map(RelationGetDescr(partrel),
-												 RelationGetDescr(parent),
+		part_attnos = convert_tuples_by_name_map(RelationGetDescr(to_rel),
+												 RelationGetDescr(from_rel),
 												 gettext_noop("could not convert row type"));
 		expr = (List *) map_variable_attnos((Node *) expr,
-											target_varno, 0,
+											fromrel_varno, 0,
 											part_attnos,
-											RelationGetDescr(parent)->natts,
-											RelationGetForm(partrel)->reltype,
+											RelationGetDescr(from_rel)->natts,
+											RelationGetForm(to_rel)->reltype,
 											&my_found_whole_row);
 	}
 
@@ -2525,6 +2526,77 @@ error_exit:
 }
 
 /*
+ * For each column of rel which is in the partition key or which appears
+ * in an expression which is in the partition key, translate the attribute
+ * number of that column according to the given parent, and add the resulting
+ * column number to the bitmapset, offset as we frequently do by
+ * FirstLowInvalidHeapAttributeNumber.
+ */
+void
+pull_child_partition_columns(Bitmapset **bitmapset,
+							 Relation rel,
+							 Relation parent)
+{
+	PartitionKey key = RelationGetPartitionKey(rel);
+	int16		partnatts = get_partition_natts(key);
+	List	   *partexprs = get_partition_exprs(key);
+	ListCell   *lc;
+	Bitmapset  *child_keycols = NULL;
+	int			i;
+	AttrNumber *map;
+	int			child_keycol = -1;
+
+	/*
+	 * First, compute the complete set of partition columns for this rel. For
+	 * compatibility with the API exposed by pull_varattnos, we offset the
+	 * column numbers by FirstLowInvalidHeapAttributeNumber.
+	 */
+	for (i = 0; i < partnatts; i++)
+	{
+		AttrNumber	partattno = get_partition_col_attnum(key, i);
+
+		if (partattno != 0)
+			child_keycols =
+				bms_add_member(child_keycols,
+							   partattno - FirstLowInvalidHeapAttributeNumber);
+	}
+	foreach(lc, partexprs)
+	{
+		Node	   *expr = (Node *) lfirst(lc);
+
+		pull_varattnos(expr, 1, &child_keycols);
+	}
+
+	/*
+	 * Next, work out how to convert from the attribute numbers for the child
+	 * to the attribute numbers for the parent.
+	 */
+	map =
+		convert_tuples_by_name_map(RelationGetDescr(parent),
+								   RelationGetDescr(rel),
+								   gettext_noop("could not convert row type"));
+
+	/*
+	 * For each child key column we have identified, translate to the
+	 * corresponding parent key column.  Entry 0 in the map array corresponds
+	 * to attribute number 1, which corresponds to a bitmapset entry for 1 -
+	 * FirstLowInvalidHeapAttributeNumber.
+	 */
+	while ((child_keycol = bms_next_member(child_keycols, child_keycol)) >= 0)
+	{
+		int			kc = child_keycol + FirstLowInvalidHeapAttributeNumber;
+
+		Assert(kc > 0 && kc <= RelationGetNumberOfAttributes(rel));
+		*bitmapset =
+			bms_add_member(*bitmapset,
+						   map[kc - 1] - FirstLowInvalidHeapAttributeNumber);
+	}
+
+	/* Release memory. */
+	pfree(map);
+}
+
+/*
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 8006df3..404ce89 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2466,6 +2466,8 @@ CopyFrom(CopyState cstate)
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(cstate->rel,
+									   NULL,
+									   0,
 									   1,
 									   estate,
 									   &partition_dispatch_info,
@@ -2736,7 +2738,7 @@ CopyFrom(CopyState cstate)
 
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr || check_partition_constr)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 8d0345c..f3b8fc6 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2854,8 +2854,13 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 	{
 		HeapTuple	trigtuple;
 
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
+		/*
+		 * Note: if the UPDATE is converted into a DELETE+INSERT as part of an
+		 * update-partition-key operation, then this function is also called
+		 * separately for the DELETE and INSERT to capture transition table
+		 * rows. In such a case, either the old or the new tuple can be NULL.
+		 */
+		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
 			trigtuple = GetTupleForTrigger(estate,
 										   NULL,
 										   relinfo,
@@ -5414,7 +5419,12 @@ AfterTriggerPendingOnRel(Oid relid)
  *	triggers actually need to be queued.  It is also called after each row,
  *	even if there are no triggers for that event, if there are any AFTER
  *	STATEMENT triggers for the statement which use transition tables, so that
- *	the transition tuplestores can be built.
+ *	the transition tuplestores can be built. Furthermore, if the transition
+ *  capture is happening for UPDATEd rows being moved to another partition due
+ *  to a partition-key change, then this function is called once when the row
+ *  is deleted (to capture the OLD row), and once when the row is inserted
+ *  into the new partition (to capture the NEW row). This is done separately
+ *  because the DELETE and INSERT happen on different tables.
  *
  *	Transition tuplestores are built now, rather than when events are pulled
  *	off of the queue because AFTER ROW triggers are allowed to select from the
@@ -5463,12 +5473,27 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		bool		update_new_table = transition_capture->tcs_update_new_table;
 		bool		insert_new_table = transition_capture->tcs_insert_new_table;;
 
-		if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_old_table))
+		/*
+		 * For capturing transition tuples for UPDATE events fired during
+		 * partition row movement, either oldtup or newtup can be NULL,
+		 * depending on whether the event is for the row being deleted from
+		 * the old partition or for the row being inserted into the new
+		 * partition. But in any case, oldtup should always be non-NULL for
+		 * DELETE events, and newtup should be non-NULL for INSERT events,
+		 * because for transition capture with partition row movement, INSERT
+		 * and DELETE events don't fire; only the UPDATE event is fired.
+		 */
+		Assert(!(event == TRIGGER_EVENT_DELETE && delete_old_table &&
+				 oldtup == NULL));
+		Assert(!(event == TRIGGER_EVENT_INSERT && insert_new_table &&
+				 newtup == NULL));
+
+		if (oldtup != NULL &&
+			((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_old_table)))
 		{
 			Tuplestorestate *old_tuplestore;
 
-			Assert(oldtup != NULL);
 			old_tuplestore = transition_capture->tcs_private->old_tuplestore;
 
 			if (map != NULL)
@@ -5481,12 +5506,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			else
 				tuplestore_puttuple(old_tuplestore, oldtup);
 		}
-		if ((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_new_table))
+		if (newtup != NULL &&
+			((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
+			(event == TRIGGER_EVENT_UPDATE && update_new_table)))
 		{
 			Tuplestorestate *new_tuplestore;
 
-			Assert(newtup != NULL);
 			new_tuplestore = transition_capture->tcs_private->new_tuplestore;
 
 			if (original_insert_tuple != NULL)
@@ -5506,7 +5531,8 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		if (trigdesc == NULL ||
 			(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
 			(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+			(event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
 			return;
 	}
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 9689429..e7ff3bf 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -104,9 +104,6 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
 									 int maxfieldlen);
 static void EvalPlanQualStart(EPQState *epqstate, EState *parentestate,
 				  Plan *planTree);
-static void ExecPartitionCheck(ResultRelInfo *resultRelInfo,
-				   TupleTableSlot *slot, EState *estate);
-
 /*
  * Note that GetUpdatedColumns() also exists in commands/trigger.c.  There does
  * not appear to be any good header to put it into, given the structures that
@@ -1849,15 +1846,10 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
 /*
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  */
-static void
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
-	Relation	rel = resultRelInfo->ri_RelationDesc;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	Bitmapset  *modifiedCols;
-	Bitmapset  *insertedCols;
-	Bitmapset  *updatedCols;
 	ExprContext *econtext;
 
 	/*
@@ -1885,52 +1877,66 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
 	 */
-	if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+	return ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext);
+}
+
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
+	/* See the comments in ExecConstraints. */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
+									 gettext_noop("could not convert row type"));
+		if (map != NULL)
 		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
-										 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecSetSlotDescriptor(slot, tupdesc);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+			tuple = do_convert_tuple(tuple, map);
+			ExecSetSlotDescriptor(slot, tupdesc);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("new row for relation \"%s\" violates partition constraint",
-						RelationGetRelationName(orig_rel)),
-				 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+			 errmsg("new row for relation \"%s\" violates partition constraint",
+					RelationGetRelationName(orig_rel)),
+			 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if requested,
+ * checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1938,7 +1944,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2054,8 +2061,9 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 }
 
 
@@ -3239,6 +3247,10 @@ EvalPlanQualEnd(EPQState *epqstate)
  * ExecSetupPartitionTupleRouting - set up information needed during
  * tuple routing for partitioned tables
  *
+ * 'update_rri' has the UPDATE per-subplan result rels.
+ * 'num_update_rri' : number of UPDATE per-subplan result rels. For INSERT,
+ *      this is 0.
+ *
  * Output arguments:
  * 'pd' receives an array of PartitionDispatch objects with one entry for
  *		every partitioned table in the partition tree
@@ -3262,6 +3274,8 @@ EvalPlanQualEnd(EPQState *epqstate)
  */
 void
 ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionDispatch **pd,
@@ -3274,7 +3288,9 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
+	ResultRelInfo *leaf_part_arr;
+	ResultRelInfo *cur_update_rri;
+	Oid			cur_reloid = InvalidOid;
 
 	/*
 	 * Get the information about the partition tree after locking all the
@@ -3288,6 +3304,33 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	*tup_conv_maps = (TupleConversionMap **) palloc0(*num_partitions *
 													 sizeof(TupleConversionMap *));
 
+	if (num_update_rri != 0)
+	{
+		/*
+		 * For Updates, if the leaf partition is already present in the
+		 * per-subplan result rels, we re-use that rather than initialize a
+		 * new result rel. The per-subplan resultrels and the resultrels of
+		 * the leaf partitions are both in the same canonical order. So while
+		 * going through the leaf partition oids, we need to keep track of the
+		 * next per-subplan result rel to be looked for in the leaf partition
+		 * resultrels. So, set the position of cur_update_rri to the first
+		 * per-subplan result rel, and then shift it as we find them one by
+		 * one while scanning the leaf partition oids.
+		 */
+		cur_update_rri = update_rri;
+		cur_reloid = RelationGetRelid(cur_update_rri->ri_RelationDesc);
+	}
+	else
+	{
+		/*
+		 * For inserts, we need to create all new result rels, so avoid
+		 * repeated pallocs by allocating memory for all the result rels in
+		 * bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(*num_partitions *
+												  sizeof(ResultRelInfo));
+	}
+
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
 	 * given partition's rowtype.  It is attached to the caller-specified node
@@ -3296,20 +3339,70 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	 */
 	*partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = (ResultRelInfo *) palloc0(*num_partitions *
-											  sizeof(ResultRelInfo));
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (num_update_rri != 0)
+		{
+			/* Is this leaf partition present in the UPDATE result rels? */
+			if (cur_reloid == leaf_oid)
+			{
+				Assert(cur_update_rri <= update_rri + num_update_rri - 1);
+
+				leaf_part_rri = cur_update_rri;
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required when converting tuple as per root
+				 * partition tuple descriptor. When generating the update
+				 * plans, this was not set.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+
+				cur_update_rri++;
+
+				/*
+				 * If this was the last UPDATE resultrel, indicate that by
+				 * invalidating the cur_reloid.
+				 */
+				if (cur_update_rri == update_rri + num_update_rri)
+					cur_reloid = InvalidOid;
+				else
+					cur_reloid = RelationGetRelid(cur_update_rri->ri_RelationDesc);
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = leaf_part_arr + i;
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in *partitions are eventually
-		 * closed by the caller.
+		 * If we didn't open the partition rel, it means we haven't
+		 * initialized the result rel either.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above including the leaf
+			 * partitions. Note that each of the newly opened relations in
+			 * *partitions are eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri,
+							  partrel,
+							  resultRTindex,
+							  rel,
+							  estate->es_instrument);
+		}
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
@@ -3319,14 +3412,10 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
 													 gettext_noop("could not convert row type"));
 
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  resultRTindex,
-						  rel,
-						  estate->es_instrument);
-
 		/*
-		 * Verify result relation is a valid target for INSERT.
+		 * Verify the result relation is a valid target for an insert
+		 * operation. We do this even for updates, because here we are doing
+		 * tuple routing, which is effectively an insert.
 		 */
 		CheckValidResultRel(leaf_part_rri, CMD_INSERT);
 
@@ -3342,9 +3431,18 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		(*partitions)[i] = leaf_part_rri++;
+		(*partitions)[i] = leaf_part_rri;
 		i++;
 	}
+
+	/*
+	 * For UPDATE, we should have found all the per-subplan resultrels in the
+	 * leaf partitions; so cur_update_rri should be positioned just next to
+	 * the last per-subplan resultrel.
+	 */
+	Assert(num_update_rri == 0 ||
+		   (cur_reloid == InvalidOid &&
+			cur_update_rri == update_rri + num_update_rri));
 }
 
 /*
@@ -3370,8 +3468,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 	result = get_partition_for_tuple(pd, slot, estate,
 									 &failed_at, &failed_slot);
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index fb538c0..e11f7cb 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -401,7 +401,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -466,7 +466,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 0027d21..940ae29 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -45,6 +45,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -63,6 +64,11 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 bool canSetTag,
 					 TupleTableSlot **returning);
 
+static HeapTuple ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+										   TupleConversionMap *map,
+										   HeapTuple tuple,
+										   TupleTableSlot *new_slot,
+										   TupleTableSlot **p_old_slot);
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
  * target relation's rowtype
@@ -240,6 +246,38 @@ ExecCheckTIDVisible(EState *estate,
 	ReleaseBuffer(buffer);
 }
 
+/*
+ * ConvertPartitionTupleSlot -- convenience function for converting a tuple
+ * and storing it into the tuple slot provided through 'new_slot', which
+ * typically should be one of the dedicated partition tuple slots. The slot
+ * is passed back through the output parameter p_old_slot. If no map is
+ * present, p_old_slot is kept unchanged.
+ *
+ * Returns the converted tuple.
+ */
+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+						  TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot,
+						  TupleTableSlot **p_old_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor, as per converted tuple.
+	 */
+	*p_old_slot = new_slot;
+	Assert(new_slot != NULL);
+	ExecSetSlotDescriptor(new_slot, map->outdesc);
+	ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
 /* ----------------------------------------------------------------
  *		ExecInsert
  *
@@ -265,6 +303,7 @@ ExecInsert(ModifyTableState *mtstate,
 	Oid			newId;
 	List	   *recheckIndexes = NIL;
 	TupleTableSlot *result = NULL;
+	TransitionCaptureState *transition_capture = mtstate->mt_transition_capture;
 
 	/*
 	 * get the heap tuple out of the tuple table slot, making sure we have a
@@ -281,17 +320,49 @@ ExecInsert(ModifyTableState *mtstate,
 	if (mtstate->mt_partition_dispatch_info)
 	{
 		int			leaf_part_index;
-		TupleConversionMap *map;
+		ResultRelInfo *rootResultRelInfo;
+
+		/*
+		 * If the original operation is UPDATE, the root partition rel needs
+		 * to be fetched from mtstate->rootResultRelInfo.
+		 */
+		rootResultRelInfo = (mtstate->rootResultRelInfo ?
+							 mtstate->rootResultRelInfo : resultRelInfo);
+
+		/*
+		 * If the resultRelInfo is not the root partition (which happens for
+		 * UPDATE), we should convert the tuple into root partition's tuple
+		 * descriptor, since ExecFindPartition() starts the search from root.
+		 * The tuple conversion map list is in the order of
+		 * mtstate->resultRelInfo[], so to retrieve the one for this
+		 * resultRel, we need to know the position of the resultRel in
+		 * mtstate->resultRelInfo[].
+		 */
+		if (rootResultRelInfo != resultRelInfo &&
+			mtstate->mt_persubplan_childparent_maps != NULL)
+		{
+			int			map_index = resultRelInfo - mtstate->resultRelInfo;
+
+			/* resultRelInfo must be one of the per-subplan result rels. */
+			Assert(resultRelInfo >= mtstate->resultRelInfo &&
+				   resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1);
+
+			tuple = ConvertPartitionTupleSlot(mtstate,
+											  mtstate->mt_persubplan_childparent_maps[map_index],
+											  tuple,
+											  mtstate->mt_rootpartition_tuple_slot,
+											  &slot);
+		}
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
 		 * ExecFindPartition() does not return and errors out instead.
 		 * Otherwise, the returned value is to be used as an index into arrays
-		 * mt_partitions[] and mt_partition_tupconv_maps[] that will get us
+		 * mt_partitions[] and mt_perleaf_parentchild_maps[] that will get us
 		 * the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
-		leaf_part_index = ExecFindPartition(resultRelInfo,
+		leaf_part_index = ExecFindPartition(rootResultRelInfo,
 											mtstate->mt_partition_dispatch_info,
 											slot,
 											estate);
@@ -331,7 +402,7 @@ ExecInsert(ModifyTableState *mtstate,
 				 */
 				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
 				mtstate->mt_transition_capture->tcs_map =
-					mtstate->mt_transition_tupconv_maps[leaf_part_index];
+					mtstate->mt_perleaf_childparent_maps[leaf_part_index];
 			}
 			else
 			{
@@ -345,29 +416,17 @@ ExecInsert(ModifyTableState *mtstate,
 		}
 		if (mtstate->mt_oc_transition_capture != NULL)
 			mtstate->mt_oc_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[leaf_part_index];
+				mtstate->mt_perleaf_childparent_maps[leaf_part_index];
 
 		/*
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = mtstate->mt_partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = mtstate->mt_partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(mtstate,
+										  mtstate->mt_perleaf_parentchild_maps[leaf_part_index],
+										  tuple,
+										  mtstate->mt_partition_tuple_slot,
+										  &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -485,7 +544,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 		/* Check the constraints of the tuple */
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -621,9 +680,31 @@ ExecInsert(ModifyTableState *mtstate,
 		setLastTid(&(tuple->t_self));
 	}
 
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition NEW TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_new_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo, NULL,
+							 NULL,
+							 tuple,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * Now that we have captured the NEW TABLE row, any AR INSERT
+		 * triggers run below must not capture it again; arrange for that by
+		 * clearing transition_capture.
+		 */
+	}
+
 	/* AFTER ROW INSERT Triggers */
 	ExecARInsertTriggers(estate, resultRelInfo, tuple, recheckIndexes,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
 	list_free(recheckIndexes);
 
@@ -677,6 +758,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool *delete_skipped,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -684,6 +767,10 @@ ExecDelete(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
+	TransitionCaptureState *transition_capture = mtstate->mt_transition_capture;
+
+	if (delete_skipped)
+		*delete_skipped = true;
 
 	/*
 	 * get information on the (current) result relation
@@ -848,12 +935,39 @@ ldelete:;
 	if (canSetTag)
 		(estate->es_processed)++;
 
+	/* The delete has actually happened, so report that to the caller */
+	if (delete_skipped)
+		*delete_skipped = false;
+
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition OLD TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_old_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo,
+							 tupleid,
+							 oldtuple,
+							 NULL,
+							 NULL,
+							 transition_capture);
+
+		/*
+		 * Now that we have captured the OLD TABLE row, any AR DELETE
+		 * triggers run below must not capture it again; arrange for that by
+		 * clearing transition_capture.
+		 */
+	}
+
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -946,6 +1060,7 @@ ExecUpdate(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	TupleConversionMap *saved_tcs_map = NULL;
 
 	/*
 	 * abort the operation if not running transactions
@@ -1042,12 +1157,82 @@ lreplace:;
 								 resultRelInfo, slot, estate);
 
 		/*
+		 * If a partition check fails, try to move the row into the right
+		 * partition.
+		 */
+		if (resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate))
+		{
+			bool		delete_skipped;
+			TupleTableSlot *ret_slot;
+
+			/*
+			 * When an UPDATE is run directly on a leaf partition, we do not
+			 * have partition tuple routing set up. In that case, fail with
+			 * the partition constraint violation error.
+			 */
+			if (mtstate->mt_partition_dispatch_info == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for the DELETE; we want the rows
+			 * returned from the subsequent INSERT instead.
+			 */
+			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
+					   &delete_skipped, false, false);
+
+			/*
+			 * If for some reason the DELETE didn't happen (e.g. a trigger
+			 * prevented it, or the row was already deleted by ourselves, or
+			 * it was concurrently deleted by another transaction), then we
+			 * should skip the INSERT as well; otherwise, there would
+			 * effectively be one extra row inserted.
+			 *
+			 * For a normal UPDATE, the case where the tuple has been the
+			 * subject of a concurrent UPDATE or DELETE would be handled by
+			 * the EvalPlanQual machinery, but for an UPDATE that we've
+			 * translated into a DELETE from this partition and an INSERT into
+			 * some other partition, that's not available, because CTID chains
+			 * can't span relation boundaries.  We mimic the semantics to a
+			 * limited extent by skipping the INSERT if the DELETE fails to
+			 * find a tuple. This ensures that two concurrent attempts to
+			 * UPDATE the same tuple at the same time can't turn one tuple
+			 * into two, and that an UPDATE of a just-deleted tuple can't
+			 * resurrect it.
+			 */
+			if (delete_skipped)
+				return NULL;
+
+			if (mtstate->mt_transition_capture)
+				saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
+
+			ret_slot = ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+
+			if (mtstate->mt_transition_capture)
+			{
+				/*
+				 * Revert to the transition capture map created for UPDATE;
+				 * otherwise the next UPDATE would incorrectly use the one
+				 * created for INSERT.
+				 */
+				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+				mtstate->mt_transition_capture->tcs_map = saved_tcs_map;
+			}
+			return ret_slot;
+		}
+
+		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We have already checked partition constraints above, so skip them
+		 * below.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1468,6 +1653,45 @@ fireASTriggers(ModifyTableState *node)
 }
 
 /*
+ * Set up per subplan tuple conversion map from child partition to root
+ * partitioned table. The map is needed for collecting transition tuples for
+ * AFTER triggers, and for UPDATE row movement.
+ */
+static void
+ExecSetupPerSubplanChildParentMap(ModifyTableState *mtstate)
+{
+	TupleConversionMap **tup_conv_maps;
+	TupleDesc	outdesc;
+	ResultRelInfo *resultRelInfo;
+	ResultRelInfo *rootRelInfo;
+	int			nplans = mtstate->mt_nplans;
+	int			i;
+
+	Assert(mtstate->operation != CMD_INSERT);
+
+	if (mtstate->mt_persubplan_childparent_maps != NULL)
+		return;
+
+	rootRelInfo = getASTriggerResultRelInfo(mtstate);
+
+	mtstate->mt_persubplan_childparent_maps =
+		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) * nplans);
+
+	/* Get tuple descriptor of the root partition. */
+	outdesc = RelationGetDescr(rootRelInfo->ri_RelationDesc);
+
+	resultRelInfo = mtstate->resultRelInfo;
+	tup_conv_maps = mtstate->mt_persubplan_childparent_maps;
+	for (i = 0; i < nplans; i++)
+	{
+		TupleDesc	indesc = RelationGetDescr(resultRelInfo[i].ri_RelationDesc);
+
+		tup_conv_maps[i] = convert_tuples_by_name(indesc, outdesc,
+												  gettext_noop("could not convert row type"));
+	}
+}
+
+/*
  * Set up the state needed for collecting transition tuples for AFTER
  * triggers.
  */
@@ -1475,6 +1699,11 @@ static void
 ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 {
 	ResultRelInfo *targetRelInfo = getASTriggerResultRelInfo(mtstate);
+	ResultRelInfo **resultRelInfos;
+	int			numResultRelInfos;
+	int			update_rri_index = -1;
+	ResultRelInfo *update_rri = mtstate->resultRelInfo;
+	Oid			cur_reloid = InvalidOid;
 	int			i;
 
 	/* Check for transition tables on the directly targeted relation. */
@@ -1489,71 +1718,108 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 									   RelationGetRelid(targetRelInfo->ri_RelationDesc),
 									   CMD_UPDATE);
 
+	if (mtstate->mt_transition_capture == NULL &&
+		mtstate->mt_oc_transition_capture == NULL)
+		return;
+
 	/*
-	 * If we found that we need to collect transition tuples then we may also
+	 * Now that we know that we need to collect transition tuples, we may also
 	 * need tuple conversion maps for any children that have TupleDescs that
 	 * aren't compatible with the tuplestores.  (We can share these maps
 	 * between the regular and ON CONFLICT cases.)
 	 */
-	if (mtstate->mt_transition_capture != NULL ||
-		mtstate->mt_oc_transition_capture != NULL)
+
+	/* Make sure per-subplan mapping is there. */
+	if (mtstate->operation != CMD_INSERT)
+		ExecSetupPerSubplanChildParentMap(mtstate);
+
+	/*
+	 * Install the conversion map for the first plan for UPDATE and DELETE
+	 * operations.  It will be advanced each time we switch to the next
+	 * plan.  (INSERT operations set it every time, so we need not update
+	 * mtstate->mt_oc_transition_capture here.)
+	 */
+	if (mtstate->mt_transition_capture &&
+		mtstate->mt_persubplan_childparent_maps)
 	{
-		int			numResultRelInfos;
+		mtstate->mt_transition_capture->tcs_map =
+			mtstate->mt_persubplan_childparent_maps[0];
+	}
 
-		numResultRelInfos = (mtstate->mt_partition_tuple_slot != NULL ?
-							 mtstate->mt_num_partitions :
-							 mtstate->mt_nplans);
+	/* If no tuple routing, return without setting up per-leaf-partition map */
+	if (mtstate->mt_partition_dispatch_info == NULL)
+		return;
 
-		/*
-		 * Build array of conversion maps from each child's TupleDesc to the
-		 * one used in the tuplestore.  The map pointers may be NULL when no
-		 * conversion is necessary, which is hopefully a common case for
-		 * partitions.
-		 */
-		mtstate->mt_transition_tupconv_maps = (TupleConversionMap **)
-			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+	numResultRelInfos = mtstate->mt_num_partitions;
+	resultRelInfos = mtstate->mt_partitions;
+
+	/*
+	 * Build array of conversion maps from each child's TupleDesc to the one
+	 * used in the tuplestore.  The map pointers may be NULL when no
+	 * conversion is necessary, which is hopefully a common case for
+	 * partitions.
+	 */
+	mtstate->mt_perleaf_childparent_maps = (TupleConversionMap **)
+		palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+
+	/* For INSERTs, just create all the map elements anew. */
+	if (mtstate->operation == CMD_INSERT)
+	{
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_perleaf_childparent_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
+									   RelationGetDescr(targetRelInfo->ri_RelationDesc),
+									   gettext_noop("could not convert row type"));
+		}
+		return;
+	}
 
-		/* Choose the right set of partitions */
-		if (mtstate->mt_partition_dispatch_info != NULL)
+	/*
+	 * But for UPDATEs, we can share the per-subplan maps with the per-leaf
+	 * maps for those leaf partitions that are also subplan result rels.
+	 */
+	update_rri_index = 0;
+	update_rri = mtstate->resultRelInfo;
+	if (mtstate->mt_nplans > 0)
+		cur_reloid = RelationGetRelid(update_rri[0].ri_RelationDesc);
+
+	for (i = 0; i < numResultRelInfos; ++i)
+	{
+		ResultRelInfo *resultRelInfo = mtstate->mt_partitions[i];
+
+		/* Is this leaf partition present in the UPDATE result rels? */
+		if (cur_reloid == RelationGetRelid(resultRelInfo->ri_RelationDesc))
 		{
+			Assert(update_rri_index < mtstate->mt_nplans);
+
+			mtstate->mt_perleaf_childparent_maps[i] =
+				mtstate->mt_persubplan_childparent_maps[update_rri_index];
+			update_rri_index++;
+
 			/*
-			 * For tuple routing among partitions, we need TupleDescs based
-			 * on the partition routing table.
+			 * If this was the last UPDATE resultrel, indicate that by
+			 * invalidating the cur_reloid.
 			 */
-			ResultRelInfo **resultRelInfos = mtstate->mt_partitions;
-
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
+			if (update_rri_index == mtstate->mt_nplans)
+				cur_reloid = InvalidOid;
+			else
+				cur_reloid = RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc);
 		}
 		else
 		{
-			/* Otherwise we need the ResultRelInfo for each subplan. */
-			ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
-
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
+			mtstate->mt_perleaf_childparent_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfo->ri_RelationDesc),
+									   RelationGetDescr(targetRelInfo->ri_RelationDesc),
+									   gettext_noop("could not convert row type"));
 		}
-
-		/*
-		 * Install the conversion map for the first plan for UPDATE and DELETE
-		 * operations.  It will be advanced each time we switch to the next
-		 * plan.  (INSERT operations set it every time, so we need not update
-		 * mtstate->mt_oc_transition_capture here.)
-		 */
-		if (mtstate->mt_transition_capture)
-			mtstate->mt_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[0];
 	}
+
+	/*
+	 * We should have found all the per-subplan reloids in the leaf
+	 * partitions.
+	 */
+	Assert(update_rri_index == mtstate->mt_nplans);
 }
 
 /* ----------------------------------------------------------------
@@ -1659,15 +1925,15 @@ ExecModifyTable(PlanState *pstate)
 				/* Prepare to convert transition tuples from this child. */
 				if (node->mt_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
+					Assert(node->mt_persubplan_childparent_maps != NULL);
 					node->mt_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						node->mt_persubplan_childparent_maps[node->mt_whichplan];
 				}
 				if (node->mt_oc_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
+					Assert(node->mt_persubplan_childparent_maps != NULL);
 					node->mt_oc_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						node->mt_persubplan_childparent_maps[node->mt_whichplan];
 				}
 				continue;
 			}
@@ -1783,7 +2049,8 @@ ExecModifyTable(PlanState *pstate)
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1828,9 +2095,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ResultRelInfo *resultRelInfo;
 	TupleDesc	tupDesc;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	bool		update_tuple_routing_needed = node->part_cols_updated;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -1903,6 +2173,15 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(resultRelInfo, mtstate->mt_onconflict != ONCONFLICT_NONE);
 
+		/*
+		 * If this is an UPDATE and a BEFORE UPDATE trigger is present, the
+		 * trigger might modify the partition key even when the query itself
+		 * doesn't update any partition key columns, so we may need to do
+		 * update tuple routing.
+		 */
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+			operation == CMD_UPDATE)
+			update_tuple_routing_needed = true;
+
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
 		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
@@ -1940,9 +2219,16 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/* Decide whether we need to perform update tuple routing. */
+	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		update_tuple_routing_needed = false;
+
+	/*
+	 * Build state for tuple routing if it's an INSERT, or if it's an UPDATE
+	 * that may need tuple routing.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || update_tuple_routing_needed))
 	{
 		PartitionDispatch *partition_dispatch_info;
 		ResultRelInfo **partitions;
@@ -1952,6 +2238,9 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(rel,
+									   (operation == CMD_UPDATE ?
+										mtstate->resultRelInfo : NULL),
+									   (operation == CMD_UPDATE ? nplans : 0),
 									   node->nominalRelation,
 									   estate,
 									   &partition_dispatch_info,
@@ -1963,11 +2252,31 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		mtstate->mt_num_dispatch = num_parted;
 		mtstate->mt_partitions = partitions;
 		mtstate->mt_num_partitions = num_partitions;
-		mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
+		mtstate->mt_perleaf_parentchild_maps = partition_tupconv_maps;
 		mtstate->mt_partition_tuple_slot = partition_tuple_slot;
+		mtstate->mt_rootpartition_tuple_slot = MakeTupleTableSlot();
+
+		/*
+		 * These are required as reference rels for mapping partition attnos
+		 * in the WCO and RETURNING expressions.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	}
 
 	/*
+	 * Construct a mapping from each of the per-subplan resultRelInfo attnos
+	 * to the root attnos. This is required during UPDATE row movement
+	 * whenever the tuple descriptor of a source partition does not match the
+	 * root partition's descriptor; in that case we need to convert tuples to
+	 * the root partition tuple descriptor, because the search for the
+	 * destination partition starts from the root. Skip this setup if it's
+	 * not a partition key update.
+	 */
+	if (update_tuple_routing_needed)
+		ExecSetupPerSubplanChildParentMap(mtstate);
+
+	/*
 	 * Build state for collecting transition tuples.  This requires having a
 	 * valid trigger query context, so skip it in explain-only mode.
 	 */
@@ -2004,50 +2313,62 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE row
+	 * movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *firstWco;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition. For UPDATE, in contrast, there are as many WCO lists
+		 * as there are plans. In either case, use the WCO expressions of the
+		 * first resultRelInfo as a reference to compute the attnos for the
+		 * WCO expressions of each of the partitions. We make a copy of the
+		 * WCO qual for each partition. Note that, if there are SubPlans in
+		 * there, they all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
-			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
+		Assert(update_tuple_routing_needed ||
+			   (operation == CMD_INSERT &&
+				list_length(node->withCheckOptionLists) == 1 &&
+				mtstate->mt_nplans == 1));
+
+		firstWco = linitial(node->withCheckOptionLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
 			Relation	partrel;
-			List	   *mapped_wcoList;
+			List	   *mappedWco;
 			List	   *wcoExprs = NIL;
 			ListCell   *ll;
 
 			resultRelInfo = mtstate->mt_partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the UPDATE
+			 * result rels, that result rel would already have its
+			 * WithCheckOptions initialized.
+			 */
+			if (resultRelInfo->ri_WithCheckOptions)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel, NULL);
-			foreach(ll, mapped_wcoList)
+			mappedWco = map_partition_varattnos(firstWco,
+												firstVarno,
+												partrel, firstResultRel,
+												NULL);
+			foreach(ll, mappedWco)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
 
-			resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
+			resultRelInfo->ri_WithCheckOptions = mappedWco;
 			resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
 		}
 	}
@@ -2059,7 +2380,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -2096,22 +2417,35 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *rlist;
 
 			resultRelInfo = mtstate->mt_partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the UPDATE
+			 * result rels, that result rel would already have its
+			 * returningList built.
+			 */
+			if (resultRelInfo->ri_projectReturning)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel, NULL);
+			/*
+			 * Use the RETURNING expressions of the first resultRelInfo as a
+			 * reference to compute the attnos for the RETURNING expressions
+			 * of each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList,
+											firstVarno,
+											partrel, firstResultRel, NULL);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
@@ -2356,6 +2690,7 @@ void
 ExecEndModifyTable(ModifyTableState *node)
 {
 	int			i;
+	CmdType		operation = node->operation;
 
 	/*
 	 * Allow any FDWs to shut down
@@ -2390,11 +2725,23 @@ ExecEndModifyTable(ModifyTableState *node)
 	{
 		ResultRelInfo *resultRelInfo = node->mt_partitions[i];
 
+		/*
+		 * If this result rel is one of the subplan result rels, let
+		 * ExecEndPlan() close it. For INSERTs, this does not apply because
+		 * all leaf partition result rels are newly allocated anyway.
+		 */
+		if (operation == CMD_UPDATE &&
+			resultRelInfo >= node->resultRelInfo &&
+			resultRelInfo < node->resultRelInfo + node->mt_nplans)
+			continue;
+
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
 
-	/* Release the standalone partition tuple descriptor, if any */
+	/* Release the standalone partition tuple descriptors, if any */
+	if (node->mt_rootpartition_tuple_slot)
+		ExecDropSingleTupleTableSlot(node->mt_rootpartition_tuple_slot);
 	if (node->mt_partition_tuple_slot)
 		ExecDropSingleTupleTableSlot(node->mt_partition_tuple_slot);
 
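A side note for reviewers: the conversion-map machinery that ConvertPartitionTupleSlot() and convert_tuples_by_name() rely on matches columns by name, so a partition whose attributes sit in a different physical order than the root's still routes correctly. A rough, self-contained sketch of that idea (the Desc struct and helper names below are invented for illustration, not PostgreSQL types):

```c
#include <string.h>

#define MAXCOLS 8

/* Invented stand-in for a TupleDesc: just column names, in physical order. */
typedef struct
{
	int			natts;
	const char *attnames[MAXCOLS];
} Desc;

/*
 * Build a column map by name, in the spirit of convert_tuples_by_name():
 * for each output column, record the input column bearing the same name.
 * Returns 0 when the descriptors already match position-for-position,
 * mirroring the NULL map (no conversion needed) in the real code.
 */
static int
build_map_by_name(const Desc *in, const Desc *out, int map[MAXCOLS])
{
	int			identical = 1;

	for (int i = 0; i < out->natts; i++)
	{
		map[i] = -1;
		for (int j = 0; j < in->natts; j++)
		{
			if (strcmp(out->attnames[i], in->attnames[j]) == 0)
			{
				map[i] = j;
				break;
			}
		}
		if (map[i] != i)
			identical = 0;
	}
	return identical ? 0 : 1;
}

/* Rough analog of do_convert_tuple(): shuffle values through the map. */
static void
convert_row(const int *map, int natts, const int *in_vals, int *out_vals)
{
	for (int i = 0; i < natts; i++)
		out_vals[i] = in_vals[map[i]];
}
```

So a child with columns (b, a) holding b=10, a=20 converts to the root order (a, b) as (20, 10); that is the kind of reordering the mt_persubplan_childparent_maps entries perform before ExecFindPartition() is consulted.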
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index c1a83ca..d8caa5ac 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -204,6 +204,7 @@ _copyModifyTable(const ModifyTable *from)
 	COPY_SCALAR_FIELD(canSetTag);
 	COPY_SCALAR_FIELD(nominalRelation);
 	COPY_NODE_FIELD(partitioned_rels);
+	COPY_SCALAR_FIELD(part_cols_updated);
 	COPY_NODE_FIELD(resultRelations);
 	COPY_SCALAR_FIELD(resultRelIndex);
 	COPY_SCALAR_FIELD(rootResultRelIndex);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 43d6206..d867b80 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -372,6 +372,7 @@ _outModifyTable(StringInfo str, const ModifyTable *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_INT_FIELD(resultRelIndex);
 	WRITE_INT_FIELD(rootResultRelIndex);
@@ -2100,6 +2101,7 @@ _outModifyTablePath(StringInfo str, const ModifyTablePath *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_NODE_FIELD(subroots);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index ccb6a1f..f6236bf 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1568,6 +1568,7 @@ _readModifyTable(void)
 	READ_BOOL_FIELD(canSetTag);
 	READ_UINT_FIELD(nominalRelation);
 	READ_NODE_FIELD(partitioned_rels);
+	READ_BOOL_FIELD(part_cols_updated);
 	READ_NODE_FIELD(resultRelations);
 	READ_INT_FIELD(resultRelIndex);
 	READ_INT_FIELD(rootResultRelIndex);
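Since the new part_cols_updated flag has to survive a plan-tree serialization round trip, the copyfuncs/outfuncs/readfuncs additions above must stay in lockstep. A toy, self-contained analog of what the WRITE_BOOL_FIELD/READ_BOOL_FIELD pair does (MiniModifyTable and both helpers are invented for the sketch):

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Invented miniature of a plan node carrying the new flag. */
typedef struct
{
	bool		part_cols_updated;
} MiniModifyTable;

/* Rough analog of WRITE_BOOL_FIELD: emit ":fldname true|false". */
static void
out_node(char *buf, size_t buflen, const MiniModifyTable *node)
{
	snprintf(buf, buflen, ":part_cols_updated %s",
			 node->part_cols_updated ? "true" : "false");
}

/* Rough analog of READ_BOOL_FIELD: parse the token back. */
static void
read_node(const char *buf, MiniModifyTable *node)
{
	const char *p = strchr(buf, ' ');

	node->part_cols_updated = (p != NULL && strcmp(p + 1, "true") == 0);
}
```

If any one of the three real functions were missed, the fields following part_cols_updated would be read out of position, which is why _copyModifyTable, _outModifyTable and _readModifyTable all gain the field together.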
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index c802d61..e4e78e5 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -278,6 +278,7 @@ static ProjectSet *make_project_set(List *tlist, Plan *subplan);
 static ModifyTable *make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
@@ -2362,6 +2363,7 @@ create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 							best_path->canSetTag,
 							best_path->nominalRelation,
 							best_path->partitioned_rels,
+							best_path->part_cols_updated,
 							best_path->resultRelations,
 							subplans,
 							best_path->withCheckOptionLists,
@@ -6418,6 +6420,7 @@ static ModifyTable *
 make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam)
@@ -6444,6 +6447,7 @@ make_modifytable(PlannerInfo *root,
 	node->canSetTag = canSetTag;
 	node->nominalRelation = nominalRelation;
 	node->partitioned_rels = partitioned_rels;
+	node->part_cols_updated = part_cols_updated;
 	node->resultRelations = resultRelations;
 	node->resultRelIndex = -1;	/* will be set correctly in setrefs.c */
 	node->rootResultRelIndex = -1;	/* will be set correctly in setrefs.c */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index ecdd728..b37d8a8 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -111,6 +111,10 @@ typedef struct
 /* Local functions */
 static Node *preprocess_expression(PlannerInfo *root, Node *expr, int kind);
 static void preprocess_qual_conditions(PlannerInfo *root, Node *jtnode);
+static void get_all_partition_cols(List *rtables,
+					   Index root_rti,
+					   List *partitioned_rels,
+					   Bitmapset **all_part_cols);
 static void inheritance_planner(PlannerInfo *root);
 static void grouping_planner(PlannerInfo *root, bool inheritance_update,
 				 double tuple_fraction);
@@ -1026,6 +1030,40 @@ preprocess_phv_expression(PlannerInfo *root, Expr *expr)
 }
 
 /*
+ * get_all_partition_cols
+ *	  Get the attribute numbers of the partition key columns of all the
+ *	  given partitioned tables.
+ *
+ * The child partitions' attribute numbers are mapped to the corresponding
+ * attribute numbers of the root partitioned table.
+ */
+static void
+get_all_partition_cols(List *rtables,
+					   Index root_rti,
+					   List *partitioned_rels,
+					   Bitmapset **all_part_cols)
+{
+	ListCell   *lc;
+	Oid			root_relid = getrelid(root_rti, rtables);
+	Relation	root_rel;
+
+	/* The caller must have already locked all the partitioned tables. */
+	root_rel = heap_open(root_relid, NoLock);
+	*all_part_cols = NULL;
+	foreach(lc, partitioned_rels)
+	{
+		Index		rti = lfirst_int(lc);
+		Oid			relid = getrelid(rti, rtables);
+		Relation	part_rel = heap_open(relid, NoLock);
+
+		pull_child_partition_columns(all_part_cols, part_rel, root_rel);
+		heap_close(part_rel, NoLock);
+	}
+
+	heap_close(root_rel, NoLock);
+}
+
+/*
  * inheritance_planner
  *	  Generate Paths in the case where the result relation is an
  *	  inheritance set.
@@ -1070,6 +1108,7 @@ inheritance_planner(PlannerInfo *root)
 	Query	   *parent_parse;
 	Bitmapset  *parent_relids = bms_make_singleton(top_parentRTindex);
 	PlannerInfo **parent_roots = NULL;
+	bool		part_cols_updated = false;
 
 	Assert(parse->commandType != CMD_INSERT);
 
@@ -1140,10 +1179,23 @@ inheritance_planner(PlannerInfo *root)
 	parent_rte = rt_fetch(top_parentRTindex, root->parse->rtable);
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
+		Bitmapset	*all_part_cols = NULL;
+
 		nominalRelation = top_parentRTindex;
 		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
+
+		/*
+		 * Retrieve the partition key columns of all the partitioned tables,
+		 * so as to check whether any of the columns being updated is a
+		 * partition key column of any of them.
+		 */
+		get_all_partition_cols(root->parse->rtable, top_parentRTindex,
+							 partitioned_rels, &all_part_cols);
+
+		if (bms_overlap(all_part_cols, parent_rte->updatedCols))
+			part_cols_updated = true;
 	}
 
 	/*
@@ -1481,6 +1533,7 @@ inheritance_planner(PlannerInfo *root)
 									 parse->canSetTag,
 									 nominalRelation,
 									 partitioned_rels,
+									 part_cols_updated,
 									 resultRelations,
 									 subpaths,
 									 subroots,
@@ -2098,6 +2151,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 										parse->canSetTag,
 										parse->resultRelation,
 										NIL,
+										false,
 										list_make1_int(parse->resultRelation),
 										list_make1(path),
 										list_make1(root),
@@ -6128,6 +6182,10 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
  *		Returns a list of the RT indexes of the partitioned child relations
  *		with rti as the root parent RT index.
  *
+ * If all_part_cols_p is non-NULL, *all_part_cols_p is set to a bitmapset
+ * of all partitioning columns used by the partitioned table or any
+ * descendant.
+ *
  * Note: This function might get called even for range table entries that
  * are not partitioned tables; in such a case, it will simply return NIL.
  */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 2d491eb..8dbc361 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3170,6 +3170,8 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partitioned_rels' is an integer list of RT indexes of non-leaf tables in
  *		the partition tree, if this is an UPDATE/DELETE to a partitioned table.
  *		Otherwise NIL.
+ * 'part_cols_updated' is true if any partitioning columns are being updated,
+ *		either in the named relation or in a descendant partitioned table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
  * 'subpaths' is a list of Path(s) producing source data (one per rel)
  * 'subroots' is a list of PlannerInfo structs (one per rel)
@@ -3183,6 +3185,7 @@ ModifyTablePath *
 create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
@@ -3250,6 +3253,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->canSetTag = canSetTag;
 	pathnode->nominalRelation = nominalRelation;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
+	pathnode->part_cols_updated = part_cols_updated;
 	pathnode->resultRelations = resultRelations;
 	pathnode->subpaths = subpaths;
 	pathnode->subroots = subroots;
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 945ac02..994f6f7 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -82,8 +82,8 @@ extern void check_new_partition_bound(char *relname, Relation parent,
 extern Oid	get_partition_parent(Oid relid);
 extern List *get_qual_from_partbound(Relation rel, Relation parent,
 						PartitionBoundSpec *spec);
-extern List *map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+extern List *map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
@@ -101,6 +101,9 @@ extern int get_partition_for_tuple(PartitionDispatch *pd,
 						EState *estate,
 						PartitionDispatchData **failed_at,
 						TupleTableSlot **failed_slot);
+extern void pull_child_partition_columns(Bitmapset **bitmapset,
+							 Relation rel,
+							 Relation parent);
 extern Oid	get_default_oid_from_partdesc(PartitionDesc partdesc);
 extern Oid	get_default_partition_oid(Oid parentId);
 extern void update_default_partition_oid(Oid parentId, Oid defaultPartId);
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index c4ecf0d..85a2529 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,7 +187,10 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
@@ -207,6 +210,8 @@ extern void EvalPlanQualSetTuple(EPQState *epqstate, Index rti,
 					 HeapTuple tuple);
 extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
 extern void ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionDispatch **pd,
@@ -218,6 +223,8 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot, EState *estate);
 
 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
 extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 01ceeef..b96aecc 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -980,15 +980,32 @@ typedef struct ModifyTableState
 	int			mt_num_partitions;	/* Number of members in the following
 									 * arrays */
 	ResultRelInfo **mt_partitions;	/* Per partition result relation pointers */
-	TupleConversionMap **mt_partition_tupconv_maps;
-	/* Per partition tuple conversion map */
+
+	/*
+	 * Per partition conversion map to convert tuples from root to leaf
+	 * partition
+	 */
+	TupleConversionMap **mt_perleaf_parentchild_maps;
+
+	/*
+	 * Per partition conversion map to convert tuples from leaf partition to
+	 * root
+	 */
+	TupleConversionMap **mt_perleaf_childparent_maps;
+
+	/*
+	 * Per subplan conversion map to convert tuples from leaf partition to
+	 * root partitioned table
+	 */
+	TupleConversionMap **mt_persubplan_childparent_maps;
+
 	TupleTableSlot *mt_partition_tuple_slot;
+	TupleTableSlot *mt_rootpartition_tuple_slot;
+
 	struct TransitionCaptureState *mt_transition_capture;
 	/* controls transition table population for specified operation */
 	struct TransitionCaptureState *mt_oc_transition_capture;
 	/* controls transition table population for INSERT...ON CONFLICT UPDATE */
-	TupleConversionMap **mt_transition_tupconv_maps;
-	/* Per plan/partition tuple conversion */
 } ModifyTableState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index dd74efa..c414755 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -219,6 +219,7 @@ typedef struct ModifyTable
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	int			resultRelIndex; /* index of first resultRel in plan's list */
 	int			rootResultRelIndex; /* index of the partitioned table root */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index e085cef..d19b7f1 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1667,6 +1667,7 @@ typedef struct ModifyTablePath
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	List	   *subpaths;		/* Path(s) producing source data */
 	List	   *subroots;		/* per-target-table PlannerInfos */
@@ -2109,6 +2110,10 @@ typedef struct AppendRelInfo
  * The child_rels list must contain at least one element, because the parent
  * partitioned table is itself counted as a child.
  *
+ * all_part_cols contains all attribute numbers from the parent that are
+ * used as partitioning columns by the parent or some descendant that is
+ * itself partitioned.
+ *
  * These structs are kept in the PlannerInfo node's pcinfo_list.
  */
 typedef struct PartitionedChildRelInfo
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e9ed16a..4b4485f 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -238,6 +238,7 @@ extern ModifyTablePath *create_modifytable_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index cef70b1..a49980b 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,36 +198,367 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+---------------------------
+-- UPDATE with row movement
+---------------------------
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+create table part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+alter table range_parted attach partition part_b_20_b_30 for values from ('b', 20) to ('b', 30);
+create table part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+alter table range_parted attach partition part_b_10_b_20 for values from ('b', 10) to ('b', 20);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) partition by range (d);
+create table part_d_1_15 partition of part_c_100_200 for values from (1) to (15);
+create table part_d_15_20 partition of part_c_100_200 for values from (15) to (20);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+(6 rows)
+
+-- The subplans should be listed in partition bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+             QUERY PLAN              
+-------------------------------------
+ Update on range_parted
+   Update on part_a_1_a_10
+   Update on part_a_10_a_20
+   Update on part_b_1_b_10
+   Update on part_c_1_100
+   Update on part_d_1_15
+   Update on part_d_15_20
+   Update on part_b_20_b_30
+   ->  Seq Scan on part_a_1_a_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_a_10_a_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_1_b_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_1_100
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_1_15
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_15_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_20_b_30
+         Filter: (c > '97'::numeric)
+(22 rows)
+
+-- fail (row movement happens only within the partition subtree):
+update part_c_100_200 set c = c - 20, d = c where c = 105;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (null, 85, b, 15, 105).
+-- fail (no partition key update, so no attempt to move tuple, but "a = 'a'" violates partition constraint enforced by root partition)
+update part_b_10_b_20 set a = 'a';
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (null, 1, 96, 12, a).
+-- success; partition key update, no constraint violation
+update range_parted set d = d - 10 where d > 10;
+-- success; no partition key update, no constraint violation
+update range_parted set e = d;
+-- No row found:
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d | e 
+----------------+---+----+-----+---+---
+ part_a_10_a_20 | a | 10 | 200 | 1 | 1
+ part_a_1_a_10  | a |  1 |   1 | 1 | 1
+ part_d_1_15    | b | 12 | 116 | 1 | 1
+ part_d_1_15    | b | 13 | 117 | 2 | 2
+ part_d_1_15    | b | 15 | 125 | 6 | 6
+ part_d_1_15    | b | 17 | 125 | 9 | 9
+(6 rows)
+
+-- fail (row movement happens only within the partition subtree):
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_d_1_15" violates partition constraint
+DETAIL:  Failing row contains (2, 117, b, 7, 2).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_a_1_a_10 | a |  4 | 200 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+(6 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120, 1, 1).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120, 1, 1).
+-- succeeds, row movement, check option passes
+update upview set a = 'b', b = 15 where b = 4;
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+ part_d_1_15   | b | 15 | 199 | 1 | 1
+(6 rows)
+
+-- cleanup
+drop view upview;
+-- RETURNING with whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+ range_parted  | a | b  | c  | d  | e 
+---------------+---+----+----+----+---
+ (b,15,95,16,) | b | 15 | 95 | 16 | 
+ (b,17,95,19,) | b | 17 | 95 | 19 | 
+(2 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  95 | 16 | 
+ part_c_1_100   | b | 17 |  95 | 19 | 
+(6 rows)
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,110,1,), (b,13,98,2,), (b,15,106,16,), (b,17,106,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 13 |  98 |  2 | 
+ part_d_1_15    | b | 12 | 110 |  1 | 
+ part_d_15_20   | b | 15 | 106 | 16 | 
+ part_d_15_20   | b | 17 | 106 | 19 | 
+(6 rows)
+
+:init_range_parted;
+-- Enabling OLD TABLE capture for both DELETE and UPDATE statement triggers
+-- should not cause deleted rows to be captured twice. Similarly for INSERT
+-- triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,146,1,), (b,13,147,2,), (b,15,155,16,), (b,17,155,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_1_15    | b | 12 | 146 |  1 | 
+ part_d_1_15    | b | 13 | 147 |  2 | 
+ part_d_15_20   | b | 15 | 155 | 16 | 
+ part_d_15_20   | b | 17 | 155 | 19 | 
+(6 rows)
+
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+-- Install BR triggers on child partitions, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d1_15 before update or insert on part_d_1_15
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d15_20 before update or insert on part_d_15_20
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 15 |  98 |  2 | 
+ part_d_1_15    | b | 15 | 110 |  1 | 
+ part_d_15_20   | b | 17 | 106 | 16 | 
+ part_d_15_20   | b | 19 | 106 | 19 | 
+(6 rows)
+
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_1_15    | b | 15 | 146 |  1 | 
+ part_d_1_15    | b | 16 | 147 |  2 | 
+ part_d_15_20   | b | 17 | 155 | 16 | 
+ part_d_15_20   | b | 19 | 155 | 19 | 
+(6 rows)
+
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_d1_15 ON part_d_1_15;
+drop trigger trig_d15_20 ON part_d_15_20;
+drop function func_parted_mod_b();
+-- statement triggers with update row movement
+---------------------------------------------------
+:init_range_parted;
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_1_15
+create trigger d1_delete_trig
+  after delete on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_update_trig
+  after update on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_insert_trig
+  after insert on part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+create trigger d15_delete_trig
+  after delete on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_update_trig
+  after update on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_insert_trig
+  after insert on part_d_15_20 for each statement execute procedure trigfunc();
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+NOTICE:  trigger = parent_update_trig fired on table range_parted during UPDATE
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 150 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  55 | 16 | 
+ part_c_1_100   | b | 17 |  55 | 19 | 
+(6 rows)
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger d1_delete_trig ON part_d_1_15;
+drop trigger d1_update_trig ON part_d_1_15;
+drop trigger d1_insert_trig ON part_d_1_15;
+drop trigger d15_delete_trig ON part_d_15_20;
+drop trigger d15_update_trig ON part_d_15_20;
+drop trigger d15_insert_trig ON part_d_15_20;
+drop table mintab;
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
-                                  Table "public.part_def"
- Column |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description 
---------+---------+-----------+----------+---------+----------+--------------+-------------
- a      | text    |           |          |         | extended |              | 
- b      | integer |           |          |         | plain    |              | 
+                                       Table "public.part_def"
+ Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description 
+--------+-------------------+-----------+----------+---------+----------+--------------+-------------
+ a      | text              |           |          |         | extended |              | 
+ b      | bigint            |           |          |         | plain    |              | 
+ c      | numeric           |           |          |         | main     |              | 
+ d      | integer           |           |          |         | plain    |              | 
+ e      | character varying |           |          |         | extended |              | 
 Partition of: range_parted DEFAULT
-Partition constraint: (NOT (((a = 'a'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'a'::text) AND (b >= 10) AND (b < 20)) OR ((a = 'b'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'b'::text) AND (b >= 10) AND (b < 20))))
+Partition constraint: (NOT (((a = 'a'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'a'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'b'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '20'::bigint) AND (b < '30'::bigint))))
 
 insert into range_parted values ('c', 9);
 -- ok
@@ -235,7 +566,55 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 ERROR:  new row for relation "part_def" violates partition constraint
-DETAIL:  Failing row contains (a, 9).
+DETAIL:  Failing row contains (a, 9, null, null, null).
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from non-default to default partition.
+-- Fail, the default partition is not under part_a_10_a_20.
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+ERROR:  new row for relation "part_a_10_a_20" violates partition constraint
+DETAIL:  Failing row contains (ad, 10, 200, 1, null).
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+ partname | a  | b  |  c  | d  | e 
+----------+----+----+-----+----+---
+ part_def | ad |  1 |   1 |  1 | 
+ part_def | ad | 10 | 200 |  1 | 
+ part_def | bd | 12 |  96 |  1 | 
+ part_def | bd | 13 |  97 |  2 | 
+ part_def | bd | 15 | 105 | 16 | 
+ part_def | bd | 17 | 105 | 19 | 
+ part_def | d  |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
 create table list_parted (
 	a text,
 	b int
@@ -250,6 +629,110 @@ ERROR:  new row for relation "list_default" violates partition constraint
 DETAIL:  Failing row contains (a, 10).
 -- ok
 update list_default set a = 'x' where a = 'd';
--- cleanup
+drop table list_parted;
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+ERROR:  new row for relation "sub_part2" violates partition constraint
+DETAIL:  Failing row contains (2, 10, 2).
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Triggers can cause UPDATE row movement if it modified partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+drop trigger parted_mod_b ON sub_part1 ;
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+drop trigger trig_skip_delete ON sub_part1 ;
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 |  1 | 70
+ list_part1 | 2 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop table non_parted;
+drop function func_parted_mod_b();
 drop table range_parted;
 drop table list_parted;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 66d1fec..0ec5bb2 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,25 +107,229 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+
+---------------------------
+-- UPDATE with row movement
+---------------------------
+
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+create table part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+alter table range_parted attach partition part_b_20_b_30 for values from ('b', 20) to ('b', 30);
+create table part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+alter table range_parted attach partition part_b_10_b_20 for values from ('b', 10) to ('b', 20);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) partition by range (d);
+create table part_d_1_15 partition of part_c_100_200 for values from (1) to (15);
+create table part_d_15_20 partition of part_c_100_200 for values from (15) to (20);
+
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+
+create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_100_200 set c = c - 20, d = c where c = 105;
+-- fail (no partition key update, so no attempt to move tuple, but "a = 'a'" violates partition constraint enforced by root partition)
+update part_b_10_b_20 set a = 'a';
+-- success; partition key update, no constraint violation
+update range_parted set d = d - 10 where d > 10;
+-- success; no partition key update, no constraint violation
+update range_parted set e = d;
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+:show_data;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+:show_data;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+
+:show_data;
+
+-- cleanup
+drop view upview;
+
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+:show_data;
+
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+
+-- Enabling OLD TABLE capture for both DELETE as well as UPDATE stmt triggers
+-- should not cause DELETEd rows to be captured twice. Similar thing for
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d1_15 before update or insert on part_d_1_15
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d15_20 before update or insert on part_d_15_20
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_d1_15 ON part_d_1_15;
+drop trigger trig_d15_20 ON part_d_15_20;
+drop function func_parted_mod_b();
+
+
+-- statement triggers with update row movement
+---------------------------------------------------
+
+:init_range_parted;
+
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_d_1_15
+create trigger d1_delete_trig
+  after delete on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_update_trig
+  after update on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_insert_trig
+  after insert on part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+create trigger d15_delete_trig
+  after delete on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_update_trig
+  after update on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_insert_trig
+  after insert on part_d_15_20 for each statement execute procedure trigfunc();
+
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger d1_delete_trig ON part_d_1_15;
+drop trigger d1_update_trig ON part_d_1_15;
+drop trigger d1_insert_trig ON part_d_1_15;
+drop trigger d15_delete_trig ON part_d_15_20;
+drop trigger d15_update_trig ON part_d_15_20;
+drop trigger d15_insert_trig ON part_d_15_20;
+
+drop table mintab;
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
 
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
 insert into range_parted values ('c', 9);
@@ -134,6 +338,21 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+-- Update row movement from non-default to default partition.
+-- Fail, default partition is not under part_a_10_a_20;
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
 create table list_parted (
 	a text,
 	b int
@@ -148,6 +367,82 @@ update list_default set a = 'a' where a = 'd';
 -- ok
 update list_default set a = 'x' where a = 'd';
 
--- cleanup
+drop table list_parted;
+
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+
+
+-----------
+-- Triggers can cause UPDATE row movement if it modified partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger parted_mod_b ON sub_part1 ;
+
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger trig_skip_delete ON sub_part1 ;
+
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+drop table non_parted;
+
+drop function func_parted_mod_b();
 drop table range_parted;
 drop table list_parted;
#181Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Langote (#179)
1 attachment(s)
Re: UPDATE of partition key

On 16 October 2017 at 08:28, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

+ * the transition tuplestores can be built. Furthermore, if the transition
+ *  capture is happening for UPDATEd rows being moved to another partition due
+ *  partition-key change, then this function is called once when the row is
+ *  deleted (to capture OLD row), and once when the row is inserted to another
+ *  partition (to capture NEW row). This is done separately because DELETE and
+ *  INSERT happen on different tables.

Extra space at the beginning from the 2nd line onwards.

Just observed that the existing comment lines use tab instead of
spaces. I have now used tab for the new comments, instead of the
multiple spaces.

+ (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup
== NULL))))

Is there some reason why a bitwise operator is used here?

That exact condition means that the function is called for transition
capture for updated rows being moved to another partition. In this
scenario, either the oldtup or the newtup is NULL. I wanted to capture
exactly that condition there. I think the bitwise XOR makes it clear
that this is indeed an "either a or b, but not both" condition.

+ * 'update_rri' has the UPDATE per-subplan result rels.

Could you explain why they are being received as input here?

Added the explanation in the comments.

+ * 'perleaf_parentchild_maps' receives an array of TupleConversionMap objects
+ *     with one entry for every leaf partition (required to convert input tuple
+ *     based on the root table's rowtype to a leaf partition's rowtype after
+ *     tuple routing is done)

Could this be named leaf_tupconv_maps, maybe? It perhaps makes clear that
they are maps needed for "tuple conversion". And the other field holding
the reverse map as leaf_rev_tupconv_maps. Either that or use underscores
to separate words, but then it gets too long I guess.

In master branch, now this param is already there with the name
"tup_conv_maps". In the rebased version in the earlier mail, I haven't
again changed it. I think "tup_conv_maps" looks clear enough.

+       tuple = ConvertPartitionTupleSlot(mtstate,
+                       mtstate->mt_perleaf_parentchild_maps[leaf_part_index],
+

The 2nd line here seems to have gone over 80 characters.

ISTM, ConvertPartitionTupleSlot has a very specific job and a bit complex
interface. I guess it could simply have the following interface:

static HeapTuple ConvertPartitionTuple(ModifyTableState *mtstate,
                                       HeapTuple tuple, bool is_update);

And figure out, based on the value of is_update, which map to use and
which slot to set *p_new_slot to (what is now "new_slot" argument).
You're getting mtstate here anyway, which contains all the information you
need here. It seems better to make that (selecting which map and which
slot) part of the function's implementation if we're having this function
at all, imho. Maybe I'm missing some details there, but my point still
remains that we should try to put more logic in that function instead of
having it just do the mechanical tuple conversion.

I tried to see how the interface would look if we do that way. Here is
how the code looks :

static TupleTableSlot *
ConvertPartitionTupleSlot(ModifyTableState *mtstate,
                          bool for_update_tuple_routing,
                          int map_index,
                          HeapTuple *tuple,
                          TupleTableSlot *slot)
{
    TupleConversionMap *map;
    TupleTableSlot *new_slot;

    if (for_update_tuple_routing)
    {
        map = mtstate->mt_persubplan_childparent_maps[map_index];
        new_slot = mtstate->mt_rootpartition_tuple_slot;
    }
    else
    {
        map = mtstate->mt_perleaf_parentchild_maps[map_index];
        new_slot = mtstate->mt_partition_tuple_slot;
    }

    if (!map)
        return slot;

    *tuple = do_convert_tuple(*tuple, map);

    /*
     * Change the partition tuple slot descriptor, as per converted tuple.
     */
    ExecSetSlotDescriptor(new_slot, map->outdesc);
    ExecStoreTuple(*tuple, new_slot, InvalidBuffer, true);

    return new_slot;
}

It looks like the interface does not simplify much, and on top of that,
the function ends up with more lines. Also, the caller anyway has to be
aware whether map_index is an index into the leaf partitions or into the
update subplans. So it is not as though the caller can be unaware of
whether the mapping should be mt_persubplan_childparent_maps or
mt_perleaf_parentchild_maps.

+         * We have already checked partition constraints above, so skip them
+         * below.

How about: ", so skip checking here."?

Ok I have made it this way :
* We have already checked partition constraints above, so skip
* checking them here.

ISTM, the newly introduced logic in ExecSetupTransitionCaptureState() to
try to reuse the per-subplan child-to-parent map as per-leaf
child-to-parent map could be simplified a bit. I mean the following code:

+    /*
+     * But for Updates, we can share the per-subplan maps with the per-leaf
+     * maps.
+     */
+    update_rri_index = 0;
+    update_rri = mtstate->resultRelInfo;
+    if (mtstate->mt_nplans > 0)
+        cur_reloid = RelationGetRelid(update_rri[0].ri_RelationDesc);
-        /* Choose the right set of partitions */
-        if (mtstate->mt_partition_dispatch_info != NULL)
+    for (i = 0; i < numResultRelInfos; ++i)
+    {
<snip>

How about (pseudo-code):

j = 0;
for (i = 0; i < n_leaf_parts; i++)
{
    if (j < n_subplans && leaf_rri[i]->oid == subplan_rri[j]->oid)
    {
        leaf_childparent_map[i] = subplan_childparent_map[j];
        j++;
    }
    else
    {
        leaf_childparent_map[i] = new map
    }
}

I think the above would also be useful in ExecSetupPartitionTupleRouting()
where you've added similar code to try to reuse per-subplan ResultRelInfos.

Did something like that in the attached patch. Please have a look.
After we conclude on that, will do the same for
ExecSetupPartitionTupleRouting() as well.

In ExecInitModifyTable(), can we try to minimize the number of places
where update_tuple_routing_needed is being set. Currently, it's being set
in 3 places:

Will see if we can skip some checks (TODO).

In the following:

ExecSetupPartitionTupleRouting(rel,
+                               (operation == CMD_UPDATE ?
+                                mtstate->resultRelInfo : NULL),
+                               (operation == CMD_UPDATE ? nplans : 0),

Can the second parameter be made to not span two lines? It was a bit hard
for me to see that there are two new parameters.

I think it is safe to just pass mtstate->resultRelInfo. Inside
ExecSetupPartitionTupleRouting() we should anyways check only the
nplans param (and not update_rri) to decide whether it is for insert
or update. So did the same.

+ * Construct mapping from each of the resultRelInfo attnos to the root

Maybe it's odd to say "resultRelInfo attno", because it's really the
underlying partition whose attnos we're talking about as being possibly
different from the root table's attnos.

Changed : resultRelInfo => partition

+ * descriptor. In such case we need to convert tuples to the root

s/In such case/In such a case,/

Done.

By the way, I've seen in a number of places that the patch calls "root
table" a partition. Not just in comments, but also a variable appears to
be given a name which contains rootpartition. I can see only one instance
where root is called a partition in the existing source code, but it seems
to have been introduced only recently:

allpaths.c:1333: * A root partition will already have a

Changed to either this :
root partition => root partitioned table
or this if we have to refer to it too often :
root partition => root

+         * qual for each partition. Note that, if there are SubPlans in
there,
+         * they all end up attached to the one parent Plan node.

The sentence starting with "Note that, " is a bit unclear.

+        Assert(update_tuple_routing_needed ||
+               (operation == CMD_INSERT &&
+                list_length(node->withCheckOptionLists) == 1 &&
+                mtstate->mt_nplans == 1));

The comment I complained about above is perhaps about this Assert.

-            List       *mapped_wcoList;
+            List       *mappedWco;

Not sure why this rename. After this rename, it's now inconsistent with
the code above which handles non-partitioned case, which still calls it
wcoList. Maybe, because you introduced firstWco and then this line:

+ firstWco = linitial(node->withCheckOptionLists);

but note that each member of node->withCheckOptionLists is also a list, so
the original naming. Also, further below, you're assigning mappedWco to
a List * field.

+ resultRelInfo->ri_WithCheckOptions = mappedWco;

Comments on the optimizer changes:

+get_all_partition_cols(List *rtables,

Did you mean rtable?

+        get_all_partition_cols(root->parse->rtable, top_parentRTindex,
+                             partitioned_rels, &all_part_cols);

Two more spaces needed on the 2nd line.

+void
+pull_child_partition_columns(Bitmapset **bitmapset,
+                            Relation rel,
+                            Relation parent)

Nitpick: don't we normally list the output argument(s) at the end? Also,
"bitmapset" could be renamed to something that conveys what it contains?

+       if (partattno != 0)
+           child_keycols =
+               bms_add_member(child_keycols,
+                              partattno -
FirstLowInvalidHeapAttributeNumber);
+   }
+   foreach(lc, partexprs)
+   {

Elsewhere (in quite a few places), we don't iterate over partexprs
separately like this, although I'm not saying it is bad, just different
from other places.

get_all_partition_cols() seems to go over the rtable as many times as
there are partitioned tables in the tree. Is there a way to do this work
somewhere else? Maybe when the partitioned_rels list is built in the
first place. But that would require us to make changes to extract
partition columns in some place (prepunion.c) where it's hard to justify
why it's being done there at all.

+ * If all_part_cols_p is non-NULL, *all_part_cols_p is set to a bitmapset
+ * of all partitioning columns used by the partitioned table or any
+ * descendent.
+ *

Dead comment? Aha, so here's where all_part_cols was being set before...

+ TupleTableSlot *mt_rootpartition_tuple_slot;

I guess I was complaining about this field where you call root a
partition. Maybe, mt_root_tuple_slot would suffice.

Will get back with the above comments (TODO)

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

update-partition-key_v21.patch (application/octet-stream)
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 03cbaa6..86c68af 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -2993,6 +2993,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition where this row satisfies its partition constraint.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3285,9 +3290,22 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</command> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</command> causes a row to move from one
+       partition to another, there is a chance that a concurrent
+       <command>UPDATE</command> or <command>DELETE</command> misses this row.
+       Suppose that, during the row movement, the row is still visible to the
+       concurrent session, which is about to perform an
+       <command>UPDATE</command> or <command>DELETE</command> on the same row.
+       That DML operation can silently miss the row if the first session
+       meanwhile deletes it from the partition as part of its
+       <command>UPDATE</command> row movement. In such a case, the concurrent
+       <command>UPDATE</command> or <command>DELETE</command>, being unaware
+       of the row movement, concludes that the row has just been deleted, so
+       there is nothing to be done for it. In the usual case, where the table
+       is not partitioned or there is no row movement, the second session
+       would instead have identified the newly updated row and carried out the
+       <command>UPDATE</command> or <command>DELETE</command> on this new row
+       version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 0e99aa9..bd57f3f 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there is no such partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</> and
+   <command>INSERT</> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</> or <command>DELETE</> on the same row may miss
+   this row. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index b0e160a..a8b000a 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -154,6 +154,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it will be performed as a
+    <command>DELETE</command> from the original partition followed by
+    <command>INSERT</command> into the new partition. In this case, all
+    row-level <literal>BEFORE</> <command>UPDATE</command> triggers and all
+    row-level <literal>BEFORE</> <command>DELETE</command> triggers are fired
+    on the original partition. Then all row-level <literal>BEFORE</>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</>
+    triggers are concerned, <literal>AFTER</> <command>DELETE</command> and
+    <literal>AFTER</> <command>INSERT</command> triggers are applied; but
+    <literal>AFTER</> <command>UPDATE</command> triggers are not applied
+    because the <command>UPDATE</command> has been converted to a
+    <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 07fdf66..f5dec3c 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1191,7 +1191,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
 
 /*
  * map_partition_varattnos - maps varattno of any Vars in expr from the
- * parent attno to partition attno.
+ * attno's of 'from_rel' partition to the attno's of 'to_rel' partition.
+ * Each rel can be either a leaf partition or a partitioned table.
  *
  * We must allow for cases where physical attnos of a partition can be
  * different from the parent's.
@@ -1204,8 +1205,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
  * are working on Lists, so it's less messy to do the casts internally.
  */
 List *
-map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row)
 {
 	bool		my_found_whole_row = false;
@@ -1214,14 +1215,14 @@ map_partition_varattnos(List *expr, int target_varno,
 	{
 		AttrNumber *part_attnos;
 
-		part_attnos = convert_tuples_by_name_map(RelationGetDescr(partrel),
-												 RelationGetDescr(parent),
+		part_attnos = convert_tuples_by_name_map(RelationGetDescr(to_rel),
+												 RelationGetDescr(from_rel),
 												 gettext_noop("could not convert row type"));
 		expr = (List *) map_variable_attnos((Node *) expr,
-											target_varno, 0,
+											fromrel_varno, 0,
 											part_attnos,
-											RelationGetDescr(parent)->natts,
-											RelationGetForm(partrel)->reltype,
+											RelationGetDescr(from_rel)->natts,
+											RelationGetForm(to_rel)->reltype,
 											&my_found_whole_row);
 	}
 
@@ -2525,6 +2526,77 @@ error_exit:
 }
 
 /*
+ * For each column of rel which is in the partition key or which appears
+ * in an expression which is in the partition key, translate the attribute
+ * number of that column according to the given parent, and add the resulting
+ * column number to the bitmapset, offset as we frequently do by
+ * FirstLowInvalidHeapAttributeNumber.
+ */
+void
+pull_child_partition_columns(Bitmapset **bitmapset,
+							 Relation rel,
+							 Relation parent)
+{
+	PartitionKey key = RelationGetPartitionKey(rel);
+	int16		partnatts = get_partition_natts(key);
+	List	   *partexprs = get_partition_exprs(key);
+	ListCell   *lc;
+	Bitmapset  *child_keycols = NULL;
+	int			i;
+	AttrNumber *map;
+	int			child_keycol = -1;
+
+	/*
+	 * First, compute the complete set of partition columns for this rel. For
+	 * compatibility with the API exposed by pull_varattnos, we offset the
+	 * column numbers by FirstLowInvalidHeapAttributeNumber.
+	 */
+	for (i = 0; i < partnatts; i++)
+	{
+		AttrNumber	partattno = get_partition_col_attnum(key, i);
+
+		if (partattno != 0)
+			child_keycols =
+				bms_add_member(child_keycols,
+							   partattno - FirstLowInvalidHeapAttributeNumber);
+	}
+	foreach(lc, partexprs)
+	{
+		Node	   *expr = (Node *) lfirst(lc);
+
+		pull_varattnos(expr, 1, &child_keycols);
+	}
+
+	/*
+	 * Next, work out how to convert from the attribute numbers for the child
+	 * to the attribute numbers for the parent.
+	 */
+	map =
+		convert_tuples_by_name_map(RelationGetDescr(parent),
+								   RelationGetDescr(rel),
+								   gettext_noop("could not convert row type"));
+
+	/*
+	 * For each child key column we have identified, translate to the
+	 * corresponding parent key column.  Entry 0 in the map array corresponds
+	 * to attribute number 1, which corresponds to a bitmapset entry for 1 -
+	 * FirstLowInvalidHeapAttributeNumber.
+	 */
+	while ((child_keycol = bms_next_member(child_keycols, child_keycol)) >= 0)
+	{
+		int			kc = child_keycol + FirstLowInvalidHeapAttributeNumber;
+
+		Assert(kc > 0 && kc <= RelationGetNumberOfAttributes(rel));
+		*bitmapset =
+			bms_add_member(*bitmapset,
+						   map[kc - 1] - FirstLowInvalidHeapAttributeNumber);
+	}
+
+	/* Release memory. */
+	pfree(map);
+}
+
+/*
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 8006df3..404ce89 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2466,6 +2466,8 @@ CopyFrom(CopyState cstate)
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(cstate->rel,
+									   NULL,
+									   0,
 									   1,
 									   estate,
 									   &partition_dispatch_info,
@@ -2736,7 +2738,7 @@ CopyFrom(CopyState cstate)
 
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr || check_partition_constr)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 8d0345c..13e5ab2 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2854,8 +2854,13 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 	{
 		HeapTuple	trigtuple;
 
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
+		/*
+		 * Note: if the UPDATE is converted into a DELETE+INSERT as part of
+		 * an update-partition-key operation, then this function is also called
+		 * separately for DELETE and INSERT to capture transition table rows.
+		 * In such a case, either the old tuple or the new tuple can be NULL.
+		 */
+		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
 			trigtuple = GetTupleForTrigger(estate,
 										   NULL,
 										   relinfo,
@@ -5414,7 +5419,12 @@ AfterTriggerPendingOnRel(Oid relid)
  *	triggers actually need to be queued.  It is also called after each row,
  *	even if there are no triggers for that event, if there are any AFTER
  *	STATEMENT triggers for the statement which use transition tables, so that
- *	the transition tuplestores can be built.
+ *	the transition tuplestores can be built.  Furthermore, if the transition
+ *	capture is happening for UPDATEd rows being moved to another partition due
+ *	to a partition-key change, then this function is called once when the row is
+ *	deleted (to capture OLD row), and once when the row is inserted to another
+ *	partition (to capture NEW row).  This is done separately because DELETE and
+ *	INSERT happen on different tables.
  *
  *	Transition tuplestores are built now, rather than when events are pulled
  *	off of the queue because AFTER ROW triggers are allowed to select from the
@@ -5463,12 +5473,27 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		bool		update_new_table = transition_capture->tcs_update_new_table;
 		bool		insert_new_table = transition_capture->tcs_insert_new_table;;
 
-		if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_old_table))
+		/*
+		 * For capturing transition tuples for UPDATE events fired during
+		 * partition row movement, either oldtup or newtup can be NULL,
+		 * depending on whether the event is for a row being deleted from the
+		 * old partition or for a row being inserted into the new partition. But
+		 * in any case, oldtup should always be non-NULL for DELETE events, and
+		 * newtup should be non-NULL for INSERT events, because for transition
+		 * capture with partition row movement, INSERT and DELETE events don't
+		 * fire; only UPDATE event is fired.
+		 */
+		Assert(!(event == TRIGGER_EVENT_DELETE && delete_old_table &&
+				 oldtup == NULL));
+		Assert(!(event == TRIGGER_EVENT_INSERT && insert_new_table &&
+				 newtup == NULL));
+
+		if (oldtup != NULL &&
+			((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_old_table)))
 		{
 			Tuplestorestate *old_tuplestore;
 
-			Assert(oldtup != NULL);
 			old_tuplestore = transition_capture->tcs_private->old_tuplestore;
 
 			if (map != NULL)
@@ -5481,12 +5506,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			else
 				tuplestore_puttuple(old_tuplestore, oldtup);
 		}
-		if ((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_new_table))
+		if (newtup != NULL &&
+			((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
+			(event == TRIGGER_EVENT_UPDATE && update_new_table)))
 		{
 			Tuplestorestate *new_tuplestore;
 
-			Assert(newtup != NULL);
 			new_tuplestore = transition_capture->tcs_private->new_tuplestore;
 
 			if (original_insert_tuple != NULL)
@@ -5506,7 +5531,8 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		if (trigdesc == NULL ||
 			(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
 			(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+			(event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
 			return;
 	}
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 9689429..ec9ce44 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -104,9 +104,6 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
 									 int maxfieldlen);
 static void EvalPlanQualStart(EPQState *epqstate, EState *parentestate,
 				  Plan *planTree);
-static void ExecPartitionCheck(ResultRelInfo *resultRelInfo,
-				   TupleTableSlot *slot, EState *estate);
-
 /*
  * Note that GetUpdatedColumns() also exists in commands/trigger.c.  There does
  * not appear to be any good header to put it into, given the structures that
@@ -1849,15 +1846,10 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
 /*
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  */
-static void
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
-	Relation	rel = resultRelInfo->ri_RelationDesc;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	Bitmapset  *modifiedCols;
-	Bitmapset  *insertedCols;
-	Bitmapset  *updatedCols;
 	ExprContext *econtext;
 
 	/*
@@ -1885,52 +1877,66 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
 	 */
-	if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+	return ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext);
+}
+
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
+	/* See the comments in ExecConstraints. */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
+									 gettext_noop("could not convert row type"));
+		if (map != NULL)
 		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
-										 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecSetSlotDescriptor(slot, tupdesc);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+			tuple = do_convert_tuple(tuple, map);
+			ExecSetSlotDescriptor(slot, tupdesc);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("new row for relation \"%s\" violates partition constraint",
-						RelationGetRelationName(orig_rel)),
-				 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+			 errmsg("new row for relation \"%s\" violates partition constraint",
+					RelationGetRelationName(orig_rel)),
+			 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if requested,
+ * checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1938,7 +1944,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2054,8 +2061,9 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 }
 
 
@@ -3239,6 +3247,13 @@ EvalPlanQualEnd(EPQState *epqstate)
  * ExecSetupPartitionTupleRouting - set up information needed during
  * tuple routing for partitioned tables
  *
+ * 'update_rri' has the UPDATE per-subplan result rels. These are re-used
+ *		instead of allocating new ones while generating the array of all leaf
+ *		partition result rels.
+ *
+ * 'num_update_rri' : number of UPDATE per-subplan result rels. For INSERT,
+ *      this is 0.
+ *
  * Output arguments:
  * 'pd' receives an array of PartitionDispatch objects with one entry for
  *		every partitioned table in the partition tree
@@ -3262,6 +3277,8 @@ EvalPlanQualEnd(EPQState *epqstate)
  */
 void
 ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionDispatch **pd,
@@ -3274,7 +3291,9 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
+	ResultRelInfo *leaf_part_arr;
+	ResultRelInfo *cur_update_rri;
+	Oid			cur_reloid = InvalidOid;
 
 	/*
 	 * Get the information about the partition tree after locking all the
@@ -3288,6 +3307,33 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	*tup_conv_maps = (TupleConversionMap **) palloc0(*num_partitions *
 													 sizeof(TupleConversionMap *));
 
+	if (num_update_rri != 0)
+	{
+		/*
+		 * For Updates, if the leaf partition is already present in the
+		 * per-subplan result rels, we re-use that rather than initialize a
+		 * new result rel. The per-subplan resultrels and the resultrels of
+		 * the leaf partitions are both in the same canonical order. So while
+		 * going through the leaf partition oids, we need to keep track of the
+		 * next per-subplan result rel to be looked for in the leaf partition
+		 * resultrels. So, set the position of cur_update_rri to the first
+		 * per-subplan result rel, and then shift it as we find them one by
+		 * one while scanning the leaf partition oids.
+		 */
+		cur_update_rri = update_rri;
+		cur_reloid = RelationGetRelid(cur_update_rri->ri_RelationDesc);
+	}
+	else
+	{
+		/*
+		 * For inserts, we need to create all new result rels, so avoid
+		 * repeated pallocs by allocating memory for all the result rels in
+		 * bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(*num_partitions *
+												  sizeof(ResultRelInfo));
+	}
+
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
 	 * given partition's rowtype.  It is attached to the caller-specified node
@@ -3296,20 +3342,70 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	 */
 	*partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = (ResultRelInfo *) palloc0(*num_partitions *
-											  sizeof(ResultRelInfo));
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (num_update_rri != 0)
+		{
+			/* Is this leaf partition present in the update resultrel ? */
+			if (cur_reloid == leaf_oid)
+			{
+				Assert(cur_update_rri <= update_rri + num_update_rri - 1);
+
+				leaf_part_rri = cur_update_rri;
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required when converting tuple as per root
+				 * partition tuple descriptor. When generating the update
+				 * plans, this was not set.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+
+				cur_update_rri++;
+
+				/*
+				 * If this was the last UPDATE resultrel, indicate that by
+				 * invalidating the cur_reloid.
+				 */
+				if (cur_update_rri == update_rri + num_update_rri)
+					cur_reloid = InvalidOid;
+				else
+					cur_reloid = RelationGetRelid(cur_update_rri->ri_RelationDesc);
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = leaf_part_arr + i;
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in *partitions are eventually
-		 * closed by the caller.
+		 * If we didn't open the partition rel, it means we haven't
+		 * initialized the result rel either.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above including the leaf
+			 * partitions. Note that each of the newly opened relations in
+			 * *partitions are eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri,
+							  partrel,
+							  resultRTindex,
+							  rel,
+							  estate->es_instrument);
+		}
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
@@ -3319,14 +3415,10 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
 													 gettext_noop("could not convert row type"));
 
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  resultRTindex,
-						  rel,
-						  estate->es_instrument);
-
 		/*
-		 * Verify result relation is a valid target for INSERT.
+		 * Verify the result relation is a valid target for an insert
+		 * operation. We do this even for UPDATEs, because the relation is
+		 * being used here for tuple routing, which is effectively an insert.
 		 */
 		CheckValidResultRel(leaf_part_rri, CMD_INSERT);
 
@@ -3342,9 +3434,18 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		(*partitions)[i] = leaf_part_rri++;
+		(*partitions)[i] = leaf_part_rri;
 		i++;
 	}
+
+	/*
+	 * For UPDATE, we should have found all the per-subplan resultrels in the
+	 * leaf partitions; so cur_update_rri should be positioned just next to
+	 * the last per-subplan resultrel.
+	 */
+	Assert(num_update_rri == 0 ||
+		   (cur_reloid == InvalidOid &&
+			cur_update_rri == update_rri + num_update_rri));
 }
 
 /*
@@ -3370,8 +3471,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 	result = get_partition_for_tuple(pd, slot, estate,
 									 &failed_at, &failed_slot);
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index fb538c0..e11f7cb 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -401,7 +401,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -466,7 +466,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 0027d21..98d475e 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -45,6 +45,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -63,6 +64,11 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 bool canSetTag,
 					 TupleTableSlot **returning);
 
+static HeapTuple ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+										   TupleConversionMap *map,
+										   HeapTuple tuple,
+										   TupleTableSlot *new_slot,
+										   TupleTableSlot **p_old_slot);
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
  * target relation's rowtype
@@ -240,6 +246,38 @@ ExecCheckTIDVisible(EState *estate,
 	ReleaseBuffer(buffer);
 }
 
+/*
+ * ConvertPartitionTupleSlot -- convenience function for converting a tuple
+ * and storing it into the tuple slot provided through 'new_slot', which
+ * typically should be one of the dedicated partition tuple slots. Passes the
+ * partition tuple slot back through the output parameter p_old_slot. If no
+ * mapping is present, p_old_slot is left unchanged.
+ *
+ * Returns the converted tuple.
+ */
+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+						  TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot,
+						  TupleTableSlot **p_old_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor, as per converted tuple.
+	 */
+	*p_old_slot = new_slot;
+	Assert(new_slot != NULL);
+	ExecSetSlotDescriptor(new_slot, map->outdesc);
+	ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
 /* ----------------------------------------------------------------
  *		ExecInsert
  *
@@ -265,6 +303,7 @@ ExecInsert(ModifyTableState *mtstate,
 	Oid			newId;
 	List	   *recheckIndexes = NIL;
 	TupleTableSlot *result = NULL;
+	TransitionCaptureState *transition_capture = mtstate->mt_transition_capture;
 
 	/*
 	 * get the heap tuple out of the tuple table slot, making sure we have a
@@ -281,17 +320,49 @@ ExecInsert(ModifyTableState *mtstate,
 	if (mtstate->mt_partition_dispatch_info)
 	{
 		int			leaf_part_index;
-		TupleConversionMap *map;
+		ResultRelInfo *rootResultRelInfo;
+
+		/*
+		 * If the original operation is UPDATE, the root partitioned table
+		 * needs to be fetched from mtstate->rootResultRelInfo.
+		 */
+		rootResultRelInfo = (mtstate->rootResultRelInfo ?
+							 mtstate->rootResultRelInfo : resultRelInfo);
+
+		/*
+		 * If the resultRelInfo is not the root partitioned table (which
+		 * happens for UPDATE), we should convert the tuple into root's tuple
+		 * descriptor, since ExecFindPartition() starts the search from root.
+		 * The tuple conversion map list is in the order of
+		 * mtstate->resultRelInfo[], so to retrieve the one for this resultRel,
+		 * we need to know the position of the resultRel in
+		 * mtstate->resultRelInfo[].
+		 */
+		if (rootResultRelInfo != resultRelInfo &&
+			mtstate->mt_persubplan_childparent_maps != NULL)
+		{
+			int			map_index = resultRelInfo - mtstate->resultRelInfo;
+
+			/* resultRelInfo must be one of the per-subplan result rels. */
+			Assert(resultRelInfo >= mtstate->resultRelInfo &&
+				   resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1);
+
+			tuple = ConvertPartitionTupleSlot(mtstate,
+											  mtstate->mt_persubplan_childparent_maps[map_index],
+											  tuple,
+											  mtstate->mt_rootpartition_tuple_slot,
+											  &slot);
+		}
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
 		 * ExecFindPartition() does not return and errors out instead.
 		 * Otherwise, the returned value is to be used as an index into arrays
-		 * mt_partitions[] and mt_partition_tupconv_maps[] that will get us
+		 * mt_partitions[] and mt_perleaf_parentchild_maps[] that will get us
 		 * the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
-		leaf_part_index = ExecFindPartition(resultRelInfo,
+		leaf_part_index = ExecFindPartition(rootResultRelInfo,
 											mtstate->mt_partition_dispatch_info,
 											slot,
 											estate);
@@ -331,7 +402,7 @@ ExecInsert(ModifyTableState *mtstate,
 				 */
 				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
 				mtstate->mt_transition_capture->tcs_map =
-					mtstate->mt_transition_tupconv_maps[leaf_part_index];
+					mtstate->mt_perleaf_childparent_maps[leaf_part_index];
 			}
 			else
 			{
@@ -345,29 +416,17 @@ ExecInsert(ModifyTableState *mtstate,
 		}
 		if (mtstate->mt_oc_transition_capture != NULL)
 			mtstate->mt_oc_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[leaf_part_index];
+				mtstate->mt_perleaf_childparent_maps[leaf_part_index];
 
 		/*
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = mtstate->mt_partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = mtstate->mt_partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(mtstate,
+										  mtstate->mt_perleaf_parentchild_maps[leaf_part_index],
+										  tuple,
+										  mtstate->mt_partition_tuple_slot,
+										  &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -485,7 +544,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 		/* Check the constraints of the tuple */
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -621,9 +680,31 @@ ExecInsert(ModifyTableState *mtstate,
 		setLastTid(&(tuple->t_self));
 	}
 
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition NEW TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_new_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo, NULL,
+							 NULL,
+							 tuple,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * Now that we have already captured the NEW TABLE row, any AR INSERT
+		 * trigger below should not capture it again; arrange for that.
+		 */
+		transition_capture = NULL;
+	}
+
 	/* AFTER ROW INSERT Triggers */
 	ExecARInsertTriggers(estate, resultRelInfo, tuple, recheckIndexes,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
 	list_free(recheckIndexes);
 
@@ -677,6 +758,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool *delete_skipped,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -684,6 +767,10 @@ ExecDelete(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
+	TransitionCaptureState *transition_capture = mtstate->mt_transition_capture;
+
+	if (delete_skipped)
+		*delete_skipped = true;
 
 	/*
 	 * get information on the (current) result relation
@@ -848,12 +935,39 @@ ldelete:;
 	if (canSetTag)
 		(estate->es_processed)++;
 
+	/* The delete has actually happened, so let the caller know. */
+	if (delete_skipped)
+		*delete_skipped = false;
+
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition OLD TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_old_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo,
+							 tupleid,
+							 oldtuple,
+							 NULL,
+							 NULL,
+							 transition_capture);
+
+		/*
+	 * Now that we have already captured the OLD TABLE row, any AR DELETE
+	 * trigger below should not capture it again; arrange for that.
+		 */
+		transition_capture = NULL;
+	}
+
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -946,6 +1060,7 @@ ExecUpdate(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	TupleConversionMap *saved_tcs_map = NULL;
 
 	/*
 	 * abort the operation if not running transactions
@@ -1042,12 +1157,82 @@ lreplace:;
 								 resultRelInfo, slot, estate);
 
 		/*
+		 * If a partition check fails, try to move the row into the right
+		 * partition.
+		 */
+		if (resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate))
+		{
+			bool		delete_skipped;
+			TupleTableSlot *ret_slot;
+
+			/*
+			 * When an UPDATE is run directly on a leaf partition, we do not
+			 * have partition tuple routing set up. In that case, fail with
+			 * the partition constraint violation error.
+			 */
+			if (mtstate->mt_partition_dispatch_info == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for DELETE. We want to return rows
+			 * from INSERT.
+			 */
+			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
+					   &delete_skipped, false, false);
+
+			/*
+			 * If the DELETE didn't happen for some reason (e.g. a trigger
+			 * prevented it, the row was already deleted by ourselves, or it
+			 * was concurrently deleted by another transaction), then we
+			 * should skip the INSERT as well; otherwise, there would
+			 * effectively be one new row inserted.
+			 *
+			 * For a normal UPDATE, the case where the tuple has been the
+			 * subject of a concurrent UPDATE or DELETE would be handled by
+			 * the EvalPlanQual machinery, but for an UPDATE that we've
+			 * translated into a DELETE from this partition and an INSERT into
+			 * some other partition, that's not available, because CTID chains
+			 * can't span relation boundaries.  We mimic the semantics to a
+			 * limited extent by skipping the INSERT if the DELETE fails to
+			 * find a tuple. This ensures that two concurrent attempts to
+			 * UPDATE the same tuple at the same time can't turn one tuple
+			 * into two, and that an UPDATE of a just-deleted tuple can't
+			 * resurrect it.
+			 */
+			if (delete_skipped)
+				return NULL;
+
+			if (mtstate->mt_transition_capture)
+				saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
+
+			ret_slot = ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+
+			if (mtstate->mt_transition_capture)
+			{
+				/*
+				 * Revert to the transition capture map created for UPDATE;
+				 * otherwise the next UPDATE will incorrectly use the one
+				 * created for INSERT.
+				 */
+				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+				mtstate->mt_transition_capture->tcs_map = saved_tcs_map;
+			}
+			return ret_slot;
+		}
+
+		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We have already checked partition constraints above, so skip
+		 * checking them here.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1468,6 +1653,45 @@ fireASTriggers(ModifyTableState *node)
 }
 
 /*
+ * Set up the per-subplan tuple conversion maps from each child partition to
+ * the root partitioned table. The maps are needed for collecting transition
+ * tuples for AFTER triggers, and for UPDATE row movement.
+ */
+static void
+ExecSetupPerSubplanChildParentMap(ModifyTableState *mtstate)
+{
+	TupleConversionMap **tup_conv_maps;
+	TupleDesc	outdesc;
+	ResultRelInfo *resultRelInfo;
+	ResultRelInfo *rootRelInfo;
+	int			nplans = mtstate->mt_nplans;
+	int			i;
+
+	Assert(mtstate->operation != CMD_INSERT);
+
+	if (mtstate->mt_persubplan_childparent_maps != NULL)
+		return;
+
+	rootRelInfo = getASTriggerResultRelInfo(mtstate);
+
+	mtstate->mt_persubplan_childparent_maps =
+		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) * nplans);
+
+	/* Get tuple descriptor of the root partitioned table. */
+	outdesc = RelationGetDescr(rootRelInfo->ri_RelationDesc);
+
+	resultRelInfo = mtstate->resultRelInfo;
+	tup_conv_maps = mtstate->mt_persubplan_childparent_maps;
+	for (i = 0; i < nplans; i++)
+	{
+		TupleDesc	indesc = RelationGetDescr(resultRelInfo[i].ri_RelationDesc);
+
+		tup_conv_maps[i] = convert_tuples_by_name(indesc, outdesc,
+												  gettext_noop("could not convert row type"));
+	}
+}
+
+/*
  * Set up the state needed for collecting transition tuples for AFTER
  * triggers.
  */
@@ -1475,6 +1699,10 @@ static void
 ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 {
 	ResultRelInfo *targetRelInfo = getASTriggerResultRelInfo(mtstate);
+	ResultRelInfo **resultRelInfos;
+	int			numResultRelInfos;
+	int			update_rri_index = -1;
+	ResultRelInfo *update_rri = mtstate->resultRelInfo;
 	int			i;
 
 	/* Check for transition tables on the directly targeted relation. */
@@ -1489,71 +1717,98 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 									   RelationGetRelid(targetRelInfo->ri_RelationDesc),
 									   CMD_UPDATE);
 
+	if (mtstate->mt_transition_capture == NULL &&
+		mtstate->mt_oc_transition_capture == NULL)
+		return;
+
 	/*
-	 * If we found that we need to collect transition tuples then we may also
+	 * Now that we know that we need to collect transition tuples, we may also
 	 * need tuple conversion maps for any children that have TupleDescs that
 	 * aren't compatible with the tuplestores.  (We can share these maps
 	 * between the regular and ON CONFLICT cases.)
 	 */
-	if (mtstate->mt_transition_capture != NULL ||
-		mtstate->mt_oc_transition_capture != NULL)
+
+	/* Make sure per-subplan mapping is there. */
+	if (mtstate->operation != CMD_INSERT)
+		ExecSetupPerSubplanChildParentMap(mtstate);
+
+	/*
+	 * Install the conversion map for the first plan for UPDATE and DELETE
+	 * operations.  It will be advanced each time we switch to the next
+	 * plan.  (INSERT operations set it every time, so we need not update
+	 * mtstate->mt_oc_transition_capture here.)
+	 */
+	if (mtstate->mt_transition_capture &&
+		mtstate->mt_persubplan_childparent_maps)
 	{
-		int			numResultRelInfos;
+		mtstate->mt_transition_capture->tcs_map =
+			mtstate->mt_persubplan_childparent_maps[0];
+	}
 
-		numResultRelInfos = (mtstate->mt_partition_tuple_slot != NULL ?
-							 mtstate->mt_num_partitions :
-							 mtstate->mt_nplans);
+	/* If no tuple routing, return without setting up per-leaf-partition map */
+	if (mtstate->mt_partition_dispatch_info == NULL)
+		return;
 
-		/*
-		 * Build array of conversion maps from each child's TupleDesc to the
-		 * one used in the tuplestore.  The map pointers may be NULL when no
-		 * conversion is necessary, which is hopefully a common case for
-		 * partitions.
-		 */
-		mtstate->mt_transition_tupconv_maps = (TupleConversionMap **)
-			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+	numResultRelInfos = mtstate->mt_num_partitions;
+	resultRelInfos = mtstate->mt_partitions;
+
+	/*
+	 * Build array of conversion maps from each child's TupleDesc to the one
+	 * used in the tuplestore.  The map pointers may be NULL when no
+	 * conversion is necessary, which is hopefully a common case for
+	 * partitions.
+	 */
+	mtstate->mt_perleaf_childparent_maps = (TupleConversionMap **)
+		palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
 
-		/* Choose the right set of partitions */
-		if (mtstate->mt_partition_dispatch_info != NULL)
+	/* For Inserts, just create all new map elements. */
+	if (mtstate->operation == CMD_INSERT)
+	{
+		for (i = 0; i < numResultRelInfos; ++i)
 		{
-			/*
-			 * For tuple routing among partitions, we need TupleDescs based
-			 * on the partition routing table.
-			 */
-			ResultRelInfo **resultRelInfos = mtstate->mt_partitions;
+			mtstate->mt_perleaf_childparent_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
+									   RelationGetDescr(targetRelInfo->ri_RelationDesc),
+									   gettext_noop("could not convert row type"));
+		}
+		return;
+	}
+
+	/*
+	 * But for Updates, we can share the per-subplan maps with the per-leaf
+	 * maps.
+	 */
+	update_rri_index = 0;
+	update_rri = mtstate->resultRelInfo;
+
+	for (i = 0; i < numResultRelInfos; ++i)
+	{
+		ResultRelInfo *resultRelInfo = mtstate->mt_partitions[i];
+
+		/* Is this leaf partition present among the update result rels? */
+		if (update_rri_index < mtstate->mt_nplans &&
+			RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) ==
+			RelationGetRelid(resultRelInfo->ri_RelationDesc))
+		{
+			mtstate->mt_perleaf_childparent_maps[i] =
+				mtstate->mt_persubplan_childparent_maps[update_rri_index];
+			update_rri_index++;
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
 		}
 		else
 		{
-			/* Otherwise we need the ResultRelInfo for each subplan. */
-			ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
-
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
+			mtstate->mt_perleaf_childparent_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfo->ri_RelationDesc),
+									   RelationGetDescr(targetRelInfo->ri_RelationDesc),
+									   gettext_noop("could not convert row type"));
 		}
-
-		/*
-		 * Install the conversion map for the first plan for UPDATE and DELETE
-		 * operations.  It will be advanced each time we switch to the next
-		 * plan.  (INSERT operations set it every time, so we need not update
-		 * mtstate->mt_oc_transition_capture here.)
-		 */
-		if (mtstate->mt_transition_capture)
-			mtstate->mt_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[0];
 	}
+
+	/*
+	 * We should have found all the per-subplan reloids in the leaf
+	 * partitions.
+	 */
+	Assert(update_rri_index == mtstate->mt_nplans);
 }
 
 /* ----------------------------------------------------------------
@@ -1659,15 +1914,15 @@ ExecModifyTable(PlanState *pstate)
 				/* Prepare to convert transition tuples from this child. */
 				if (node->mt_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
+					Assert(node->mt_persubplan_childparent_maps != NULL);
 					node->mt_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						node->mt_persubplan_childparent_maps[node->mt_whichplan];
 				}
 				if (node->mt_oc_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
+					Assert(node->mt_persubplan_childparent_maps != NULL);
 					node->mt_oc_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						node->mt_persubplan_childparent_maps[node->mt_whichplan];
 				}
 				continue;
 			}
@@ -1783,7 +2038,8 @@ ExecModifyTable(PlanState *pstate)
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1828,9 +2084,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ResultRelInfo *resultRelInfo;
 	TupleDesc	tupDesc;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	bool		update_tuple_routing_needed = node->part_cols_updated;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -1903,6 +2162,15 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(resultRelInfo, mtstate->mt_onconflict != ONCONFLICT_NONE);
 
+		/*
+		 * If this is an UPDATE and a BEFORE UPDATE trigger is present, we may
+		 * need to do update tuple routing.
+		 */
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+			operation == CMD_UPDATE)
+			update_tuple_routing_needed = true;
+
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
 		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
@@ -1940,9 +2208,16 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/* Decide whether we need to perform update tuple routing. */
+	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		update_tuple_routing_needed = false;
+
+	/*
+	 * Build state for tuple routing if it's an INSERT or if it's an UPDATE of
+	 * partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || update_tuple_routing_needed))
 	{
 		PartitionDispatch *partition_dispatch_info;
 		ResultRelInfo **partitions;
@@ -1952,6 +2227,8 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(rel,
+									   mtstate->resultRelInfo,
+									   (operation == CMD_UPDATE ? nplans : 0),
 									   node->nominalRelation,
 									   estate,
 									   &partition_dispatch_info,
@@ -1963,11 +2240,30 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		mtstate->mt_num_dispatch = num_parted;
 		mtstate->mt_partitions = partitions;
 		mtstate->mt_num_partitions = num_partitions;
-		mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
+		mtstate->mt_perleaf_parentchild_maps = partition_tupconv_maps;
 		mtstate->mt_partition_tuple_slot = partition_tuple_slot;
+		mtstate->mt_rootpartition_tuple_slot = MakeTupleTableSlot();
+
+		/*
+		 * The variables below are needed as reference objects for mapping
+		 * partition attno's in expressions such as WCO and RETURNING.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	}
 
 	/*
+	 * Construct mapping from each of the partition attnos to the root attno.
+	 * This is required when during update row movement the tuple descriptor of
+	 * a source partition does not match the root partitioned table descriptor.
+	 * In such a case we need to convert tuples to the root tuple descriptor,
+	 * because the search for destination partition starts from the root.  Skip
+	 * this setup if it's not a partition key update.
+	 */
+	if (update_tuple_routing_needed)
+		ExecSetupPerSubplanChildParentMap(mtstate);
+
+	/*
 	 * Build state for collecting transition tuples.  This requires having a
 	 * valid trigger query context, so skip it in explain-only mode.
 	 */
@@ -2004,50 +2300,62 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE row
+	 * movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *firstWco;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition. Whereas for UPDATE, there are as many WCOs as there are
+		 * plans. So in either case, use the WCO expression of the first
+		 * resultRelInfo as a reference to calculate attno's for the WCO
+		 * expression of each of the partitions. We make a copy of the WCO
+		 * qual for each partition. Note that, if there are SubPlans in there,
+		 * they all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
-			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
+		Assert(update_tuple_routing_needed ||
+			   (operation == CMD_INSERT &&
+				list_length(node->withCheckOptionLists) == 1 &&
+				mtstate->mt_nplans == 1));
+
+		firstWco = linitial(node->withCheckOptionLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
 			Relation	partrel;
-			List	   *mapped_wcoList;
+			List	   *mappedWco;
 			List	   *wcoExprs = NIL;
 			ListCell   *ll;
 
 			resultRelInfo = mtstate->mt_partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the update
+			 * result rels, that result rel would already have WithCheckOptions
+			 * initialized.
+			 */
+			if (resultRelInfo->ri_WithCheckOptions)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel, NULL);
-			foreach(ll, mapped_wcoList)
+			mappedWco = map_partition_varattnos(firstWco,
+												firstVarno,
+												partrel, firstResultRel,
+												NULL);
+			foreach(ll, mappedWco)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
 
-			resultRelInfo->ri_WithCheckOptions = mapped_wcoList;
+			resultRelInfo->ri_WithCheckOptions = mappedWco;
 			resultRelInfo->ri_WithCheckOptionExprs = wcoExprs;
 		}
 	}
@@ -2059,7 +2367,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -2096,22 +2404,35 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *rlist;
 
 			resultRelInfo = mtstate->mt_partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the update
+			 * result rels, that result rel would already have a returningList
+			 * built.
+			 */
+			if (resultRelInfo->ri_projectReturning)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel, NULL);
+			/*
+			 * Use the returning expression of the first resultRelInfo as a
+			 * reference to calculate attno's for the returning expression of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList,
+											firstVarno,
+											partrel, firstResultRel, NULL);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
@@ -2356,6 +2677,7 @@ void
 ExecEndModifyTable(ModifyTableState *node)
 {
 	int			i;
+	CmdType		operation = node->operation;
 
 	/*
 	 * Allow any FDWs to shut down
@@ -2390,11 +2712,23 @@ ExecEndModifyTable(ModifyTableState *node)
 	{
 		ResultRelInfo *resultRelInfo = node->mt_partitions[i];
 
+		/*
+		 * If this result rel is one of the subplan result rels, let
+		 * ExecEndPlan() close it. For INSERTs, this does not apply because
+		 * all leaf partition result rels are newly allocated anyway.
+		 */
+		if (operation == CMD_UPDATE &&
+			resultRelInfo >= node->resultRelInfo &&
+			resultRelInfo < node->resultRelInfo + node->mt_nplans)
+			continue;
+
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
 
-	/* Release the standalone partition tuple descriptor, if any */
+	/* Release the standalone partition tuple descriptors, if any */
+	if (node->mt_rootpartition_tuple_slot)
+		ExecDropSingleTupleTableSlot(node->mt_rootpartition_tuple_slot);
 	if (node->mt_partition_tuple_slot)
 		ExecDropSingleTupleTableSlot(node->mt_partition_tuple_slot);
 
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index c1a83ca..d8caa5ac 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -204,6 +204,7 @@ _copyModifyTable(const ModifyTable *from)
 	COPY_SCALAR_FIELD(canSetTag);
 	COPY_SCALAR_FIELD(nominalRelation);
 	COPY_NODE_FIELD(partitioned_rels);
+	COPY_SCALAR_FIELD(part_cols_updated);
 	COPY_NODE_FIELD(resultRelations);
 	COPY_SCALAR_FIELD(resultRelIndex);
 	COPY_SCALAR_FIELD(rootResultRelIndex);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 43d6206..d867b80 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -372,6 +372,7 @@ _outModifyTable(StringInfo str, const ModifyTable *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_INT_FIELD(resultRelIndex);
 	WRITE_INT_FIELD(rootResultRelIndex);
@@ -2100,6 +2101,7 @@ _outModifyTablePath(StringInfo str, const ModifyTablePath *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_NODE_FIELD(subroots);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index ccb6a1f..f6236bf 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1568,6 +1568,7 @@ _readModifyTable(void)
 	READ_BOOL_FIELD(canSetTag);
 	READ_UINT_FIELD(nominalRelation);
 	READ_NODE_FIELD(partitioned_rels);
+	READ_BOOL_FIELD(part_cols_updated);
 	READ_NODE_FIELD(resultRelations);
 	READ_INT_FIELD(resultRelIndex);
 	READ_INT_FIELD(rootResultRelIndex);
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index c802d61..e4e78e5 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -278,6 +278,7 @@ static ProjectSet *make_project_set(List *tlist, Plan *subplan);
 static ModifyTable *make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
@@ -2362,6 +2363,7 @@ create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 							best_path->canSetTag,
 							best_path->nominalRelation,
 							best_path->partitioned_rels,
+							best_path->part_cols_updated,
 							best_path->resultRelations,
 							subplans,
 							best_path->withCheckOptionLists,
@@ -6418,6 +6420,7 @@ static ModifyTable *
 make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam)
@@ -6444,6 +6447,7 @@ make_modifytable(PlannerInfo *root,
 	node->canSetTag = canSetTag;
 	node->nominalRelation = nominalRelation;
 	node->partitioned_rels = partitioned_rels;
+	node->part_cols_updated = part_cols_updated;
 	node->resultRelations = resultRelations;
 	node->resultRelIndex = -1;	/* will be set correctly in setrefs.c */
 	node->rootResultRelIndex = -1;	/* will be set correctly in setrefs.c */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index ecdd728..b37d8a8 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -111,6 +111,10 @@ typedef struct
 /* Local functions */
 static Node *preprocess_expression(PlannerInfo *root, Node *expr, int kind);
 static void preprocess_qual_conditions(PlannerInfo *root, Node *jtnode);
+static void get_all_partition_cols(List *rtables,
+					   Index root_rti,
+					   List *partitioned_rels,
+					   Bitmapset **all_part_cols);
 static void inheritance_planner(PlannerInfo *root);
 static void grouping_planner(PlannerInfo *root, bool inheritance_update,
 				 double tuple_fraction);
@@ -1026,6 +1030,40 @@ preprocess_phv_expression(PlannerInfo *root, Expr *expr)
 }
 
 /*
+ * get_all_partition_cols
+ *	  Get the attribute numbers of all partition key columns of the given
+ *	  partitioned tables.
+ *
+ * All the child partitions' attribute numbers are converted to those of the
+ * root partitioned table.
+ */
+static void
+get_all_partition_cols(List *rtables,
+					   Index root_rti,
+					   List *partitioned_rels,
+					   Bitmapset **all_part_cols)
+{
+	ListCell   *lc;
+	Oid			root_relid = getrelid(root_rti, rtables);
+	Relation	root_rel;
+
+	/* The caller must have already locked all the partitioned tables. */
+	root_rel = heap_open(root_relid, NoLock);
+	*all_part_cols = NULL;
+	foreach(lc, partitioned_rels)
+	{
+		Index		rti = lfirst_int(lc);
+		Oid			relid = getrelid(rti, rtables);
+		Relation	part_rel = heap_open(relid, NoLock);
+
+		pull_child_partition_columns(all_part_cols, part_rel, root_rel);
+		heap_close(part_rel, NoLock);
+	}
+
+	heap_close(root_rel, NoLock);
+}
+
+/*
  * inheritance_planner
  *	  Generate Paths in the case where the result relation is an
  *	  inheritance set.
@@ -1070,6 +1108,7 @@ inheritance_planner(PlannerInfo *root)
 	Query	   *parent_parse;
 	Bitmapset  *parent_relids = bms_make_singleton(top_parentRTindex);
 	PlannerInfo **parent_roots = NULL;
+	bool		part_cols_updated = false;
 
 	Assert(parse->commandType != CMD_INSERT);
 
@@ -1140,10 +1179,23 @@ inheritance_planner(PlannerInfo *root)
 	parent_rte = rt_fetch(top_parentRTindex, root->parse->rtable);
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
+		Bitmapset	*all_part_cols = NULL;
+
 		nominalRelation = top_parentRTindex;
 		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
+
+		/*
+		 * Retrieve the partition key columns of all the partitioned tables
+		 * in the tree, so as to check whether any of the columns being
+		 * updated is a partition key column.
+		 */
+		get_all_partition_cols(root->parse->rtable, top_parentRTindex,
+							 partitioned_rels, &all_part_cols);
+
+		if (bms_overlap(all_part_cols, parent_rte->updatedCols))
+			part_cols_updated = true;
 	}
 
 	/*
@@ -1481,6 +1533,7 @@ inheritance_planner(PlannerInfo *root)
 									 parse->canSetTag,
 									 nominalRelation,
 									 partitioned_rels,
+									 part_cols_updated,
 									 resultRelations,
 									 subpaths,
 									 subroots,
@@ -2098,6 +2151,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 										parse->canSetTag,
 										parse->resultRelation,
 										NIL,
+										false,
 										list_make1_int(parse->resultRelation),
 										list_make1(path),
 										list_make1(root),
@@ -6128,6 +6182,10 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
  *		Returns a list of the RT indexes of the partitioned child relations
  *		with rti as the root parent RT index.
  *
+ * If all_part_cols_p is non-NULL, *all_part_cols_p is set to a bitmapset
+ * of all partitioning columns used by the partitioned table or any
+ * descendent.
+ *
  * Note: This function might get called even for range table entries that
  * are not partitioned tables; in such a case, it will simply return NIL.
  */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 2d491eb..8dbc361 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3170,6 +3170,8 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partitioned_rels' is an integer list of RT indexes of non-leaf tables in
  *		the partition tree, if this is an UPDATE/DELETE to a partitioned table.
  *		Otherwise NIL.
+ * 'part_cols_updated' is true if any partitioning columns are being updated,
+ *		either from the named relation or a descendent partitioned table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
  * 'subpaths' is a list of Path(s) producing source data (one per rel)
  * 'subroots' is a list of PlannerInfo structs (one per rel)
@@ -3183,6 +3185,7 @@ ModifyTablePath *
 create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
@@ -3250,6 +3253,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->canSetTag = canSetTag;
 	pathnode->nominalRelation = nominalRelation;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
+	pathnode->part_cols_updated = part_cols_updated;
 	pathnode->resultRelations = resultRelations;
 	pathnode->subpaths = subpaths;
 	pathnode->subroots = subroots;
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 945ac02..994f6f7 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -82,8 +82,8 @@ extern void check_new_partition_bound(char *relname, Relation parent,
 extern Oid	get_partition_parent(Oid relid);
 extern List *get_qual_from_partbound(Relation rel, Relation parent,
 						PartitionBoundSpec *spec);
-extern List *map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+extern List *map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
@@ -101,6 +101,9 @@ extern int get_partition_for_tuple(PartitionDispatch *pd,
 						EState *estate,
 						PartitionDispatchData **failed_at,
 						TupleTableSlot **failed_slot);
+extern void pull_child_partition_columns(Bitmapset **bitmapset,
+							 Relation rel,
+							 Relation parent);
 extern Oid	get_default_oid_from_partdesc(PartitionDesc partdesc);
 extern Oid	get_default_partition_oid(Oid parentId);
 extern void update_default_partition_oid(Oid parentId, Oid defaultPartId);
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index c4ecf0d..85a2529 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,7 +187,10 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
@@ -207,6 +210,8 @@ extern void EvalPlanQualSetTuple(EPQState *epqstate, Index rti,
 					 HeapTuple tuple);
 extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
 extern void ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionDispatch **pd,
@@ -218,6 +223,8 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot, EState *estate);
 
 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
 extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 52d3532..1e1eb1e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -980,15 +980,32 @@ typedef struct ModifyTableState
 	int			mt_num_partitions;	/* Number of members in the following
 									 * arrays */
 	ResultRelInfo **mt_partitions;	/* Per partition result relation pointers */
-	TupleConversionMap **mt_partition_tupconv_maps;
-	/* Per partition tuple conversion map */
+
+	/*
+	 * Per partition conversion map to convert tuples from root to leaf
+	 * partition
+	 */
+	TupleConversionMap **mt_perleaf_parentchild_maps;
+
+	/*
+	 * Per partition conversion map to convert tuples from leaf partition to
+	 * root
+	 */
+	TupleConversionMap **mt_perleaf_childparent_maps;
+
+	/*
+	 * Per subplan conversion map to convert tuples from leaf partition to
+	 * root partitioned table
+	 */
+	TupleConversionMap **mt_persubplan_childparent_maps;
+
 	TupleTableSlot *mt_partition_tuple_slot;
+	TupleTableSlot *mt_rootpartition_tuple_slot;
+
 	struct TransitionCaptureState *mt_transition_capture;
 	/* controls transition table population for specified operation */
 	struct TransitionCaptureState *mt_oc_transition_capture;
 	/* controls transition table population for INSERT...ON CONFLICT UPDATE */
-	TupleConversionMap **mt_transition_tupconv_maps;
-	/* Per plan/partition tuple conversion */
 } ModifyTableState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index dd74efa..c414755 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -219,6 +219,7 @@ typedef struct ModifyTable
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	int			resultRelIndex; /* index of first resultRel in plan's list */
 	int			rootResultRelIndex; /* index of the partitioned table root */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index e085cef..d19b7f1 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1667,6 +1667,7 @@ typedef struct ModifyTablePath
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	List	   *subpaths;		/* Path(s) producing source data */
 	List	   *subroots;		/* per-target-table PlannerInfos */
@@ -2109,6 +2110,10 @@ typedef struct AppendRelInfo
  * The child_rels list must contain at least one element, because the parent
  * partitioned table is itself counted as a child.
  *
+ * all_part_cols contains all attribute numbers from the parent that are
+ * used as partitioning columns by the parent or some descendent which is
+ * itself partitioned.
+ *
  * These structs are kept in the PlannerInfo node's pcinfo_list.
  */
 typedef struct PartitionedChildRelInfo
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e9ed16a..4b4485f 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -238,6 +238,7 @@ extern ModifyTablePath *create_modifytable_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index cef70b1..a49980b 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,36 +198,367 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+---------------------------
+-- UPDATE with row movement
+---------------------------
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If the partition key is updated, the row should be moved to the appropriate
+-- partition. Updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+create table part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+alter table range_parted attach partition part_b_20_b_30 for values from ('b', 20) to ('b', 30);
+create table part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+alter table range_parted attach partition part_b_10_b_20 for values from ('b', 10) to ('b', 20);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) partition by range (d);
+create table part_d_1_15 partition of part_c_100_200 for values from (1) to (15);
+create table part_d_15_20 partition of part_c_100_200 for values from (15) to (20);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+(6 rows)
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+             QUERY PLAN              
+-------------------------------------
+ Update on range_parted
+   Update on part_a_1_a_10
+   Update on part_a_10_a_20
+   Update on part_b_1_b_10
+   Update on part_c_1_100
+   Update on part_d_1_15
+   Update on part_d_15_20
+   Update on part_b_20_b_30
+   ->  Seq Scan on part_a_1_a_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_a_10_a_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_1_b_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_1_100
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_1_15
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_15_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_20_b_30
+         Filter: (c > '97'::numeric)
+(22 rows)
+
+-- fail (row movement happens only within the partition subtree):
+update part_c_100_200 set c = c - 20, d = c where c = 105;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (null, 85, b, 15, 105).
+-- fail (no partition key update, so no attempt to move tuple, but "a = 'a'" violates partition constraint enforced by root partition)
+update part_b_10_b_20 set a = 'a';
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (null, 1, 96, 12, a).
+-- success; partition key update, no constraint violation
+update range_parted set d = d - 10 where d > 10;
+-- success; no partition key update, no constraint violation
+update range_parted set e = d;
+-- No row found:
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d | e 
+----------------+---+----+-----+---+---
+ part_a_10_a_20 | a | 10 | 200 | 1 | 1
+ part_a_1_a_10  | a |  1 |   1 | 1 | 1
+ part_d_1_15    | b | 12 | 116 | 1 | 1
+ part_d_1_15    | b | 13 | 117 | 2 | 2
+ part_d_1_15    | b | 15 | 125 | 6 | 6
+ part_d_1_15    | b | 17 | 125 | 9 | 9
+(6 rows)
+
+-- fail (row movement happens only within the partition subtree):
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_d_1_15" violates partition constraint
+DETAIL:  Failing row contains (2, 117, b, 7, 2).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_a_1_a_10 | a |  4 | 200 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+(6 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120, 1, 1).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120, 1, 1).
+-- succeeds, row movement, check option passes
+update upview set a = 'b', b = 15 where b = 4;
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+ part_d_1_15   | b | 15 | 199 | 1 | 1
+(6 rows)
+
+-- cleanup
+drop view upview;
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+ range_parted  | a | b  | c  | d  | e 
+---------------+---+----+----+----+---
+ (b,15,95,16,) | b | 15 | 95 | 16 | 
+ (b,17,95,19,) | b | 17 | 95 | 19 | 
+(2 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  95 | 16 | 
+ part_c_1_100   | b | 17 |  95 | 19 | 
+(6 rows)
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,110,1,), (b,13,98,2,), (b,15,106,16,), (b,17,106,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 13 |  98 |  2 | 
+ part_d_1_15    | b | 12 | 110 |  1 | 
+ part_d_15_20   | b | 15 | 106 | 16 | 
+ part_d_15_20   | b | 17 | 106 | 19 | 
+(6 rows)
+
+:init_range_parted;
+-- Enabling OLD TABLE capture for both DELETE and UPDATE statement triggers
+-- should not cause DELETEd rows to be captured twice. The same applies to
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,146,1,), (b,13,147,2,), (b,15,155,16,), (b,17,155,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_1_15    | b | 12 | 146 |  1 | 
+ part_d_1_15    | b | 13 | 147 |  2 | 
+ part_d_15_20   | b | 15 | 155 | 16 | 
+ part_d_15_20   | b | 17 | 155 | 19 | 
+(6 rows)
+
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d1_15 before update or insert on part_d_1_15
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d15_20 before update or insert on part_d_15_20
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 15 |  98 |  2 | 
+ part_d_1_15    | b | 15 | 110 |  1 | 
+ part_d_15_20   | b | 17 | 106 | 16 | 
+ part_d_15_20   | b | 19 | 106 | 19 | 
+(6 rows)
+
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_1_15    | b | 15 | 146 |  1 | 
+ part_d_1_15    | b | 16 | 147 |  2 | 
+ part_d_15_20   | b | 17 | 155 | 16 | 
+ part_d_15_20   | b | 19 | 155 | 19 | 
+(6 rows)
+
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_d1_15 ON part_d_1_15;
+drop trigger trig_d15_20 ON part_d_15_20;
+drop function func_parted_mod_b();
+-- statement triggers with update row movement
+---------------------------------------------------
+:init_range_parted;
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_1_15
+create trigger d1_delete_trig
+  after delete on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_update_trig
+  after update on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_insert_trig
+  after insert on part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+create trigger d15_delete_trig
+  after delete on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_update_trig
+  after update on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_insert_trig
+  after insert on part_d_15_20 for each statement execute procedure trigfunc();
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+NOTICE:  trigger = parent_update_trig fired on table range_parted during UPDATE
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 150 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  55 | 16 | 
+ part_c_1_100   | b | 17 |  55 | 19 | 
+(6 rows)
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger d1_delete_trig ON part_d_1_15;
+drop trigger d1_update_trig ON part_d_1_15;
+drop trigger d1_insert_trig ON part_d_1_15;
+drop trigger d15_delete_trig ON part_d_15_20;
+drop trigger d15_update_trig ON part_d_15_20;
+drop trigger d15_insert_trig ON part_d_15_20;
+drop table mintab;
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
-                                  Table "public.part_def"
- Column |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description 
---------+---------+-----------+----------+---------+----------+--------------+-------------
- a      | text    |           |          |         | extended |              | 
- b      | integer |           |          |         | plain    |              | 
+                                       Table "public.part_def"
+ Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description 
+--------+-------------------+-----------+----------+---------+----------+--------------+-------------
+ a      | text              |           |          |         | extended |              | 
+ b      | bigint            |           |          |         | plain    |              | 
+ c      | numeric           |           |          |         | main     |              | 
+ d      | integer           |           |          |         | plain    |              | 
+ e      | character varying |           |          |         | extended |              | 
 Partition of: range_parted DEFAULT
-Partition constraint: (NOT (((a = 'a'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'a'::text) AND (b >= 10) AND (b < 20)) OR ((a = 'b'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'b'::text) AND (b >= 10) AND (b < 20))))
+Partition constraint: (NOT (((a = 'a'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'a'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'b'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '20'::bigint) AND (b < '30'::bigint))))
 
 insert into range_parted values ('c', 9);
 -- ok
@@ -235,7 +566,55 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 ERROR:  new row for relation "part_def" violates partition constraint
-DETAIL:  Failing row contains (a, 9).
+DETAIL:  Failing row contains (a, 9, null, null, null).
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from non-default to default partition.
+-- Fail, default partition is not under part_a_10_a_20;
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+ERROR:  new row for relation "part_a_10_a_20" violates partition constraint
+DETAIL:  Failing row contains (ad, 10, 200, 1, null).
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+ partname | a  | b  |  c  | d  | e 
+----------+----+----+-----+----+---
+ part_def | ad |  1 |   1 |  1 | 
+ part_def | ad | 10 | 200 |  1 | 
+ part_def | bd | 12 |  96 |  1 | 
+ part_def | bd | 13 |  97 |  2 | 
+ part_def | bd | 15 | 105 | 16 | 
+ part_def | bd | 17 | 105 | 19 | 
+ part_def | d  |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
 create table list_parted (
 	a text,
 	b int
@@ -250,6 +629,110 @@ ERROR:  new row for relation "list_default" violates partition constraint
 DETAIL:  Failing row contains (a, 10).
 -- ok
 update list_default set a = 'x' where a = 'd';
--- cleanup
+drop table list_parted;
+--------------
+-- UPDATE involving either partition key or non-partition columns,
+-- on partitions with different column ordering, and with
+-- triggers.
+--------------
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+-- Test partition constraint violation when an intermediate ancestor is used
+-- and the constraint is inherited from the upper root.
+update sub_parted set a = 2 where c = 10;
+ERROR:  new row for relation "sub_part2" violates partition constraint
+DETAIL:  Failing row contains (2, 10, 2).
+-- UPDATE that does not modify the partition key of the partitions chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Triggers can cause UPDATE row movement if it modified partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+drop trigger parted_mod_b ON sub_part1 ;
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+drop trigger trig_skip_delete ON sub_part1 ;
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 |  1 | 70
+ list_part1 | 2 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop table non_parted;
+drop function func_parted_mod_b();
 drop table range_parted;
 drop table list_parted;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 66d1fec..0ec5bb2 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,25 +107,229 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+
+---------------------------
+-- UPDATE with row movement
+---------------------------
+
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+create table part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+alter table range_parted attach partition part_b_20_b_30 for values from ('b', 20) to ('b', 30);
+create table part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+alter table range_parted attach partition part_b_10_b_20 for values from ('b', 10) to ('b', 20);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) partition by range (d);
+create table part_d_1_15 partition of part_c_100_200 for values from (1) to (15);
+create table part_d_15_20 partition of part_c_100_200 for values from (15) to (20);
+
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+
+create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_100_200 set c = c - 20, d = c where c = 105;
+-- fail (no partition key update, so no attempt to move tuple, but "a = 'a'" violates partition constraint enforced by root partition)
+update part_b_10_b_20 set a = 'a';
+-- success; partition key update, no constraint violation
+update range_parted set d = d - 10 where d > 10;
+-- success; no partition key update, no constraint violation
+update range_parted set e = d;
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+:show_data;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+:show_data;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+
+:show_data;
+
+-- cleanup
+drop view upview;
+
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+:show_data;
+
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+
+-- Enabling OLD TABLE capture for both DELETE as well as UPDATE stmt triggers
+-- should not cause DELETEd rows to be captured twice. Similar thing for
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d1_15 before update or insert on part_d_1_15
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d15_20 before update or insert on part_d_15_20
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_d1_15 ON part_d_1_15;
+drop trigger trig_d15_20 ON part_d_15_20;
+drop function func_parted_mod_b();
+
+
+-- statement triggers with update row movement
+---------------------------------------------------
+
+:init_range_parted;
+
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_d_1_15
+create trigger d1_delete_trig
+  after delete on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_update_trig
+  after update on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_insert_trig
+  after insert on part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+create trigger d15_delete_trig
+  after delete on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_update_trig
+  after update on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_insert_trig
+  after insert on part_d_15_20 for each statement execute procedure trigfunc();
+
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger d1_delete_trig ON part_d_1_15;
+drop trigger d1_update_trig ON part_d_1_15;
+drop trigger d1_insert_trig ON part_d_1_15;
+drop trigger d15_delete_trig ON part_d_15_20;
+drop trigger d15_update_trig ON part_d_15_20;
+drop trigger d15_insert_trig ON part_d_15_20;
+
+drop table mintab;
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
 
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
 insert into range_parted values ('c', 9);
@@ -134,6 +338,21 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+-- Update row movement from non-default to default partition.
+-- Fail, default partition is not under part_a_10_a_20;
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
 create table list_parted (
 	a text,
 	b int
@@ -148,6 +367,82 @@ update list_default set a = 'a' where a = 'd';
 -- ok
 update list_default set a = 'x' where a = 'd';
 
--- cleanup
+drop table list_parted;
+
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+
+
+-----------
+-- Triggers can cause UPDATE row movement if it modified partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger parted_mod_b ON sub_part1 ;
+
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger trig_skip_delete ON sub_part1 ;
+
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+drop table non_parted;
+
+drop function func_parted_mod_b();
 drop table range_parted;
 drop table list_parted;
#182Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#181)
1 attachment(s)
Re: UPDATE of partition key

Below I have addressed the remaining review comments:

On 16 October 2017 at 08:28, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

In ExecInitModifyTable(), can we try to minimize the number of places
where update_tuple_routing_needed is being set. Currently, it's being set
in 3 places:

I think the way it's done seems ok. For each resultRelInfo,
update_tuple_routing_needed is updated when that resultRel has any of
its partition columns changed. At that point we don't yet have the rel
opened, so we can't check whether that rel is partitioned; hence the
additional check outside of the loop.

+         * qual for each partition. Note that, if there are SubPlans in
there,
+         * they all end up attached to the one parent Plan node.

The sentence starting with "Note that, " is a bit unclear.

+        Assert(update_tuple_routing_needed ||
+               (operation == CMD_INSERT &&
+                list_length(node->withCheckOptionLists) == 1 &&
+                mtstate->mt_nplans == 1));

The comment I complained about above is perhaps about this Assert.

That is an existing comment. On HEAD, the "parent Plan" refers to
mtstate->mt_plans[0]. Now in the patch, for the parent node in
ExecInitQual(), mtstate->ps is passed rather than mt_plans[0]. So the
parent plan refers to this mtstate node.

BTW, the reason I changed the parent node to mtstate->ps is that
other places in that code use mtstate->ps while initializing
expressions:

/*
* Build a projection for each result rel.
*/
resultRelInfo->ri_projectReturning =
ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
resultRelInfo->ri_RelationDesc->rd_att);

...........

/* build DO UPDATE WHERE clause expression */
if (node->onConflictWhere)
{
ExprState *qualexpr;

qualexpr = ExecInitQual((List *) node->onConflictWhere,
&mtstate->ps);
....
}

I think wherever we initialize expressions belonging to a plan, we
should use that plan as the parent. WithCheckOptions are fields of
ModifyTableState.

-            List       *mapped_wcoList;
+            List       *mappedWco;

Not sure why this rename. After this rename, it's now inconsistent with
the code above which handles non-partitioned case, which still calls it
wcoList. Maybe, because you introduced firstWco and then this line:

+ firstWco = linitial(node->withCheckOptionLists);

but note that each member of node->withCheckOptionLists is also a list, so
the original naming. Also, further below, you're assigning mappedWco to
a List * field.

+ resultRelInfo->ri_WithCheckOptions = mappedWco;

Done. Reverted mappedWco to mapped_wcoList. And firstWco to first_wcoList.

Comments on the optimizer changes:

+get_all_partition_cols(List *rtables,

Did you mean rtable?

I did mean rtables. It's a list of rtables.

+        get_all_partition_cols(root->parse->rtable, top_parentRTindex,
+                             partitioned_rels, &all_part_cols);

Two more spaces needed on the 2nd line.

Done.

+void
+pull_child_partition_columns(Bitmapset **bitmapset,
+                            Relation rel,
+                            Relation parent)

Nitpick: don't we normally list the output argument(s) at the end?

Agreed. Done.

Also, "bitmapset" could be renamed to something that conveys what it contains?

Renamed it to partcols

+       if (partattno != 0)
+           child_keycols =
+               bms_add_member(child_keycols,
+                              partattno -
FirstLowInvalidHeapAttributeNumber);
+   }
+   foreach(lc, partexprs)
+   {

Elsewhere (in quite a few places), we don't iterate over partexprs
separately like this, although I'm not saying it is bad, just different
from other places.

I think you are suggesting we do it the way it's done in
is_partition_attr(). Can you please point me to other places where we
do it that way? I couldn't find any.

get_all_partition_cols() seems to go over the rtable as many times as
there are partitioned tables in the tree. Is there a way to do this work
somewhere else? Maybe when the partitioned_rels list is built in the
first place. But that would require us to make changes to extract
partition columns in some place (prepunion.c) where it's hard to justify
why it's being done there at all.

See below ...

+ * If all_part_cols_p is non-NULL, *all_part_cols_p is set to a bitmapset
+ * of all partitioning columns used by the partitioned table or any
+ * descendent.
+ *

Dead comment?

Removed.

Aha, so here's where all_part_cols was being set before...

Yes, we used to have a PartitionedChildRelInfo.all_part_cols field for
that, populated while traversing the partition tree in
expand_inherited_rtentry(). I agreed with Dilip's opinion that this
would add unnecessary processing even when the query is not a DML.
Also, this way we don't need PartitionedChildRelInfo.all_part_cols at
all. For the earlier implementation, see the v18 patch or earlier
versions.
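
To illustrate why partitioning columns from the whole tree matter here
(hypothetical table names), consider a sub-partitioned setup where the
updated column is a partition key only at a lower level:

```sql
create table t (a int, b int) partition by range (a);
create table t1 partition of t for values from (0) to (100)
    partition by range (b);
create table t1_1 partition of t1 for values from (0) to (50);
create table t1_2 partition of t1 for values from (50) to (100);

-- 'b' is not the root's partition key, but updating it can still move
-- a row from t1_1 to t1_2, so 'b' must be collected among the
-- partition columns of the whole tree for row movement to be set up.
insert into t values (1, 10);
update t set b = 60 where b = 10;
```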

+ TupleTableSlot *mt_rootpartition_tuple_slot;

I guess I was complaining about this field where you call root a
partition. Maybe, mt_root_tuple_slot would suffice.

Done.

Attached v22 patch.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

update-partition-key_v22.patch (application/octet-stream)
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 03cbaa6..86c68af 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -2993,6 +2993,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition where this row satisfies its partition constraint.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3285,9 +3290,22 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</command> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</command> causes a row to move from one
+       partition to another, there is a chance that another concurrent
+       <command>UPDATE</command> or <command>DELETE</command> misses this row.
+       Suppose, during the row movement, the row is still visible for the
+       concurrent session, and it is about to do an <command>UPDATE</command>
+       or <command>DELETE</command> operation on the same row. This DML
+       operation can silently miss this row if the row now gets deleted from
+       the partition by the first session as part of its
+       <command>UPDATE</command> row movement. In such case, the concurrent
+       <command>UPDATE</command>/<command>DELETE</command>, being unaware of
+       the row movement, interprets that the row has just been deleted so there
+       is nothing to be done for this row. Whereas, in the usual case where the
+       table is not partitioned, or where there is no row movement, the second
+       session would have identified the newly updated row and carried
+       <command>UPDATE</command>/<command>DELETE</command> on this new row
+       version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 0e99aa9..bd57f3f 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</> and
+   <command>INSERT</> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</> or <command>DELETE</> on the same row may miss
+   this row. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index b0e160a..a8b000a 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -154,6 +154,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it will be performed as a
+    <command>DELETE</command> from the original partition followed by
+    <command>INSERT</command> into the new partition. In this case, all
+    row-level <literal>BEFORE</> <command>UPDATE</command> triggers and all
+    row-level <literal>BEFORE</> <command>DELETE</command> triggers are fired
+    on the original partition. Then all row-level <literal>BEFORE</>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</>
+    triggers are concerned, <literal>AFTER</> <command>DELETE</command> and
+    <literal>AFTER</> <command>INSERT</command> triggers are applied; but
+    <literal>AFTER</> <command>UPDATE</command> triggers are not applied
+    because the <command>UPDATE</command> has been converted to a
+    <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 07fdf66..17cabf6 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1191,7 +1191,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
 
 /*
  * map_partition_varattnos - maps varattno of any Vars in expr from the
- * parent attno to partition attno.
+ * attno's of 'from_rel' partition to the attno's of 'to_rel' partition.
+ * The rels can be both leaf partition or a partitioned table.
  *
  * We must allow for cases where physical attnos of a partition can be
  * different from the parent's.
@@ -1204,8 +1205,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
  * are working on Lists, so it's less messy to do the casts internally.
  */
 List *
-map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row)
 {
 	bool		my_found_whole_row = false;
@@ -1214,14 +1215,14 @@ map_partition_varattnos(List *expr, int target_varno,
 	{
 		AttrNumber *part_attnos;
 
-		part_attnos = convert_tuples_by_name_map(RelationGetDescr(partrel),
-												 RelationGetDescr(parent),
+		part_attnos = convert_tuples_by_name_map(RelationGetDescr(to_rel),
+												 RelationGetDescr(from_rel),
 												 gettext_noop("could not convert row type"));
 		expr = (List *) map_variable_attnos((Node *) expr,
-											target_varno, 0,
+											fromrel_varno, 0,
 											part_attnos,
-											RelationGetDescr(parent)->natts,
-											RelationGetForm(partrel)->reltype,
+											RelationGetDescr(from_rel)->natts,
+											RelationGetForm(to_rel)->reltype,
 											&my_found_whole_row);
 	}
 
@@ -2525,6 +2526,79 @@ error_exit:
 }
 
 /*
+ * pull_child_partition_columns
+ *
+ * For each column of rel which is in the partition key or which appears
+ * in an expression which is in the partition key, translate the attribute
+ * number of that column according to the given parent, and add the resulting
+ * column number to the 'partcols' bitmapset, offset as we frequently do by
+ * FirstLowInvalidHeapAttributeNumber.
+ */
+void
+pull_child_partition_columns(Relation rel,
+							 Relation parent,
+							 Bitmapset **partcols)
+{
+	PartitionKey key = RelationGetPartitionKey(rel);
+	int16		partnatts = get_partition_natts(key);
+	List	   *partexprs = get_partition_exprs(key);
+	ListCell   *lc;
+	Bitmapset  *child_keycols = NULL;
+	int			i;
+	AttrNumber *map;
+	int			child_keycol = -1;
+
+	/*
+	 * First, compute the complete set of partition columns for this rel. For
+	 * compatibility with the API exposed by pull_varattnos, we offset the
+	 * column numbers by FirstLowInvalidHeapAttributeNumber.
+	 */
+	for (i = 0; i < partnatts; i++)
+	{
+		AttrNumber	partattno = get_partition_col_attnum(key, i);
+
+		if (partattno != 0)
+			child_keycols =
+				bms_add_member(child_keycols,
+							   partattno - FirstLowInvalidHeapAttributeNumber);
+	}
+	foreach(lc, partexprs)
+	{
+		Node	   *expr = (Node *) lfirst(lc);
+
+		pull_varattnos(expr, 1, &child_keycols);
+	}
+
+	/*
+	 * Next, work out how to convert from the attribute numbers for the child
+	 * to the attribute numbers for the parent.
+	 */
+	map =
+		convert_tuples_by_name_map(RelationGetDescr(parent),
+								   RelationGetDescr(rel),
+								   gettext_noop("could not convert row type"));
+
+	/*
+	 * For each child key column we have identified, translate to the
+	 * corresponding parent key column.  Entry 0 in the map array corresponds
+	 * to attribute number 1, which corresponds to a bitmapset entry for 1 -
+	 * FirstLowInvalidHeapAttributeNumber.
+	 */
+	while ((child_keycol = bms_next_member(child_keycols, child_keycol)) >= 0)
+	{
+		int			kc = child_keycol + FirstLowInvalidHeapAttributeNumber;
+
+		Assert(kc > 0 && kc <= RelationGetNumberOfAttributes(rel));
+		*partcols =
+			bms_add_member(*partcols,
+						   map[kc - 1] - FirstLowInvalidHeapAttributeNumber);
+	}
+
+	/* Release memory. */
+	pfree(map);
+}
+
+/*
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 8006df3..404ce89 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2466,6 +2466,8 @@ CopyFrom(CopyState cstate)
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(cstate->rel,
+									   NULL,
+									   0,
 									   1,
 									   estate,
 									   &partition_dispatch_info,
@@ -2736,7 +2738,7 @@ CopyFrom(CopyState cstate)
 
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr || check_partition_constr)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 8d0345c..13e5ab2 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2854,8 +2854,13 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 	{
 		HeapTuple	trigtuple;
 
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
+		/*
+		 * Note: if the UPDATE is converted into a DELETE+INSERT as part of an
+		 * update-partition-key operation, this function is also called
+		 * separately for the DELETE and the INSERT, to capture transition
+		 * table rows.  In that case, either the old tuple or the new tuple
+		 * can be NULL.
+		 */
+		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
 			trigtuple = GetTupleForTrigger(estate,
 										   NULL,
 										   relinfo,
@@ -5414,7 +5419,12 @@ AfterTriggerPendingOnRel(Oid relid)
  *	triggers actually need to be queued.  It is also called after each row,
  *	even if there are no triggers for that event, if there are any AFTER
  *	STATEMENT triggers for the statement which use transition tables, so that
- *	the transition tuplestores can be built.
+ *	the transition tuplestores can be built.  Furthermore, if transition
+ *	capture is happening for UPDATEd rows being moved to another partition
+ *	due to a partition-key change, then this function is called once when
+ *	the row is deleted (to capture the OLD row), and once when the row is
+ *	inserted into the new partition (to capture the NEW row).  This is done
+ *	separately because the DELETE and the INSERT happen on different tables.
  *
  *	Transition tuplestores are built now, rather than when events are pulled
  *	off of the queue because AFTER ROW triggers are allowed to select from the
@@ -5463,12 +5473,27 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		bool		update_new_table = transition_capture->tcs_update_new_table;
 		bool		insert_new_table = transition_capture->tcs_insert_new_table;
 
-		if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_old_table))
+		/*
+		 * When capturing transition tuples for UPDATE events fired during
+		 * partition row movement, either oldtup or newtup can be NULL,
+		 * depending on whether the event is for the row being deleted from
+		 * the old partition or for the row being inserted into the new
+		 * partition.  But oldtup must always be non-NULL for DELETE events,
+		 * and newtup non-NULL for INSERT events, because with transition
+		 * capture during partition row movement, separate INSERT and DELETE
+		 * events don't fire; only an UPDATE event is fired.
+		 */
+		Assert(!(event == TRIGGER_EVENT_DELETE && delete_old_table &&
+				 oldtup == NULL));
+		Assert(!(event == TRIGGER_EVENT_INSERT && insert_new_table &&
+				 newtup == NULL));
+
+		if (oldtup != NULL &&
+			((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_old_table)))
 		{
 			Tuplestorestate *old_tuplestore;
 
-			Assert(oldtup != NULL);
 			old_tuplestore = transition_capture->tcs_private->old_tuplestore;
 
 			if (map != NULL)
@@ -5481,12 +5506,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			else
 				tuplestore_puttuple(old_tuplestore, oldtup);
 		}
-		if ((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_new_table))
+		if (newtup != NULL &&
+			((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_new_table)))
 		{
 			Tuplestorestate *new_tuplestore;
 
-			Assert(newtup != NULL);
 			new_tuplestore = transition_capture->tcs_private->new_tuplestore;
 
 			if (original_insert_tuple != NULL)
@@ -5506,7 +5531,8 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		if (trigdesc == NULL ||
 			(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
 			(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+			(event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
 			return;
 	}
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 9689429..ec9ce44 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -104,9 +104,6 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
 									 int maxfieldlen);
 static void EvalPlanQualStart(EPQState *epqstate, EState *parentestate,
 				  Plan *planTree);
-static void ExecPartitionCheck(ResultRelInfo *resultRelInfo,
-				   TupleTableSlot *slot, EState *estate);
-
 /*
  * Note that GetUpdatedColumns() also exists in commands/trigger.c.  There does
  * not appear to be any good header to put it into, given the structures that
@@ -1849,15 +1846,10 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
 /*
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  */
-static void
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
-	Relation	rel = resultRelInfo->ri_RelationDesc;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	Bitmapset  *modifiedCols;
-	Bitmapset  *insertedCols;
-	Bitmapset  *updatedCols;
 	ExprContext *econtext;
 
 	/*
@@ -1885,52 +1877,66 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
 	 */
-	if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+	return ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext);
+}
+
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
+	/* See the comments in ExecConstraints. */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
+									 gettext_noop("could not convert row type"));
+		if (map != NULL)
 		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
-										 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecSetSlotDescriptor(slot, tupdesc);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+			tuple = do_convert_tuple(tuple, map);
+			ExecSetSlotDescriptor(slot, tupdesc);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("new row for relation \"%s\" violates partition constraint",
-						RelationGetRelationName(orig_rel)),
-				 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+			 errmsg("new row for relation \"%s\" violates partition constraint",
+					RelationGetRelationName(orig_rel)),
+			 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if requested,
+ * checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1938,7 +1944,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2054,8 +2061,9 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 }
 
 
@@ -3239,6 +3247,13 @@ EvalPlanQualEnd(EPQState *epqstate)
  * ExecSetupPartitionTupleRouting - set up information needed during
  * tuple routing for partitioned tables
  *
+ * 'update_rri' contains the UPDATE per-subplan result rels. These are
+ *		re-used instead of allocating new ones while generating the array of
+ *		all leaf partition result rels.
+ *
+ * 'num_update_rri' is the number of UPDATE per-subplan result rels. For
+ *		INSERT, this is 0.
+ *
  * Output arguments:
  * 'pd' receives an array of PartitionDispatch objects with one entry for
  *		every partitioned table in the partition tree
@@ -3262,6 +3277,8 @@ EvalPlanQualEnd(EPQState *epqstate)
  */
 void
 ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionDispatch **pd,
@@ -3274,7 +3291,9 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
+	ResultRelInfo *leaf_part_arr;
+	ResultRelInfo *cur_update_rri;
+	Oid			cur_reloid = InvalidOid;
 
 	/*
 	 * Get the information about the partition tree after locking all the
@@ -3288,6 +3307,33 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	*tup_conv_maps = (TupleConversionMap **) palloc0(*num_partitions *
 													 sizeof(TupleConversionMap *));
 
+	if (num_update_rri != 0)
+	{
+		/*
+		 * For UPDATEs, if the leaf partition is already present among the
+		 * per-subplan result rels, we re-use it rather than initialize a new
+		 * result rel. The per-subplan result rels and the leaf partition
+		 * result rels are both in the same canonical order, so while walking
+		 * the leaf partition oids we keep track of the next per-subplan
+		 * result rel to look for: position cur_update_rri at the first
+		 * per-subplan result rel, then advance it as matches are found while
+		 * scanning the leaf partition oids.
+		 */
+		cur_update_rri = update_rri;
+		cur_reloid = RelationGetRelid(cur_update_rri->ri_RelationDesc);
+	}
+	else
+	{
+		/*
+		 * For INSERTs, we need to create all new result rels, so avoid
+		 * repeated pallocs by allocating memory for all of them in bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(*num_partitions *
+												  sizeof(ResultRelInfo));
+	}
+
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
 	 * given partition's rowtype.  It is attached to the caller-specified node
@@ -3296,20 +3342,70 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	 */
 	*partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = (ResultRelInfo *) palloc0(*num_partitions *
-											  sizeof(ResultRelInfo));
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (num_update_rri != 0)
+		{
+			/* Is this leaf partition present among the UPDATE result rels? */
+			if (cur_reloid == leaf_oid)
+			{
+				Assert(cur_update_rri <= update_rri + num_update_rri - 1);
+
+				leaf_part_rri = cur_update_rri;
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required when converting the tuple as per the root
+				 * partition's tuple descriptor; it was not set when the
+				 * update plans were generated.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+
+				cur_update_rri++;
+
+				/*
+				 * If this was the last UPDATE resultrel, indicate that by
+				 * invalidating the cur_reloid.
+				 */
+				if (cur_update_rri == update_rri + num_update_rri)
+					cur_reloid = InvalidOid;
+				else
+					cur_reloid = RelationGetRelid(cur_update_rri->ri_RelationDesc);
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = leaf_part_arr + i;
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in *partitions are eventually
-		 * closed by the caller.
+		 * If we didn't open the partition rel above, it means we haven't
+		 * initialized the result rel either.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above including the leaf
+			 * partitions. Note that each of the newly opened relations in
+			 * *partitions are eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri,
+							  partrel,
+							  resultRTindex,
+							  rel,
+							  estate->es_instrument);
+		}
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
@@ -3319,14 +3415,10 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
 													 gettext_noop("could not convert row type"));
 
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  resultRTindex,
-						  rel,
-						  estate->es_instrument);
-
 		/*
-		 * Verify result relation is a valid target for INSERT.
+		 * Verify the result relation is a valid target for an insert
+		 * operation. Even for UPDATEs, this code runs for tuple routing,
+		 * which performs inserts, so the same check applies.
 		 */
 		CheckValidResultRel(leaf_part_rri, CMD_INSERT);
 
@@ -3342,9 +3434,18 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		(*partitions)[i] = leaf_part_rri++;
+		(*partitions)[i] = leaf_part_rri;
 		i++;
 	}
+
+	/*
+	 * For UPDATE, we should have found all the per-subplan resultrels in the
+	 * leaf partitions; so cur_update_rri should be positioned just next to
+	 * the last per-subplan resultrel.
+	 */
+	Assert(num_update_rri == 0 ||
+		   (cur_reloid == InvalidOid &&
+			cur_update_rri == update_rri + num_update_rri));
 }
 
 /*
@@ -3370,8 +3471,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 	result = get_partition_for_tuple(pd, slot, estate,
 									 &failed_at, &failed_slot);
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index fb538c0..e11f7cb 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -401,7 +401,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -466,7 +466,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 0027d21..d2b456c 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -45,6 +45,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -63,6 +64,11 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 bool canSetTag,
 					 TupleTableSlot **returning);
 
+static HeapTuple ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+										   TupleConversionMap *map,
+										   HeapTuple tuple,
+										   TupleTableSlot *new_slot,
+										   TupleTableSlot **p_old_slot);
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
  * target relation's rowtype
@@ -240,6 +246,38 @@ ExecCheckTIDVisible(EState *estate,
 	ReleaseBuffer(buffer);
 }
 
+/*
+ * ConvertPartitionTupleSlot -- convenience function for converting a tuple
+ * and storing it into the tuple slot provided through 'new_slot', which
+ * typically should be one of the dedicated partition tuple slots. The slot
+ * is passed back through the output param 'p_old_slot'. If no conversion
+ * map is present, 'p_old_slot' is left unchanged.
+ *
+ * Returns the converted tuple.
+ */
+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+						  TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot,
+						  TupleTableSlot **p_old_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor, as per converted tuple.
+	 */
+	*p_old_slot = new_slot;
+	Assert(new_slot != NULL);
+	ExecSetSlotDescriptor(new_slot, map->outdesc);
+	ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
 /* ----------------------------------------------------------------
  *		ExecInsert
  *
@@ -265,6 +303,7 @@ ExecInsert(ModifyTableState *mtstate,
 	Oid			newId;
 	List	   *recheckIndexes = NIL;
 	TupleTableSlot *result = NULL;
+	TransitionCaptureState *transition_capture = mtstate->mt_transition_capture;
 
 	/*
 	 * get the heap tuple out of the tuple table slot, making sure we have a
@@ -281,17 +320,49 @@ ExecInsert(ModifyTableState *mtstate,
 	if (mtstate->mt_partition_dispatch_info)
 	{
 		int			leaf_part_index;
-		TupleConversionMap *map;
+		ResultRelInfo *rootResultRelInfo;
+
+		/*
+		 * If the original operation is UPDATE, the root partitioned table
+		 * needs to be fetched from mtstate->rootResultRelInfo.
+		 */
+		rootResultRelInfo = (mtstate->rootResultRelInfo ?
+							 mtstate->rootResultRelInfo : resultRelInfo);
+
+		/*
+		 * If the resultRelInfo is not the root partitioned table (which
+		 * happens for UPDATE), we should convert the tuple into root's tuple
+		 * descriptor, since ExecFindPartition() starts the search from root.
+		 * The tuple conversion map list is in the order of
+		 * mtstate->resultRelInfo[], so to retrieve the one for this resultRel,
+		 * we need to know the position of the resultRel in
+		 * mtstate->resultRelInfo[].
+		 */
+		if (rootResultRelInfo != resultRelInfo &&
+			mtstate->mt_persubplan_childparent_maps != NULL)
+		{
+			int			map_index = resultRelInfo - mtstate->resultRelInfo;
+
+			/* resultRelInfo must be one of the per-subplan result rels. */
+			Assert(resultRelInfo >= mtstate->resultRelInfo &&
+				   resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1);
+
+			tuple = ConvertPartitionTupleSlot(mtstate,
+											  mtstate->mt_persubplan_childparent_maps[map_index],
+											  tuple,
+											  mtstate->mt_root_tuple_slot,
+											  &slot);
+		}
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
 		 * ExecFindPartition() does not return and errors out instead.
 		 * Otherwise, the returned value is to be used as an index into arrays
-		 * mt_partitions[] and mt_partition_tupconv_maps[] that will get us
+		 * mt_partitions[] and mt_perleaf_parentchild_maps[] that will get us
 		 * the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
-		leaf_part_index = ExecFindPartition(resultRelInfo,
+		leaf_part_index = ExecFindPartition(rootResultRelInfo,
 											mtstate->mt_partition_dispatch_info,
 											slot,
 											estate);
@@ -331,7 +402,7 @@ ExecInsert(ModifyTableState *mtstate,
 				 */
 				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
 				mtstate->mt_transition_capture->tcs_map =
-					mtstate->mt_transition_tupconv_maps[leaf_part_index];
+					mtstate->mt_perleaf_childparent_maps[leaf_part_index];
 			}
 			else
 			{
@@ -345,29 +416,17 @@ ExecInsert(ModifyTableState *mtstate,
 		}
 		if (mtstate->mt_oc_transition_capture != NULL)
 			mtstate->mt_oc_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[leaf_part_index];
+				mtstate->mt_perleaf_childparent_maps[leaf_part_index];
 
 		/*
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = mtstate->mt_partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = mtstate->mt_partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(mtstate,
+										  mtstate->mt_perleaf_parentchild_maps[leaf_part_index],
+										  tuple,
+										  mtstate->mt_partition_tuple_slot,
+										  &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -485,7 +544,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 		/* Check the constraints of the tuple */
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -621,9 +680,31 @@ ExecInsert(ModifyTableState *mtstate,
 		setLastTid(&(tuple->t_self));
 	}
 
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition NEW TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_new_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo, NULL,
+							 NULL,
+							 tuple,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * Now that we have captured the NEW TABLE row, make sure any AR
+		 * INSERT trigger fired below does not capture it again.
+		 */
+		transition_capture = NULL;
+	}
+
 	/* AFTER ROW INSERT Triggers */
 	ExecARInsertTriggers(estate, resultRelInfo, tuple, recheckIndexes,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
 	list_free(recheckIndexes);
 
@@ -677,6 +758,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool *delete_skipped,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -684,6 +767,10 @@ ExecDelete(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
+	TransitionCaptureState *transition_capture = mtstate->mt_transition_capture;
+
+	if (delete_skipped)
+		*delete_skipped = true;
 
 	/*
 	 * get information on the (current) result relation
@@ -848,12 +935,39 @@ ldelete:;
 	if (canSetTag)
 		(estate->es_processed)++;
 
+	/* The delete has actually happened, so inform the caller */
+	if (delete_skipped)
+		*delete_skipped = false;
+
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition OLD TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_old_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo,
+							 tupleid,
+							 oldtuple,
+							 NULL,
+							 NULL,
+							 transition_capture);
+
+		/*
+		 * Now that we have captured the OLD TABLE row, make sure any AR
+		 * DELETE trigger fired below does not capture it again.
+		 */
+		transition_capture = NULL;
+	}
+
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -946,6 +1060,7 @@ ExecUpdate(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	TupleConversionMap *saved_tcs_map = NULL;
 
 	/*
 	 * abort the operation if not running transactions
@@ -1042,12 +1157,82 @@ lreplace:;
 								 resultRelInfo, slot, estate);
 
 		/*
+		 * If a partition check fails, try to move the row into the right
+		 * partition.
+		 */
+		if (resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate))
+		{
+			bool		delete_skipped;
+			TupleTableSlot *ret_slot;
+
+			/*
+			 * When an UPDATE is run directly on a leaf partition, partition
+			 * tuple routing is not set up. In that case, fail with a
+			 * partition constraint violation error.
+			 */
+			if (mtstate->mt_partition_dispatch_info == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for DELETE. We want to return rows
+			 * from INSERT.
+			 */
+			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
+					   &delete_skipped, false, false);
+
+			/*
+			 * If for some reason the DELETE didn't happen (e.g. a trigger
+			 * prevented it, or the row was already deleted by ourselves, or
+			 * it was concurrently deleted by another transaction), then we
+			 * should skip the INSERT as well; otherwise we would effectively
+			 * have inserted one new row.
+			 *
+			 * For a normal UPDATE, the case where the tuple has been the
+			 * subject of a concurrent UPDATE or DELETE would be handled by
+			 * the EvalPlanQual machinery, but for an UPDATE that we've
+			 * translated into a DELETE from this partition and an INSERT into
+			 * some other partition, that's not available, because CTID chains
+			 * can't span relation boundaries.  We mimic the semantics to a
+			 * limited extent by skipping the INSERT if the DELETE fails to
+			 * find a tuple. This ensures that two concurrent attempts to
+			 * UPDATE the same tuple at the same time can't turn one tuple
+			 * into two, and that an UPDATE of a just-deleted tuple can't
+			 * resurrect it.
+			 */
+			if (delete_skipped)
+				return NULL;
+
+			if (mtstate->mt_transition_capture)
+				saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
+
+			ret_slot = ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+
+			if (mtstate->mt_transition_capture)
+			{
+				/*
+				 * Revert to the transition capture map created for UPDATE;
+				 * otherwise the next UPDATE will incorrectly use the one
+				 * created for INSERT.
+				 */
+				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+				mtstate->mt_transition_capture->tcs_map = saved_tcs_map;
+			}
+			return ret_slot;
+		}
+
+		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We have already checked partition constraints above, so skip
+		 * checking them here.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1468,6 +1653,45 @@ fireASTriggers(ModifyTableState *node)
 }
 
 /*
+ * Set up the per-subplan tuple conversion maps from each child partition to
+ * the root partitioned table. The maps are needed for collecting transition
+ * tuples for AFTER triggers, and for UPDATE row movement.
+ */
+static void
+ExecSetupPerSubplanChildParentMap(ModifyTableState *mtstate)
+{
+	TupleConversionMap **tup_conv_maps;
+	TupleDesc	outdesc;
+	ResultRelInfo *resultRelInfo;
+	ResultRelInfo *rootRelInfo;
+	int			nplans = mtstate->mt_nplans;
+	int			i;
+
+	Assert(mtstate->operation != CMD_INSERT);
+
+	if (mtstate->mt_persubplan_childparent_maps != NULL)
+		return;
+
+	rootRelInfo = getASTriggerResultRelInfo(mtstate);
+
+	mtstate->mt_persubplan_childparent_maps =
+		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) * nplans);
+
+	/* Get tuple descriptor of the root partitioned table. */
+	outdesc = RelationGetDescr(rootRelInfo->ri_RelationDesc);
+
+	resultRelInfo = mtstate->resultRelInfo;
+	tup_conv_maps = mtstate->mt_persubplan_childparent_maps;
+	for (i = 0; i < nplans; i++)
+	{
+		TupleDesc	indesc = RelationGetDescr(resultRelInfo[i].ri_RelationDesc);
+
+		tup_conv_maps[i] = convert_tuples_by_name(indesc, outdesc,
+												  gettext_noop("could not convert row type"));
+	}
+}
+
+/*
  * Set up the state needed for collecting transition tuples for AFTER
  * triggers.
  */
@@ -1475,6 +1699,10 @@ static void
 ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 {
 	ResultRelInfo *targetRelInfo = getASTriggerResultRelInfo(mtstate);
+	ResultRelInfo **resultRelInfos;
+	int			numResultRelInfos;
+	int			update_rri_index = -1;
+	ResultRelInfo *update_rri = mtstate->resultRelInfo;
 	int			i;
 
 	/* Check for transition tables on the directly targeted relation. */
@@ -1489,71 +1717,98 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 									   RelationGetRelid(targetRelInfo->ri_RelationDesc),
 									   CMD_UPDATE);
 
+	if (mtstate->mt_transition_capture == NULL &&
+		mtstate->mt_oc_transition_capture == NULL)
+		return;
+
 	/*
-	 * If we found that we need to collect transition tuples then we may also
+	 * Now that we know that we need to collect transition tuples, we may also
 	 * need tuple conversion maps for any children that have TupleDescs that
 	 * aren't compatible with the tuplestores.  (We can share these maps
 	 * between the regular and ON CONFLICT cases.)
 	 */
-	if (mtstate->mt_transition_capture != NULL ||
-		mtstate->mt_oc_transition_capture != NULL)
+
+	/* Make sure per-subplan mapping is there. */
+	if (mtstate->operation != CMD_INSERT)
+		ExecSetupPerSubplanChildParentMap(mtstate);
+
+	/*
+	 * Install the conversion map for the first plan for UPDATE and DELETE
+	 * operations.  It will be advanced each time we switch to the next
+	 * plan.  (INSERT operations set it every time, so we need not update
+	 * mtstate->mt_oc_transition_capture here.)
+	 */
+	if (mtstate->mt_transition_capture &&
+		mtstate->mt_persubplan_childparent_maps)
 	{
-		int			numResultRelInfos;
+		mtstate->mt_transition_capture->tcs_map =
+			mtstate->mt_persubplan_childparent_maps[0];
+	}
 
-		numResultRelInfos = (mtstate->mt_partition_tuple_slot != NULL ?
-							 mtstate->mt_num_partitions :
-							 mtstate->mt_nplans);
+	/* If no tuple routing, return without setting up per-leaf-partition map */
+	if (mtstate->mt_partition_dispatch_info == NULL)
+		return;
 
-		/*
-		 * Build array of conversion maps from each child's TupleDesc to the
-		 * one used in the tuplestore.  The map pointers may be NULL when no
-		 * conversion is necessary, which is hopefully a common case for
-		 * partitions.
-		 */
-		mtstate->mt_transition_tupconv_maps = (TupleConversionMap **)
-			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+	numResultRelInfos = mtstate->mt_num_partitions;
+	resultRelInfos = mtstate->mt_partitions;
+
+	/*
+	 * Build array of conversion maps from each child's TupleDesc to the one
+	 * used in the tuplestore.  The map pointers may be NULL when no
+	 * conversion is necessary, which is hopefully a common case for
+	 * partitions.
+	 */
+	mtstate->mt_perleaf_childparent_maps = (TupleConversionMap **)
+		palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
 
-		/* Choose the right set of partitions */
-		if (mtstate->mt_partition_dispatch_info != NULL)
+	/* For Inserts, just create all new map elements. */
+	if (mtstate->operation == CMD_INSERT)
+	{
+		for (i = 0; i < numResultRelInfos; ++i)
 		{
-			/*
-			 * For tuple routing among partitions, we need TupleDescs based
-			 * on the partition routing table.
-			 */
-			ResultRelInfo **resultRelInfos = mtstate->mt_partitions;
+			mtstate->mt_perleaf_childparent_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
+									   RelationGetDescr(targetRelInfo->ri_RelationDesc),
+									   gettext_noop("could not convert row type"));
+		}
+		return;
+	}
+
+	/*
+	 * But for Updates, we can share the per-subplan maps with the per-leaf
+	 * maps.
+	 */
+	update_rri_index = 0;
+	update_rri = mtstate->resultRelInfo;
+
+	for (i = 0; i < numResultRelInfos; ++i)
+	{
+		ResultRelInfo *resultRelInfo = mtstate->mt_partitions[i];
+
+		/* Is this leaf partition present among the UPDATE result rels? */
+		if (update_rri_index < mtstate->mt_nplans &&
+			RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) ==
+			RelationGetRelid(resultRelInfo->ri_RelationDesc))
+		{
+			mtstate->mt_perleaf_childparent_maps[i] =
+				mtstate->mt_persubplan_childparent_maps[update_rri_index];
+			update_rri_index++;
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
 		}
 		else
 		{
-			/* Otherwise we need the ResultRelInfo for each subplan. */
-			ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
-
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
+			mtstate->mt_perleaf_childparent_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfo->ri_RelationDesc),
+									   RelationGetDescr(targetRelInfo->ri_RelationDesc),
+									   gettext_noop("could not convert row type"));
 		}
-
-		/*
-		 * Install the conversion map for the first plan for UPDATE and DELETE
-		 * operations.  It will be advanced each time we switch to the next
-		 * plan.  (INSERT operations set it every time, so we need not update
-		 * mtstate->mt_oc_transition_capture here.)
-		 */
-		if (mtstate->mt_transition_capture)
-			mtstate->mt_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[0];
 	}
+
+	/*
+	 * We should have found all the per-subplan reloids in the leaf
+	 * partitions.
+	 */
+	Assert(update_rri_index == mtstate->mt_nplans);
 }
 
 /* ----------------------------------------------------------------
@@ -1659,15 +1914,15 @@ ExecModifyTable(PlanState *pstate)
 				/* Prepare to convert transition tuples from this child. */
 				if (node->mt_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
+					Assert(node->mt_persubplan_childparent_maps != NULL);
 					node->mt_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						node->mt_persubplan_childparent_maps[node->mt_whichplan];
 				}
 				if (node->mt_oc_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
+					Assert(node->mt_persubplan_childparent_maps != NULL);
 					node->mt_oc_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						node->mt_persubplan_childparent_maps[node->mt_whichplan];
 				}
 				continue;
 			}
@@ -1783,7 +2038,8 @@ ExecModifyTable(PlanState *pstate)
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1828,9 +2084,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ResultRelInfo *resultRelInfo;
 	TupleDesc	tupDesc;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	bool		update_tuple_routing_needed = node->part_cols_updated;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -1903,6 +2162,15 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(resultRelInfo, mtstate->mt_onconflict != ONCONFLICT_NONE);
 
+		/*
+		 * If this is an UPDATE and a BEFORE UPDATE trigger is present, we may
+		 * need to do update tuple routing.
+		 */
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+			operation == CMD_UPDATE)
+			update_tuple_routing_needed = true;
+
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
 		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
@@ -1940,9 +2208,16 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/* Decide whether we need to perform update tuple routing. */
+	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		update_tuple_routing_needed = false;
+
+	/*
+	 * Build state for tuple routing if it's an INSERT or if it's an UPDATE of
+	 * partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || update_tuple_routing_needed))
 	{
 		PartitionDispatch *partition_dispatch_info;
 		ResultRelInfo **partitions;
@@ -1952,6 +2227,8 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(rel,
+									   mtstate->resultRelInfo,
+									   (operation == CMD_UPDATE ? nplans : 0),
 									   node->nominalRelation,
 									   estate,
 									   &partition_dispatch_info,
@@ -1963,11 +2240,30 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		mtstate->mt_num_dispatch = num_parted;
 		mtstate->mt_partitions = partitions;
 		mtstate->mt_num_partitions = num_partitions;
-		mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
+		mtstate->mt_perleaf_parentchild_maps = partition_tupconv_maps;
 		mtstate->mt_partition_tuple_slot = partition_tuple_slot;
+		mtstate->mt_root_tuple_slot = MakeTupleTableSlot();
+
+		/*
+		 * These are needed as reference objects for mapping partition
+		 * attnos in expressions such as WCO and RETURNING lists.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	}
 
 	/*
+	 * Construct mapping from each of the partition attnos to the root attno.
+	 * This is required when, during UPDATE row movement, the tuple
+	 * descriptor of a source partition does not match that of the root
+	 * partitioned table.  In such a case we need to convert tuples to the
+	 * root tuple descriptor, because the search for the destination
+	 * partition starts from the root.  Skip
+	 * this setup if it's not a partition key update.
+	 */
+	if (update_tuple_routing_needed)
+		ExecSetupPerSubplanChildParentMap(mtstate);
+
+	/*
 	 * Build state for collecting transition tuples.  This requires having a
 	 * valid trigger query context, so skip it in explain-only mode.
 	 */
@@ -2004,26 +2300,29 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE row
+	 * movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *first_wcoList;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition.  For UPDATE, however, there are as many WCO lists as
+		 * there are plans.  In either case, use the WCO expression of the
+		 * first resultRelInfo as a reference to calculate attnos for the WCO
+		 * expression of each of the partitions. We make a copy of the WCO
+		 * qual for each partition. Note that, if there are SubPlans in there,
+		 * they all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
-			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
+		Assert(update_tuple_routing_needed ||
+			   (operation == CMD_INSERT &&
+				list_length(node->withCheckOptionLists) == 1 &&
+				mtstate->mt_nplans == 1));
+
+		first_wcoList = linitial(node->withCheckOptionLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
 			Relation	partrel;
@@ -2032,17 +2331,26 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			ListCell   *ll;
 
 			resultRelInfo = mtstate->mt_partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the update
+			 * result rels, that result rel would already have WithCheckOptions
+			 * initialized.
+			 */
+			if (resultRelInfo->ri_WithCheckOptions)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel, NULL);
+			mapped_wcoList = map_partition_varattnos(first_wcoList,
+													firstVarno,
+													partrel, firstResultRel,
+													NULL);
 			foreach(ll, mapped_wcoList)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
@@ -2059,7 +2367,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -2096,22 +2404,35 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *rlist;
 
 			resultRelInfo = mtstate->mt_partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the update
+			 * result rels, that result rel would already have a returningList
+			 * built.
+			 */
+			if (resultRelInfo->ri_projectReturning)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel, NULL);
+			/*
+			 * Use the returning expression of the first resultRelInfo as a
+			 * reference to calculate attno's for the returning expression of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList,
+											firstVarno,
+											partrel, firstResultRel, NULL);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
@@ -2356,6 +2677,7 @@ void
 ExecEndModifyTable(ModifyTableState *node)
 {
 	int			i;
+	CmdType		operation = node->operation;
 
 	/*
 	 * Allow any FDWs to shut down
@@ -2390,11 +2712,23 @@ ExecEndModifyTable(ModifyTableState *node)
 	{
 		ResultRelInfo *resultRelInfo = node->mt_partitions[i];
 
+		/*
+		 * If this result rel is one of the subplan result rels, let
+		 * ExecEndPlan() close it. For INSERTs, this does not apply because
+		 * all leaf partition result rels are newly allocated anyway.
+		 */
+		if (operation == CMD_UPDATE &&
+			resultRelInfo >= node->resultRelInfo &&
+			resultRelInfo < node->resultRelInfo + node->mt_nplans)
+			continue;
+
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
 
-	/* Release the standalone partition tuple descriptor, if any */
+	/* Release the standalone partition tuple descriptors, if any */
+	if (node->mt_root_tuple_slot)
+		ExecDropSingleTupleTableSlot(node->mt_root_tuple_slot);
 	if (node->mt_partition_tuple_slot)
 		ExecDropSingleTupleTableSlot(node->mt_partition_tuple_slot);
 
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index c1a83ca..d8caa5ac 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -204,6 +204,7 @@ _copyModifyTable(const ModifyTable *from)
 	COPY_SCALAR_FIELD(canSetTag);
 	COPY_SCALAR_FIELD(nominalRelation);
 	COPY_NODE_FIELD(partitioned_rels);
+	COPY_SCALAR_FIELD(part_cols_updated);
 	COPY_NODE_FIELD(resultRelations);
 	COPY_SCALAR_FIELD(resultRelIndex);
 	COPY_SCALAR_FIELD(rootResultRelIndex);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 43d6206..d867b80 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -372,6 +372,7 @@ _outModifyTable(StringInfo str, const ModifyTable *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_INT_FIELD(resultRelIndex);
 	WRITE_INT_FIELD(rootResultRelIndex);
@@ -2100,6 +2101,7 @@ _outModifyTablePath(StringInfo str, const ModifyTablePath *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_NODE_FIELD(subroots);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index ccb6a1f..f6236bf 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1568,6 +1568,7 @@ _readModifyTable(void)
 	READ_BOOL_FIELD(canSetTag);
 	READ_UINT_FIELD(nominalRelation);
 	READ_NODE_FIELD(partitioned_rels);
+	READ_BOOL_FIELD(part_cols_updated);
 	READ_NODE_FIELD(resultRelations);
 	READ_INT_FIELD(resultRelIndex);
 	READ_INT_FIELD(rootResultRelIndex);
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index c802d61..e4e78e5 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -278,6 +278,7 @@ static ProjectSet *make_project_set(List *tlist, Plan *subplan);
 static ModifyTable *make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
@@ -2362,6 +2363,7 @@ create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 							best_path->canSetTag,
 							best_path->nominalRelation,
 							best_path->partitioned_rels,
+							best_path->part_cols_updated,
 							best_path->resultRelations,
 							subplans,
 							best_path->withCheckOptionLists,
@@ -6418,6 +6420,7 @@ static ModifyTable *
 make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam)
@@ -6444,6 +6447,7 @@ make_modifytable(PlannerInfo *root,
 	node->canSetTag = canSetTag;
 	node->nominalRelation = nominalRelation;
 	node->partitioned_rels = partitioned_rels;
+	node->part_cols_updated = part_cols_updated;
 	node->resultRelations = resultRelations;
 	node->resultRelIndex = -1;	/* will be set correctly in setrefs.c */
 	node->rootResultRelIndex = -1;	/* will be set correctly in setrefs.c */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index d58635c..1ed4fa5 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -111,6 +111,10 @@ typedef struct
 /* Local functions */
 static Node *preprocess_expression(PlannerInfo *root, Node *expr, int kind);
 static void preprocess_qual_conditions(PlannerInfo *root, Node *jtnode);
+static void get_all_partition_cols(List *rtables,
+					   Index root_rti,
+					   List *partitioned_rels,
+					   Bitmapset **all_part_cols);
 static void inheritance_planner(PlannerInfo *root);
 static void grouping_planner(PlannerInfo *root, bool inheritance_update,
 				 double tuple_fraction);
@@ -1048,6 +1052,40 @@ preprocess_phv_expression(PlannerInfo *root, Expr *expr)
 }
 
 /*
+ * get_all_partition_cols
+ *	  Get the attribute numbers of all partition key columns of all the
+ *	  partitioned tables.
+ *
+ * Attribute numbers of child partition columns are converted to those of
+ * the root partitioned table.
+ */
+static void
+get_all_partition_cols(List *rtables,
+					   Index root_rti,
+					   List *partitioned_rels,
+					   Bitmapset **all_part_cols)
+{
+	ListCell   *lc;
+	Oid			root_relid = getrelid(root_rti, rtables);
+	Relation	root_rel;
+
+	/* The caller must have already locked all the partitioned tables. */
+	root_rel = heap_open(root_relid, NoLock);
+	*all_part_cols = NULL;
+	foreach(lc, partitioned_rels)
+	{
+		Index		rti = lfirst_int(lc);
+		Oid			relid = getrelid(rti, rtables);
+		Relation	part_rel = heap_open(relid, NoLock);
+
+		pull_child_partition_columns(part_rel, root_rel, all_part_cols);
+		heap_close(part_rel, NoLock);
+	}
+
+	heap_close(root_rel, NoLock);
+}
+
+/*
  * inheritance_planner
  *	  Generate Paths in the case where the result relation is an
  *	  inheritance set.
@@ -1092,6 +1130,7 @@ inheritance_planner(PlannerInfo *root)
 	Query	   *parent_parse;
 	Bitmapset  *parent_relids = bms_make_singleton(top_parentRTindex);
 	PlannerInfo **parent_roots = NULL;
+	bool		part_cols_updated = false;
 
 	Assert(parse->commandType != CMD_INSERT);
 
@@ -1162,10 +1201,23 @@ inheritance_planner(PlannerInfo *root)
 	parent_rte = rt_fetch(top_parentRTindex, root->parse->rtable);
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
+		Bitmapset	*all_part_cols = NULL;
+
 		nominalRelation = top_parentRTindex;
 		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
+
+		/*
+		 * Retrieve the partition key columns of all the partitioned tables,
+		 * so that we can check whether any of the columns being updated is
+		 * a partition key of any of the partitioned tables.
+		 */
+		get_all_partition_cols(root->parse->rtable, top_parentRTindex,
+							   partitioned_rels, &all_part_cols);
+
+		if (bms_overlap(all_part_cols, parent_rte->updatedCols))
+			part_cols_updated = true;
 	}
 
 	/*
@@ -1503,6 +1555,7 @@ inheritance_planner(PlannerInfo *root)
 									 parse->canSetTag,
 									 nominalRelation,
 									 partitioned_rels,
+									 part_cols_updated,
 									 resultRelations,
 									 subpaths,
 									 subroots,
@@ -2120,6 +2173,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 										parse->canSetTag,
 										parse->resultRelation,
 										NIL,
+										false,
 										list_make1_int(parse->resultRelation),
 										list_make1(path),
 										list_make1(root),
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 2d491eb..8dbc361 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3170,6 +3170,8 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partitioned_rels' is an integer list of RT indexes of non-leaf tables in
  *		the partition tree, if this is an UPDATE/DELETE to a partitioned table.
  *		Otherwise NIL.
+ * 'part_cols_updated' is true if any partitioning columns are being updated,
+ *		either in the named relation or in a descendant partitioned table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
  * 'subpaths' is a list of Path(s) producing source data (one per rel)
  * 'subroots' is a list of PlannerInfo structs (one per rel)
@@ -3183,6 +3185,7 @@ ModifyTablePath *
 create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
@@ -3250,6 +3253,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->canSetTag = canSetTag;
 	pathnode->nominalRelation = nominalRelation;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
+	pathnode->part_cols_updated = part_cols_updated;
 	pathnode->resultRelations = resultRelations;
 	pathnode->subpaths = subpaths;
 	pathnode->subroots = subroots;
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 945ac02..a9feecb 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -82,8 +82,8 @@ extern void check_new_partition_bound(char *relname, Relation parent,
 extern Oid	get_partition_parent(Oid relid);
 extern List *get_qual_from_partbound(Relation rel, Relation parent,
 						PartitionBoundSpec *spec);
-extern List *map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+extern List *map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
@@ -101,6 +101,9 @@ extern int get_partition_for_tuple(PartitionDispatch *pd,
 						EState *estate,
 						PartitionDispatchData **failed_at,
 						TupleTableSlot **failed_slot);
+extern void pull_child_partition_columns(Relation rel,
+							 Relation parent,
+							 Bitmapset **partcols);
 extern Oid	get_default_oid_from_partdesc(PartitionDesc partdesc);
 extern Oid	get_default_partition_oid(Oid parentId);
 extern void update_default_partition_oid(Oid parentId, Oid defaultPartId);
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index c4ecf0d..85a2529 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,7 +187,10 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
@@ -207,6 +210,8 @@ extern void EvalPlanQualSetTuple(EPQState *epqstate, Index rti,
 					 HeapTuple tuple);
 extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
 extern void ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionDispatch **pd,
@@ -218,6 +223,8 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot, EState *estate);
 
 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
 extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 52d3532..833c327 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -980,15 +980,32 @@ typedef struct ModifyTableState
 	int			mt_num_partitions;	/* Number of members in the following
 									 * arrays */
 	ResultRelInfo **mt_partitions;	/* Per partition result relation pointers */
-	TupleConversionMap **mt_partition_tupconv_maps;
-	/* Per partition tuple conversion map */
+
+	/*
+	 * Per partition conversion map to convert tuples from root to leaf
+	 * partition
+	 */
+	TupleConversionMap **mt_perleaf_parentchild_maps;
+
+	/*
+	 * Per partition conversion map to convert tuples from leaf partition to
+	 * root
+	 */
+	TupleConversionMap **mt_perleaf_childparent_maps;
+
+	/*
+	 * Per subplan conversion map to convert tuples from leaf partition to
+	 * root partitioned table
+	 */
+	TupleConversionMap **mt_persubplan_childparent_maps;
+
 	TupleTableSlot *mt_partition_tuple_slot;
+	TupleTableSlot *mt_root_tuple_slot;
+
 	struct TransitionCaptureState *mt_transition_capture;
 	/* controls transition table population for specified operation */
 	struct TransitionCaptureState *mt_oc_transition_capture;
 	/* controls transition table population for INSERT...ON CONFLICT UPDATE */
-	TupleConversionMap **mt_transition_tupconv_maps;
-	/* Per plan/partition tuple conversion */
 } ModifyTableState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index dd74efa..c414755 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -219,6 +219,7 @@ typedef struct ModifyTable
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	int			resultRelIndex; /* index of first resultRel in plan's list */
 	int			rootResultRelIndex; /* index of the partitioned table root */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index e085cef..d19b7f1 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1667,6 +1667,7 @@ typedef struct ModifyTablePath
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	List	   *subpaths;		/* Path(s) producing source data */
 	List	   *subroots;		/* per-target-table PlannerInfos */
@@ -2109,6 +2110,10 @@ typedef struct AppendRelInfo
  * The child_rels list must contain at least one element, because the parent
  * partitioned table is itself counted as a child.
  *
+ * all_part_cols contains all attribute numbers from the parent that are
+ * used as partitioning columns by the parent or by some descendant that is
+ * itself partitioned.
+ *
  * These structs are kept in the PlannerInfo node's pcinfo_list.
  */
 typedef struct PartitionedChildRelInfo
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e9ed16a..4b4485f 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -238,6 +238,7 @@ extern ModifyTablePath *create_modifytable_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index cef70b1..a49980b 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,36 +198,367 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+---------------------------
+-- UPDATE with row movement
+---------------------------
+-- An update to a partition should check the partition bound constraint for the
+-- new tuple. If the partition key is updated, the row should be moved to the
+-- appropriate partition. Updatable views using partitions should enforce the
+-- check options for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update row movement works when the leaf partitions are not in bound order.
+create table part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+alter table range_parted attach partition part_b_20_b_30 for values from ('b', 20) to ('b', 30);
+create table part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+alter table range_parted attach partition part_b_10_b_20 for values from ('b', 10) to ('b', 20);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+-- This tests a partition-key UPDATE on a partitioned table that does not have any child partitions.
+update part_b_10_b_20 set b = b - 6;
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) partition by range (d);
+create table part_d_1_15 partition of part_c_100_200 for values from (1) to (15);
+create table part_d_15_20 partition of part_c_100_200 for values from (15) to (20);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+(6 rows)
+
+-- The subplans should appear in partition bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+             QUERY PLAN              
+-------------------------------------
+ Update on range_parted
+   Update on part_a_1_a_10
+   Update on part_a_10_a_20
+   Update on part_b_1_b_10
+   Update on part_c_1_100
+   Update on part_d_1_15
+   Update on part_d_15_20
+   Update on part_b_20_b_30
+   ->  Seq Scan on part_a_1_a_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_a_10_a_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_1_b_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_1_100
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_1_15
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_15_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_20_b_30
+         Filter: (c > '97'::numeric)
+(22 rows)
+
+-- fail (row movement happens only within the partition subtree):
+update part_c_100_200 set c = c - 20, d = c where c = 105;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (null, 85, b, 15, 105).
+-- fail (no partition key update, so no attempt to move tuple, but "a = 'a'" violates partition constraint enforced by root partition)
+update part_b_10_b_20 set a = 'a';
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (null, 1, 96, 12, a).
+-- success; partition key update, no constraint violation
+update range_parted set d = d - 10 where d > 10;
+-- success; no partition key update, no constraint violation
+update range_parted set e = d;
+-- No row found:
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d | e 
+----------------+---+----+-----+---+---
+ part_a_10_a_20 | a | 10 | 200 | 1 | 1
+ part_a_1_a_10  | a |  1 |   1 | 1 | 1
+ part_d_1_15    | b | 12 | 116 | 1 | 1
+ part_d_1_15    | b | 13 | 117 | 2 | 2
+ part_d_1_15    | b | 15 | 125 | 6 | 6
+ part_d_1_15    | b | 17 | 125 | 9 | 9
+(6 rows)
+
+-- fail (row movement happens only within the partition subtree):
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_d_1_15" violates partition constraint
+DETAIL:  Failing row contains (2, 117, b, 7, 2).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_a_1_a_10 | a |  4 | 200 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+(6 rows)
+
+-- Update the partition key using an updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120, 1, 1).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120, 1, 1).
+-- succeeds, row movement, check option passes
+update upview set a = 'b', b = 15 where b = 4;
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+ part_d_1_15   | b | 15 | 199 | 1 | 1
+(6 rows)
+
+-- cleanup
+drop view upview;
+-- RETURNING with whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+ range_parted  | a | b  | c  | d  | e 
+---------------+---+----+----+----+---
+ (b,15,95,16,) | b | 15 | 95 | 16 | 
+ (b,17,95,19,) | b | 17 | 95 | 19 | 
+(2 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  95 | 16 | 
+ part_c_1_100   | b | 17 |  95 | 19 | 
+(6 rows)
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,110,1,), (b,13,98,2,), (b,15,106,16,), (b,17,106,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 13 |  98 |  2 | 
+ part_d_1_15    | b | 12 | 110 |  1 | 
+ part_d_15_20   | b | 15 | 106 | 16 | 
+ part_d_15_20   | b | 17 | 106 | 19 | 
+(6 rows)
+
+:init_range_parted;
+-- Enabling OLD TABLE capture for both DELETE and UPDATE statement triggers
+-- should not cause deleted rows to be captured twice. The same applies to
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,146,1,), (b,13,147,2,), (b,15,155,16,), (b,17,155,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_1_15    | b | 12 | 146 |  1 | 
+ part_d_1_15    | b | 13 | 147 |  2 | 
+ part_d_15_20   | b | 15 | 155 | 16 | 
+ part_d_15_20   | b | 17 | 155 | 19 | 
+(6 rows)
+
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+-- Install BR triggers on the child partitions, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d1_15 before update or insert on part_d_1_15
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d15_20 before update or insert on part_d_15_20
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 15 |  98 |  2 | 
+ part_d_1_15    | b | 15 | 110 |  1 | 
+ part_d_15_20   | b | 17 | 106 | 16 | 
+ part_d_15_20   | b | 19 | 106 | 19 | 
+(6 rows)
+
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_1_15    | b | 15 | 146 |  1 | 
+ part_d_1_15    | b | 16 | 147 |  2 | 
+ part_d_15_20   | b | 17 | 155 | 16 | 
+ part_d_15_20   | b | 19 | 155 | 19 | 
+(6 rows)
+
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_d1_15 ON part_d_1_15;
+drop trigger trig_d15_20 ON part_d_15_20;
+drop function func_parted_mod_b();
+-- statement triggers with update row movement
+---------------------------------------------------
+:init_range_parted;
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_1_15
+create trigger d1_delete_trig
+  after delete on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_update_trig
+  after update on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_insert_trig
+  after insert on part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+create trigger d15_delete_trig
+  after delete on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_update_trig
+  after update on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_insert_trig
+  after insert on part_d_15_20 for each statement execute procedure trigfunc();
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+NOTICE:  trigger = parent_update_trig fired on table range_parted during UPDATE
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 150 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  55 | 16 | 
+ part_c_1_100   | b | 17 |  55 | 19 | 
+(6 rows)
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger d1_delete_trig ON part_d_1_15;
+drop trigger d1_update_trig ON part_d_1_15;
+drop trigger d1_insert_trig ON part_d_1_15;
+drop trigger d15_delete_trig ON part_d_15_20;
+drop trigger d15_update_trig ON part_d_15_20;
+drop trigger d15_insert_trig ON part_d_15_20;
+drop table mintab;
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
-                                  Table "public.part_def"
- Column |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description 
---------+---------+-----------+----------+---------+----------+--------------+-------------
- a      | text    |           |          |         | extended |              | 
- b      | integer |           |          |         | plain    |              | 
+                                       Table "public.part_def"
+ Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description 
+--------+-------------------+-----------+----------+---------+----------+--------------+-------------
+ a      | text              |           |          |         | extended |              | 
+ b      | bigint            |           |          |         | plain    |              | 
+ c      | numeric           |           |          |         | main     |              | 
+ d      | integer           |           |          |         | plain    |              | 
+ e      | character varying |           |          |         | extended |              | 
 Partition of: range_parted DEFAULT
-Partition constraint: (NOT (((a = 'a'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'a'::text) AND (b >= 10) AND (b < 20)) OR ((a = 'b'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'b'::text) AND (b >= 10) AND (b < 20))))
+Partition constraint: (NOT (((a = 'a'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'a'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'b'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '20'::bigint) AND (b < '30'::bigint))))
 
 insert into range_parted values ('c', 9);
 -- ok
@@ -235,7 +566,55 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 ERROR:  new row for relation "part_def" violates partition constraint
-DETAIL:  Failing row contains (a, 9).
+DETAIL:  Failing row contains (a, 9, null, null, null).
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from non-default to default partition.
+-- Fail: the default partition is not under part_a_10_a_20.
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+ERROR:  new row for relation "part_a_10_a_20" violates partition constraint
+DETAIL:  Failing row contains (ad, 10, 200, 1, null).
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+ partname | a  | b  |  c  | d  | e 
+----------+----+----+-----+----+---
+ part_def | ad |  1 |   1 |  1 | 
+ part_def | ad | 10 | 200 |  1 | 
+ part_def | bd | 12 |  96 |  1 | 
+ part_def | bd | 13 |  97 |  2 | 
+ part_def | bd | 15 | 105 | 16 | 
+ part_def | bd | 17 | 105 | 19 | 
+ part_def | d  |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
 create table list_parted (
 	a text,
 	b int
@@ -250,6 +629,110 @@ ERROR:  new row for relation "list_default" violates partition constraint
 DETAIL:  Failing row contains (a, 10).
 -- ok
 update list_default set a = 'x' where a = 'd';
--- cleanup
+drop table list_parted;
+--------------
+-- UPDATE of partition key or non-partition columns, with different column
+-- ordering and triggers.
+--------------
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+-- Test partition constraint violation when an intermediate ancestor is used
+-- and the constraint is inherited from the upper root.
+update sub_parted set a = 2 where c = 10;
+ERROR:  new row for relation "sub_part2" violates partition constraint
+DETAIL:  Failing row contains (2, 10, 2).
+-- UPDATE that does not modify the partition key of the partitions chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Triggers can cause UPDATE row movement if they modify the partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+drop trigger parted_mod_b ON sub_part1 ;
+-- If a BR DELETE trigger prevents the DELETE from happening, we should also
+-- skip the INSERT when that DELETE is part of an UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+drop trigger trig_skip_delete ON sub_part1 ;
+-- UPDATE of the partition key with a FROM clause. If the join produces multiple
+-- output rows for the same row to be modified, we should tuple-route the row
+-- only once; no extra rows should be inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 |  1 | 70
+ list_part1 | 2 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop table non_parted;
+drop function func_parted_mod_b();
 drop table range_parted;
 drop table list_parted;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 66d1fec..0ec5bb2 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,25 +107,229 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+
+---------------------------
+-- UPDATE with row movement
+---------------------------
+
+-- An update to a partition should check the partition bound constraint for the
+-- new tuple. If the partition key is updated, the row should be moved to the
+-- appropriate partition. Updatable views using partitions should enforce the
+-- check options for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update row movement works when the leaf partitions are not in bound order.
+create table part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+alter table range_parted attach partition part_b_20_b_30 for values from ('b', 20) to ('b', 30);
+create table part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+alter table range_parted attach partition part_b_10_b_20 for values from ('b', 10) to ('b', 20);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+
+-- This tests a partition-key UPDATE on a partitioned table that does not have any child partitions.
+update part_b_10_b_20 set b = b - 6;
+
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) partition by range (d);
+create table part_d_1_15 partition of part_c_100_200 for values from (1) to (15);
+create table part_d_15_20 partition of part_c_100_200 for values from (15) to (20);
+
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+
+create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+
+-- The subplans should appear in partition bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+
+-- fail (row movement happens only within the partition subtree):
+update part_c_100_200 set c = c - 20, d = c where c = 105;
+-- fail (no partition key update, so no attempt to move tuple, but "a = 'a'" violates partition constraint enforced by root partition)
+update part_b_10_b_20 set a = 'a';
+-- success; partition key update, no constraint violation
+update range_parted set d = d - 10 where d > 10;
+-- success; no partition key update, no constraint violation
+update range_parted set e = d;
+-- No row found:
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+:show_data;
+
+-- fail (row movement happens only within the partition subtree):
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+:show_data;
+
+-- Update the partition key using an updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement, check option passes
+update upview set a = 'b', b = 15 where b = 4;
+
+:show_data;
+
+-- cleanup
+drop view upview;
+
+-- RETURNING with whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+:show_data;
+
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+
+-- Enabling OLD TABLE capture for both DELETE and UPDATE statement triggers
+-- should not cause deleted rows to be captured twice. The same applies to
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d1_15 before update or insert on part_d_1_15
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d15_20 before update or insert on part_d_15_20
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_d1_15 ON part_d_1_15;
+drop trigger trig_d15_20 ON part_d_15_20;
+drop function func_parted_mod_b();
+
+
+-- statement triggers with update row movement
+---------------------------------------------------
+
+:init_range_parted;
+
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_d_1_15
+create trigger d1_delete_trig
+  after delete on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_update_trig
+  after update on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_insert_trig
+  after insert on part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+create trigger d15_delete_trig
+  after delete on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_update_trig
+  after update on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_insert_trig
+  after insert on part_d_15_20 for each statement execute procedure trigfunc();
+
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger d1_delete_trig ON part_d_1_15;
+drop trigger d1_update_trig ON part_d_1_15;
+drop trigger d1_insert_trig ON part_d_1_15;
+drop trigger d15_delete_trig ON part_d_15_20;
+drop trigger d15_update_trig ON part_d_15_20;
+drop trigger d15_insert_trig ON part_d_15_20;
+
+drop table mintab;
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
 
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
 insert into range_parted values ('c', 9);
@@ -134,6 +338,21 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
+-- Update row movement from non-default to default partition.
+-- Fail, default partition is not under part_a_10_a_20;
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+
 create table list_parted (
 	a text,
 	b int
@@ -148,6 +367,82 @@ update list_default set a = 'a' where a = 'd';
 -- ok
 update list_default set a = 'x' where a = 'd';
 
--- cleanup
+drop table list_parted;
+
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+
+
+-----------
+-- Triggers can cause UPDATE row movement if it modified partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger parted_mod_b ON sub_part1 ;
+
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger trig_skip_delete ON sub_part1 ;
+
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+drop table non_parted;
+
+drop function func_parted_mod_b();
 drop table range_parted;
 drop table list_parted;
#183Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Khandekar (#181)
Re: UPDATE of partition key

Hi Amit.

Thanks a lot for updated patches and sorry that I couldn't get to looking
at your emails sooner. Note that I'm replying here to both of your
emails, but looking at only the latest v22 patch.

On 2017/10/24 0:15, Amit Khandekar wrote:

On 16 October 2017 at 08:28, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

+        (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))

Is there some reason why a bitwise operator is used here?

That exact condition means that the function is called for transition
capture for updated rows being moved to another partition. For this
scenario, either the oldtup or the newtup is NULL. I wanted to exactly
capture that condition there. I think the bitwise operator is more
user-friendly in emphasizing the point that it is indeed an "either a
or b, not both" condition.

I see. In that case, since this patch adds the new condition, a note
about it in the comment just above would be good, because the situation
you describe here seems to arise only during update-tuple-routing, IIUC.
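As an aside, the either-but-not-both behaviour of that test can be sketched outside the server (illustrative only; the real check in trigger.c operates on HeapTuple pointers):

```python
# Hypothetical sketch: (oldtup == NULL) ^ (newtup == NULL) is true exactly
# when one side is missing -- i.e. one of the two "halves" of an UPDATE
# that was turned into DELETE + INSERT for cross-partition row movement.
def is_row_movement_capture(oldtup, newtup):
    return (oldtup is None) ^ (newtup is None)

print(is_row_movement_capture(None, ("b", 15)))       # True: moved-in half
print(is_row_movement_capture(("b", 15), None))       # True: moved-out half
print(is_row_movement_capture(("b", 15), ("b", 16)))  # False: ordinary UPDATE
```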

+ * 'perleaf_parentchild_maps' receives an array of TupleConversionMap objects
+ *     with one entry for every leaf partition (required to convert input
+ *     tuple based on the root table's rowtype to a leaf partition's rowtype
+ *     after tuple routing is done)

Could this be named leaf_tupconv_maps, maybe? It perhaps makes clear that
they are maps needed for "tuple conversion". And the other field holding
the reverse map as leaf_rev_tupconv_maps. Either that or use underscores
to separate words, but then it gets too long I guess.

In master branch, now this param is already there with the name
"tup_conv_maps". In the rebased version in the earlier mail, I haven't
again changed it. I think "tup_conv_maps" looks clear enough.

OK.

In the latest patch:

+ * 'update_rri' has the UPDATE per-subplan result rels. These are re-used
+ *      instead of allocating new ones while generating the array of all leaf
+ *      partition result rels.

Instead of:

"These are re-used instead of allocating new ones while generating the
array of all leaf partition result rels."

how about:

"There is no need to allocate a new ResultRellInfo entry for leaf
partitions for which one already exists in this array"

ISTM, ConvertPartitionTupleSlot has a very specific job and a bit complex
interface. I guess it could simply have the following interface:

static HeapTuple ConvertPartitionTuple(ModifyTableState *mtstate,
                                       HeapTuple tuple, bool is_update);

And figure out, based on the value of is_update, which map to use and
which slot to set *p_new_slot to (what is now "new_slot" argument).
You're getting mtstate here anyway, which contains all the information you
need here. It seems better to make that (selecting which map and which
slot) part of the function's implementation if we're having this function
at all, imho. Maybe I'm missing some details there, but my point still
remains that we should try to put more logic in that function instead of
having it just do the mechanical tuple conversion.

I tried to see how the interface would look if we do that way. Here is
how the code looks :

static TupleTableSlot *
ConvertPartitionTupleSlot(ModifyTableState *mtstate,
                          bool for_update_tuple_routing,
                          int map_index,
                          HeapTuple *tuple,
                          TupleTableSlot *slot)
{
    TupleConversionMap *map;
    TupleTableSlot *new_slot;

    if (for_update_tuple_routing)
    {
        map = mtstate->mt_persubplan_childparent_maps[map_index];
        new_slot = mtstate->mt_rootpartition_tuple_slot;
    }
    else
    {
        map = mtstate->mt_perleaf_parentchild_maps[map_index];
        new_slot = mtstate->mt_partition_tuple_slot;
    }

    if (!map)
        return slot;

    *tuple = do_convert_tuple(*tuple, map);

    /*
     * Change the partition tuple slot descriptor, as per converted tuple.
     */
    ExecSetSlotDescriptor(new_slot, map->outdesc);
    ExecStoreTuple(*tuple, new_slot, InvalidBuffer, true);

    return new_slot;
}

It looks like the interface does not simplify much, and on top of that,
the function grows longer. Also, the caller anyway has to be aware
whether map_index is an index into the leaf partitions or into the update
subplans. So it is not like the caller does not have to be aware about
whether the mapping should be mt_persubplan_childparent_maps or
mt_perleaf_parentchild_maps.

Hmm, I think we should try to make it so that the caller doesn't have to
be aware of that. And by caller I guess you mean ExecInsert(), which
should not be a place, IMHO, where to try to introduce a lot of new logic
specific to update tuple routing. ISTM, ModifyTableState now has one too
many TupleConversionMap pointer arrays after the patch, creating the need
to choose from in the first place. AIUI -

* mt_perleaf_parentchild_maps:

- each entry is a map to convert root parent's tuples to a given leaf
partition's format

- used to be called mt_partition_tupconv_maps and is needed when tuple-
routing is in use; for both INSERT and UPDATE with tuple-routing

- as many entries in the array as there are leaf partitions and stored
in the partition bound order

* mt_perleaf_childparent_maps:

- each entry is a map to convert a leaf partition's tuples to the root
parent's format

- newly added by this patch and seems to be needed for UPDATE with
tuple-routing for two needs: 1. tuple-routing should start with a
tuple in root parent format whereas the tuple received is in leaf
partition format when ExecInsert() called for update-tuple-routing (by
ExecUpdate), 2. after tuple-routing, we must capture the tuple
inserted into the partition in the transition tuplestore which accepts
tuples in root parent's format

- as many entries in the array as there are leaf partitions and stored
in the partition bound order

* mt_persubplan_childparent_maps:

- each entry is a map to convert a child table's tuples to the root
parent's format

- used to be called mt_transition_tupconv_maps and needed for converting
child tuples to the root parent's format when storing them in the
transition tuplestore which accepts tuples in root parent's format

- as many entries in the array as there are sub-plans in mt_plans and
stored in either the partition bound order or unknown order (the
latter in the regular inheritance case)

I think we could combine the last two into one. The only apparent reason
for them to be separate seems to be that the subplan array might contain
fewer entries than the perleaf array, and ExecInsert() has only enough
information to calculate the offset of a map in the persubplan array.
That is, resultRelInfo of leaf partition that ExecInsert starts with in
the update-tuple-routing case comes from mtstate->resultRelInfo array
which contains only mt_nplans entries. So, if we only have the array with
entries for *all* partitions, it's hard to get the offset of the map to
use in that array.

I suggest we don't add a new map array and a significant amount of new
code to initialize the same and to implement the logic to choose the
correct array to get the map from. Instead, we could simply add an array
of integers with mt_nplans entries. Each entry is an offset of a given
sub-plan in the array containing entries of something for *all*
partitions. Since, we are teaching ExecSetupPartitionTupleRouting() to
reuse ResultRelInfos from mtstate->resultRelInfos, there is a suitable
place to construct such array. Let's say the array is called
mt_subplan_partition_offsets[]. Let ExecSetupPartitionTupleRouting() also
initialize the parent-to-partition maps for *all* partitions, in the
update-tuple-routing case. Then add a quick-return check in
ExecSetupTransitionCaptureState() to see if the map has already been set
by ExecSetupPartitionTupleRouting(). Since we're using the same map for
two purposes, we could rename mt_transition_tupconv_maps to something that
doesn't bind it to its use only for transition tuple capture.

With that, now there are no persubplan and perleaf arrays for ExecInsert()
to pick from to select a map to pass to ConvertPartitionTupleSlot(), or
maybe even no need for the separate function. The tuple-routing code
block in ExecInsert would look like below (writing resultRelInfo as just Rel):

rootRel = (mtstate->rootRel != NULL) ? mtstate->rootRel : Rel

if (rootRel != Rel)     /* update tuple-routing active */
{
    int subplan_off = Rel - mtstate->Rel[0];
    int leaf_off = mtstate->mt_subplan_partition_offsets[subplan_off];

    if (mt_transition_tupconv_maps[leaf_off])
    {
        /*
         * Convert to root format using
         * mt_transition_tupconv_maps[leaf_off]
         */

        slot = mt_root_tuple_slot;      /* for tuple-routing */

        /* Store the converted tuple into slot */
    }
}

/* Existing tuple-routing flow follows */
new_leaf = ExecFindPartition(rootRel, slot, ...)

if (mtstate->transition_capture)
{
    transition_capture_map = mt_transition_tupconv_maps[new_leaf]
}

if (mt_partition_tupconv_maps[new_leaf])
{
    /*
     * Convert to leaf format using mt_partition_tupconv_maps[new_leaf]
     */

    slot = mt_partition_tuple_slot;

    /* Store the converted tuple into slot */
}

ISTM, the newly introduced logic in ExecSetupTransitionCaptureState() to
try to reuse the per-subplan child-to-parent map as per-leaf
child-to-parent map could be simplified a bit. I mean the following code:

+    /*
+     * But for Updates, we can share the per-subplan maps with the per-leaf
+     * maps.
+     */
+    update_rri_index = 0;
+    update_rri = mtstate->resultRelInfo;
+    if (mtstate->mt_nplans > 0)
+        cur_reloid = RelationGetRelid(update_rri[0].ri_RelationDesc);
-        /* Choose the right set of partitions */
-        if (mtstate->mt_partition_dispatch_info != NULL)
+    for (i = 0; i < numResultRelInfos; ++i)
+    {
<snip>

How about (pseudo-code):

j = 0;
for (i = 0; i < n_leaf_parts; i++)
{
    if (j < n_subplans && leaf_rri[i]->oid == subplan_rri[j]->oid)
    {
        leaf_childparent_map[i] = subplan_childparent_map[j];
        j++;
    }
    else
    {
        leaf_childparent_map[i] = new map
    }
}
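The invariant that loop relies on (both arrays being in the same, partition-bound order) can be illustrated with a small standalone sketch; the names here are hypothetical stand-ins for the ResultRelInfo arrays:

```python
# Walk the full leaf-partition list once, advancing a second cursor over
# the (smaller) subplan list; equal OIDs mean the subplan's existing map
# can be reused. Both lists must be sorted in the same (bound) order.
def build_leaf_maps(leaf_oids, subplan_oids):
    maps = []
    j = 0
    for oid in leaf_oids:
        if j < len(subplan_oids) and oid == subplan_oids[j]:
            maps.append(("reused", oid))
            j += 1
        else:
            maps.append(("new", oid))
    return maps

print(build_leaf_maps([101, 102, 103, 104], [102, 104]))
```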

I think the above would also be useful in ExecSetupPartitionTupleRouting()
where you've added similar code to try to reuse per-subplan ResultRelInfos.

Did something like that in the attached patch. Please have a look.
After we conclude on that, will do the same for
ExecSetupPartitionTupleRouting() as well.

Yeah, ExecSetupTransitionCaptureState() looks better in v22, but as I
explained above, we may not need to change the function so much. The
approach, OTOH, should be adopted for ExecSetupPartitionTupleRouting().

In the following:

ExecSetupPartitionTupleRouting(rel,
+                                       (operation == CMD_UPDATE ?
+                                        mtstate->resultRelInfo : NULL),
+                                       (operation == CMD_UPDATE ? nplans : 0),

Can the second parameter be made to not span two lines? It was a bit hard
for me to see that there are two new parameters.

I think it is safe to just pass mtstate->resultRelInfo. Inside
ExecSetupPartitionTupleRouting() we should anyways check only the
nplans param (and not update_rri) to decide whether it is for insert
or update. So did the same.

OK.

By the way, I've seen in a number of places that the patch calls "root
table" a partition. Not just in comments, but also a variable appears to
be given a name which contains rootpartition. I can see only one instance
where root is called a partition in the existing source code, but it seems
to have been introduced only recently:

allpaths.c:1333: * A root partition will already have a

Changed to either this :
root partition => root partitioned table
or this if we have to refer to it too often :
root partition => root

That seems fine, thanks.

On 2017/10/25 15:10, Amit Khandekar wrote:

On 16 October 2017 at 08:28, Amit Langote wrote:

In ExecInitModifyTable(), can we try to minimize the number of places
where update_tuple_routing_needed is being set. Currently, it's being set
in 3 places:

I think the way it's done seems ok. For each resultRelInfo,
update_tuple_routing_needed is set if that resultRel has any of its
partition columns modified. And at that point, we don't have the rel
opened, so we can't check whether that rel is partitioned. So another
check is required outside of the loop.

I understood why now.

+         * qual for each partition. Note that, if there are SubPlans in
there,
+         * they all end up attached to the one parent Plan node.

The sentence starting with "Note that, " is a bit unclear.

+        Assert(update_tuple_routing_needed ||
+               (operation == CMD_INSERT &&
+                list_length(node->withCheckOptionLists) == 1 &&
+                mtstate->mt_nplans == 1));

The comment I complained about above is perhaps about this Assert.

That is an existing comment.

Sorry, my bad.

On HEAD, the "parent Plan" refers to
mtstate->mt_plans[0]. Now in the patch, for the parent node in
ExecInitQual(), mtstate->ps is passed rather than mt_plans[0]. So the
parent plan refers to this mtstate node.

Hmm, I'm not really sure if doing that (passing mtstate->ps) would be
accurate. In the update tuple routing case, it seems that it's better to
pass the correct parent PlanState pointer to ExecInitQual(), that is, one
corresponding to the partition's sub-plan. At least I get that feeling by
looking at how parent is used downstream to that ExecInitQual() call, but
there *may* not be anything to worry about there after all. I'm unsure.

BTW, the reason I had changed the parent node to mtstate->ps is :
Other places in that code use mtstate->ps while initializing
expressions :

/*
 * Build a projection for each result rel.
 */
resultRelInfo->ri_projectReturning =
    ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
                            resultRelInfo->ri_RelationDesc->rd_att);

...........

/* build DO UPDATE WHERE clause expression */
if (node->onConflictWhere)
{
    ExprState  *qualexpr;

    qualexpr = ExecInitQual((List *) node->onConflictWhere,
                            &mtstate->ps);
    ....
}

I think wherever we initialize expressions belonging to a plan, we
should use that plan as the parent. WithCheckOptions are fields of
ModifyTableState.

You may be right, but I see for WithCheckOptions initialization
specifically that the non-tuple-routing code passes the actual sub-plan
when initializing the WCO for a given result rel.

Comments on the optimizer changes:

+get_all_partition_cols(List *rtables,

Did you mean rtable?

I did mean rtables. It's a list of rtables.

It's not, AFAIK. rtable (range table) is a list of range table entries,
which is also what seems to get passed to get_all_partition_cols for that
argument (root->parse->rtable, which is not a list of lists).

Moreover, there are no existing instances of this naming within the
planner other than those that this patch introduces:

$ grep rtables src/backend/optimizer/
planner.c:114: static void get_all_partition_cols(List *rtables,
planner.c:1063: get_all_partition_cols(List *rtables,
planner.c:1069: Oid root_relid = getrelid(root_rti, rtables);
planner.c:1078: Oid relid = getrelid(rti, rtables);

OTOH, dependency.c does have rtables, but it's actually a list of range
tables. For example:

dependency.c:1360: context.rtables = list_make1(rtable);

+       if (partattno != 0)
+           child_keycols =
+               bms_add_member(child_keycols,
+                              partattno - FirstLowInvalidHeapAttributeNumber);
+   }
+   foreach(lc, partexprs)
+   {

Elsewhere (in quite a few places), we don't iterate over partexprs
separately like this, although I'm not saying it is bad, just different
from other places.

I think you are suggesting we do it like how it's done in
is_partition_attr(). Can you please let me know other places we do
this same way ? I couldn't find.

OK, not as many as I thought there would be, but there are following
beside is_partition_attrs():

partition.c: get_range_nulltest()
partition.c: get_qual_for_range()
relcache.c: RelationBuildPartitionKey()

Aha, so here's where all_part_cols was being set before...

Yes, and we used to have PartitionedChildRelInfo.all_part_cols field
for that. We used to populate that while traversing through the
partition tree in expand_inherited_rtentry(). I agreed with Dilip's
opinion that this would unnecessarily add up some processing even when
the query is not a DML. And also, we don't have to have
PartitionedChildRelInfo.all_part_cols. For the earlier implementation,
check v18 patch or earlier versions.

Hmm, I think I have to agree with both you and Dilip that that would add
some redundant processing to other paths.

Attached v22 patch.

Thanks again.

Regards,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#184Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#182)
Re: UPDATE of partition key

On Wed, Oct 25, 2017 at 11:40 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Below I have addressed the remaining review comments :

The changes to trigger.c still make me super-nervous. Hey THOMAS
MUNRO, any chance you could review that part?

+       /* The caller must have already locked all the partitioned tables. */
+       root_rel = heap_open(root_relid, NoLock);
+       *all_part_cols = NULL;
+       foreach(lc, partitioned_rels)
+       {
+               Index           rti = lfirst_int(lc);
+               Oid                     relid = getrelid(rti, rtables);
+               Relation        part_rel = heap_open(relid, NoLock);
+
+               pull_child_partition_columns(part_rel, root_rel, all_part_cols);
+               heap_close(part_rel, NoLock);

I don't like the fact that we're opening and closing the relation here
just to get information on the partitioning columns. I think it would
be better to do this someplace that already has the relation open and
store the details in the RelOptInfo. set_relation_partition_info()
looks like the right spot.

+void
+pull_child_partition_columns(Relation rel,
+                                                        Relation parent,
+                                                        Bitmapset **partcols)

This code has a lot in common with is_partition_attr(). I'm not sure
it's worth trying to unify them, but it could be done.

+ * 'num_update_rri' : number of UPDATE per-subplan result rels. For INSERT,

Instead of " : ", you could just write "is the".

+                * For Updates, if the leaf partition is already present in the
+                * per-subplan result rels, we re-use that rather than
initialize a
+                * new result rel. The per-subplan resultrels and the
resultrels of
+                * the leaf partitions are both in the same canonical
order. So while

It would be good to explain the reason. Also, Updates shouldn't be
capitalized here.

+        Assert(cur_update_rri <= update_rri + num_update_rri - 1);

Maybe just cur_update_rri < update_rri + num_update_rri, or even
cur_update_rri - update_rri < num_update_rri.

Also, +1 for Amit Langote's idea of trying to merge
mt_perleaf_childparent_maps with mt_persubplan_childparent_maps.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#185Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#184)
Re: UPDATE of partition key

On 7 November 2017 at 00:33, Robert Haas <robertmhaas@gmail.com> wrote:

Also, +1 for Amit Langote's idea of trying to merge
mt_perleaf_childparent_maps with mt_persubplan_childparent_maps.

Currently I am trying to see if it simplifies things if we do that. We
will be merging these arrays into one, but we are adding a new int[]
array that maps subplans to leaf partitions. Will get back with how it
looks finally.

Robert, Amit, I will get back to you on your other review comments.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


#186Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Khandekar (#185)
Re: UPDATE of partition key

On 2017/11/07 14:40, Amit Khandekar wrote:

On 7 November 2017 at 00:33, Robert Haas <robertmhaas@gmail.com> wrote:

Also, +1 for Amit Langote's idea of trying to merge
mt_perleaf_childparent_maps with mt_persubplan_childparent_maps.

Currently I am trying to see if it simplifies things if we do that. We
will be merging these arrays into one, but we are adding a new int[]
array that maps subplans to leaf partitions. Will get back with how it
looks finally.

One thing to note is that the int[] array I mentioned will be much faster
to compute than going to convert_tuples_by_name() to build the additional
maps array.

Thanks,
Amit


#187Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Robert Haas (#184)
Re: UPDATE of partition key

On Tue, Nov 7, 2017 at 8:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:

The changes to trigger.c still make me super-nervous. Hey THOMAS
MUNRO, any chance you could review that part?

Looking, but here's one silly thing that jumped out at me while
getting started with this patch. I cannot seem to convince my macOS
system to agree with the expected sort order from :show_data, where
underscores precede numbers:

  part_a_10_a_20 | a | 10 | 200 |  1 |
  part_a_1_a_10  | a |  1 |   1 |  1 |
- part_d_1_15    | b | 15 | 146 |  1 |
- part_d_1_15    | b | 16 | 147 |  2 |
  part_d_15_20   | b | 17 | 155 | 16 |
  part_d_15_20   | b | 19 | 155 | 19 |
+ part_d_1_15    | b | 15 | 146 |  1 |
+ part_d_1_15    | b | 16 | 147 |  2 |

It seems that macOS (like older BSDs) just doesn't know how to sort
Unicode and falls back to sorting the bits. I expect that means that
the test will also fail on any other OS with "make check
LC_COLLATE=C". I believe our regression tests are supposed to pass
with a wide range of collations including C, so I wonder if this means
we should stick a leading zero on those single digit numbers, or
something, to stabilise the output.
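For what it's worth, the byte-wise ordering is easy to reproduce outside the database; Python compares strings by code point, which agrees with the C locale for these ASCII names ('5' is 0x35, '_' is 0x5f):

```python
names = ["part_a_10_a_20", "part_a_1_a_10", "part_d_1_15", "part_d_15_20"]
# Byte-wise, '5' < '_', so part_d_15_20 sorts before part_d_1_15 --
# the opposite of what a Unicode-aware collation produces.
print(sorted(names))
```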

--
Thomas Munro
http://www.enterprisedb.com


#188Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Thomas Munro (#187)
1 attachment(s)
Re: UPDATE of partition key

On 8 November 2017 at 07:55, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

On Tue, Nov 7, 2017 at 8:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:

The changes to trigger.c still make me super-nervous. Hey THOMAS
MUNRO, any chance you could review that part?

Looking, but here's one silly thing that jumped out at me while
getting started with this patch. I cannot seem to convince my macOS
system to agree with the expected sort order from :show_data, where
underscores precede numbers:

part_a_10_a_20 | a | 10 | 200 |  1 |
part_a_1_a_10  | a |  1 |   1 |  1 |
- part_d_1_15    | b | 15 | 146 |  1 |
- part_d_1_15    | b | 16 | 147 |  2 |
part_d_15_20   | b | 17 | 155 | 16 |
part_d_15_20   | b | 19 | 155 | 19 |
+ part_d_1_15    | b | 15 | 146 |  1 |
+ part_d_1_15    | b | 16 | 147 |  2 |

It seems that macOS (like older BSDs) just doesn't know how to sort
Unicode and falls back to sorting the bits. I expect that means that
the test will also fail on any other OS with "make check
LC_COLLATE=C". I believe our regression tests are supposed to pass
with a wide range of collations including C, so I wonder if this means
we should stick a leading zero on those single digit numbers, or
something, to stabilise the output.

I'd prefer to retain the partition names. I have now added a
COLLATE "C" for partname like this :

-\set show_data 'select tableoid::regclass::text partname, * from
range_parted order by 1, 2, 3, 4, 5, 6'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname,
* from range_parted order by 1, 2, 3, 4, 5, 6'

Thomas, can you please try the attached incremental patch
regress_locale_changes.patch and check if the test passes ? The patch
is to be applied on the main v22 patch. If the test passes, I will
include these changes (also for list_parted) in the upcoming v23
patch.

Thanks
-Amit Khandekar

Attachments:

regress_locale_changes.patchapplication/octet-stream; name=regress_locale_changes.patchDownload
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index a49980b..c39f87f 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -234,7 +234,7 @@ alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100)
 create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
 alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
 \set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
-\set show_data 'select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
 :init_range_parted;
 :show_data;
     partname    | a | b  |  c  | d  | e 
@@ -406,9 +406,9 @@ NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,
  part_a_10_a_20 | a | 10 | 200 |  1 | 
  part_a_1_a_10  | a |  1 |   1 |  1 | 
  part_c_1_100   | b | 13 |  98 |  2 | 
- part_d_1_15    | b | 12 | 110 |  1 | 
  part_d_15_20   | b | 15 | 106 | 16 | 
  part_d_15_20   | b | 17 | 106 | 19 | 
+ part_d_1_15    | b | 12 | 110 |  1 | 
 (6 rows)
 
 :init_range_parted;
@@ -428,10 +428,10 @@ NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,
 ----------------+---+----+-----+----+---
  part_a_10_a_20 | a | 10 | 200 |  1 | 
  part_a_1_a_10  | a |  1 |   1 |  1 | 
- part_d_1_15    | b | 12 | 146 |  1 | 
- part_d_1_15    | b | 13 | 147 |  2 | 
  part_d_15_20   | b | 15 | 155 | 16 | 
  part_d_15_20   | b | 17 | 155 | 19 | 
+ part_d_1_15    | b | 12 | 146 |  1 | 
+ part_d_1_15    | b | 13 | 147 |  2 | 
 (6 rows)
 
 drop trigger trans_updatetrig ON range_parted;
@@ -457,9 +457,9 @@ update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a
  part_a_10_a_20 | a | 10 | 200 |  1 | 
  part_a_1_a_10  | a |  1 |   1 |  1 | 
  part_c_1_100   | b | 15 |  98 |  2 | 
- part_d_1_15    | b | 15 | 110 |  1 | 
  part_d_15_20   | b | 17 | 106 | 16 | 
  part_d_15_20   | b | 19 | 106 | 19 | 
+ part_d_1_15    | b | 15 | 110 |  1 | 
 (6 rows)
 
 :init_range_parted;
@@ -469,10 +469,10 @@ update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
 ----------------+---+----+-----+----+---
  part_a_10_a_20 | a | 10 | 200 |  1 | 
  part_a_1_a_10  | a |  1 |   1 |  1 | 
- part_d_1_15    | b | 15 | 146 |  1 | 
- part_d_1_15    | b | 16 | 147 |  2 | 
  part_d_15_20   | b | 17 | 155 | 16 | 
  part_d_15_20   | b | 19 | 155 | 19 | 
+ part_d_1_15    | b | 15 | 146 |  1 | 
+ part_d_1_15    | b | 16 | 147 |  2 | 
 (6 rows)
 
 drop trigger trig_c1_100 ON part_c_1_100;
@@ -521,7 +521,7 @@ create trigger d15_insert_trig
 -- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
 update range_parted set c = c - 50 where c > 97;
 NOTICE:  trigger = parent_update_trig fired on table range_parted during UPDATE
-select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+:show_data;
     partname    | a | b  |  c  | d  | e 
 ----------------+---+----+-----+----+---
  part_a_10_a_20 | a | 10 | 150 |  1 | 
@@ -567,7 +567,7 @@ update part_def set a = 'd' where a = 'c';
 update part_def set a = 'a' where a = 'd';
 ERROR:  new row for relation "part_def" violates partition constraint
 DETAIL:  Failing row contains (a, 9, null, null, null).
-select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+:show_data;
     partname    | a | b  |  c  | d  | e 
 ----------------+---+----+-----+----+---
  part_a_10_a_20 | a | 10 | 200 |  1 | 
@@ -587,7 +587,7 @@ DETAIL:  Failing row contains (ad, 10, 200, 1, null).
 -- Success
 update range_parted set a = 'ad' where a = 'a';
 update range_parted set a = 'bd' where a = 'b';
-select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+:show_data;
  partname | a  | b  |  c  | d  | e 
 ----------+----+----+-----+----+---
  part_def | ad |  1 |   1 |  1 | 
@@ -603,7 +603,7 @@ select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3,
 -- Success
 update range_parted set a = 'a' where a = 'ad';
 update range_parted set a = 'b' where a = 'bd';
-select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+:show_data;
     partname    | a | b  |  c  | d  | e 
 ----------------+---+----+-----+----+---
  part_a_10_a_20 | a | 10 | 200 |  1 | 
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 0ec5bb2..b5add01 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -151,7 +151,7 @@ create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
 alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
 
 \set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
-\set show_data 'select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
 :init_range_parted;
 :show_data;
 
@@ -310,7 +310,7 @@ create trigger d15_insert_trig
 
 -- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
 update range_parted set c = c - 50 where c > 97;
-select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+:show_data;
 
 drop trigger parent_delete_trig ON range_parted;
 drop trigger parent_update_trig ON range_parted;
@@ -338,7 +338,7 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 
-select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+:show_data;
 
 -- Update row movement from non-default to default partition.
 -- Fail, default partition is not under part_a_10_a_20;
@@ -346,12 +346,12 @@ update part_a_10_a_20 set a = 'ad' where a = 'a';
 -- Success
 update range_parted set a = 'ad' where a = 'a';
 update range_parted set a = 'bd' where a = 'b';
-select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+:show_data;
 -- Update row movement from default to non-default partitions.
 -- Success
 update range_parted set a = 'a' where a = 'ad';
 update range_parted set a = 'b' where a = 'bd';
-select tableoid::regclass::text partname, * from range_parted order by 1, 2, 3, 4;
+:show_data;
 
 create table list_parted (
 	a text,
#189Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Amit Khandekar (#188)
Re: UPDATE of partition key

On Wed, Nov 8, 2017 at 5:57 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Thomas, can you please try the attached incremental patch
regress_locale_changes.patch and check if the test passes ? The patch
is to be applied on the main v22 patch. If the test passes, I will
include these changes (also for list_parted) in the upcoming v23
patch.

That looks good. Thanks.

--
Thomas Munro
http://www.enterprisedb.com


#190Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Amit Khandekar (#188)
Re: UPDATE of partition key

On Wed, Nov 8, 2017 at 5:57 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 8 November 2017 at 07:55, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

On Tue, Nov 7, 2017 at 8:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:

The changes to trigger.c still make me super-nervous. Hey THOMAS
MUNRO, any chance you could review that part?

At first, it seemed quite strange to me that row triggers and
statement triggers fire different events for the same modification.
Row triggers see DELETE + INSERT (necessarily because different
tables are involved), but this fact is hidden from the target table's
statement triggers.

The alternative would be for all triggers to see consistent events and
transitions. Instead of having your special case code in ExecInsert
and ExecDelete that creates the two halves of a 'synthetic' UPDATE for
the transition tables, you'd just let the existing ExecInsert and
ExecDelete code do its thing, and you'd need a flag to record that you
should also fire INSERT/DELETE after statement triggers if any rows
moved.

After sleeping on this question, I am coming around to the view that
the way you have it is right. The distinction isn't really between
row triggers and statement triggers, it's between triggers at
different levels in the hierarchy. It just so happens that we
currently only fire target table statement triggers and leaf table row
triggers. Future development ideas that seem consistent with your
choice:

1. If we ever allow row triggers with transition tables on child
tables, then I think *their* transition tables should certainly see
the deletes and inserts, otherwise OLD TABLE and NEW TABLE would be
inconsistent with the OLD and NEW variables in a single trigger
invocation. (These were prohibited mainly due to lack of time and
(AFAIK) limited usefulness; I think they would probably need their
own separate tuplestores, or possibly some kind of filtering.)

2. If we ever allow row triggers on partitioned tables (ie that fire
when its children are modified), then I think their UPDATE trigger
should probably fire when a row moves between any two (grand-)*child
tables, just as you have it for target table statement triggers. It
doesn't matter that the view from parent tables' triggers is
inconsistent with the view from leaf table triggers: it's a feature
that we 'hide' partitioning from the user to the extent we can so that
you can treat the partitioned table just like a table.

Any other views?

As for the code, I haven't figured out how to break it yet, and I'm
wondering if there is some way to refactor so that ExecInsert and
ExecDelete don't have to record pseudo-UPDATE trigger events.

--
Thomas Munro
http://www.enterprisedb.com


#191Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Langote (#183)
1 attachment(s)
Re: UPDATE of partition key

On 2 November 2017 at 12:40, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

ISTM, ModifyTableState now has one too
many TupleConversionMap pointer arrays after the patch, creating the need
to choose among them in the first place. AIUI -

* mt_perleaf_parentchild_maps:

- each entry is a map to convert root parent's tuples to a given leaf
partition's format

- used to be called mt_partition_tupconv_maps and is needed when tuple-
routing is in use; for both INSERT and UPDATE with tuple-routing

- as many entries in the array as there are leaf partitions and stored
in the partition bound order

* mt_perleaf_childparent_maps:

- each entry is a map to convert a leaf partition's tuples to the root
parent's format

- newly added by this patch and seems to be needed for UPDATE with
tuple-routing for two needs: 1. tuple-routing should start with a
tuple in root parent format whereas the tuple received is in leaf
partition format when ExecInsert() called for update-tuple-routing (by
ExecUpdate), 2. after tuple-routing, we must capture the tuple
inserted into the partition in the transition tuplestore which accepts
tuples in root parent's format

- as many entries in the array as there are leaf partitions and stored
in the partition bound order

* mt_persubplan_childparent_maps:

- each entry is a map to convert a child table's tuples to the root
parent's format

- used to be called mt_transition_tupconv_maps and needed for converting
child tuples to the root parent's format when storing them in the
transition tuplestore which accepts tuples in root parent's format

- as many entries in the array as there are sub-plans in mt_plans and
stored in either the partition bound order or unknown order (the
latter in the regular inheritance case)

Thanks for the detailed description. Yes, that's correct.

I think we could combine the last two into one. The only apparent reason
for them to be separate seems to be that the subplan array might contain
fewer entries than the perleaf array and ExecInsert() has only enough
information to calculate the offset of a map in the persubplan array.
That is, resultRelInfo of leaf partition that ExecInsert starts with in
the update-tuple-routing case comes from mtstate->resultRelInfo array
which contains only mt_nplans entries. So, if we only have the array with
entries for *all* partitions, it's hard to get the offset of the map to
use in that array.

I suggest we don't add a new map array and a significant amount of new
code to initialize the same and to implement the logic to choose the
correct array to get the map from. Instead, we could simply add an array
of integers with mt_nplans entries. Each entry is an offset of a given
sub-plan in the array containing entries of something for *all*
partitions. Since, we are teaching ExecSetupPartitionTupleRouting() to
reuse ResultRelInfos from mtstate->resultRelInfos, there is a suitable
place to construct such array. Let's say the array is called
mt_subplan_partition_offsets[]. Let ExecSetupPartitionTupleRouting() also
initialize the parent-to-partition maps for *all* partitions, in the
update-tuple-routing case. Then add a quick-return check in
ExecSetupTransitionCaptureState() to see if the map has already been set
by ExecSetupPartitionTupleRouting(). Since we're using the same map for
two purposes, we could rename mt_transition_tupconv_maps to something that
doesn't bind it to its use only for transition tuple capture.

I was trying hard to verify whether this is really going to simplify
the code. We are removing one array and adding one. In my approach,
the map structures are anyway shared, they are not duplicated. Because
I have separate arrays to access the tuple conversion map
partition-based or subplan-based, there is no need for extra logic to
get into the per-partition array. But on the other hand, we would not
need as many changes in ExecSetupTransitionCaptureState() as I have
made, although my patch hasn't resulted in more lines in that
function; it has just changed the logic.

Also, each time we access the map, we need to know whether it is
per-plan or per-partition, according to a set of factors like whether
transition tables are there and whether tuple routing is there.

But I realized that one plus point of your approach is that it is
going to be extensible if we later need to have some more per-subplan
information that is already there in a partition-wise array. In that
case, we just need to re-use the int[] map; we don't have to create
two new separate arrays; just create one per-leaf array, and use the
map to get into one of its elements, given a per-subplan index.

So I went ahead and did the changes :

New mtstate maps :

TupleConversionMap **mt_parentchild_tupconv_maps;
/* Per partition map for tuple conversion from root to leaf */
TupleConversionMap **mt_childparent_tupconv_maps;
/* Per plan/partition map for tuple conversion from child to root */
int *mt_subplan_partition_offsets;
/* Stores position of update result rels in leaf partitions */

We need to know whether mt_childparent_tupconv_maps is per-plan or
per-partition. Each time this map is accessed, it is tedious to go
through conditions that determine whether that map is per-partition or
not. Here are the conditions :

For transition tables :
    per-leaf map needed : in presence of tuple routing (insert or
    update, whichever)
    per-plan map needed : in presence of simple update (i.e. routing
    not involved)
For update tuple routing :
    per-plan map needed : always

So instead, added a new bool mtstate->mt_is_tupconv_perpart field that
is set to true only while setting up transition tables and that too
only when tuple routing is to be done.

Since both transition tables and update tuple routing need a
child-parent map, extracted the code to build the map into a common
function ExecSetupChildParentMap(). (I think I could have done this
earlier also)

Each time we need to access this map, we not only have to use the
int[] map, we also first need to check whether it's a per-leaf map. So
I put this logic in tupconv_map_for_subplan() and used it everywhere
we need the map.

Attached is v23 patch that has just the above changes (and also
rebased on hash-partitioning changes, like update.sql). I am still
doing some sanity testing on this, although regression passes.

I am yet to respond to the other review comments; will do that with a v24 patch.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

update-partition-key_v23.patchapplication/octet-stream; name=update-partition-key_v23.patchDownload
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index daba66c..6aac456 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3005,6 +3005,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition where this row satisfies its partition constraint.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3297,9 +3302,22 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</command> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</command> causes a row to move from one
+       partition to another, there is a chance that another concurrent
+       <command>UPDATE</command> or <command>DELETE</command> misses this row.
+       Suppose, during the row movement, the row is still visible for the
+       concurrent session, and it is about to do an <command>UPDATE</command>
+       or <command>DELETE</command> operation on the same row. This DML
+       operation can silently miss this row if the row now gets deleted from
+       the partition by the first session as part of its
+       <command>UPDATE</command> row movement. In such case, the concurrent
+       <command>UPDATE</command>/<command>DELETE</command>, being unaware of
+       the row movement, interprets that the row has just been deleted so there
+       is nothing to be done for this row. Whereas, in the usual case where the
+       table is not partitioned, or where there is no row movement, the second
+       session would have identified the newly updated row and carried
+       <command>UPDATE</command>/<command>DELETE</command> on this new row
+       version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 0e99aa9..bd57f3f 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</> and
+   <command>INSERT</> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</> or <command>DELETE</> on the same row may miss
+   this row. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index b0e160a..a8b000a 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -154,6 +154,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to
+    move to another partition, it will be performed as a
+    <command>DELETE</command> from the original partition followed by
+    <command>INSERT</command> into the new partition. In this case, all
+    row-level <literal>BEFORE</> <command>UPDATE</command> triggers and all
+    row-level <literal>BEFORE</> <command>DELETE</command> triggers are fired
+    on the original partition. Then all row-level <literal>BEFORE</>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</>
+    triggers are concerned, <literal>AFTER</> <command>DELETE</command> and
+    <literal>AFTER</> <command>INSERT</command> triggers are applied; but
+    <literal>AFTER</> <command>UPDATE</command> triggers are not applied
+    because the <command>UPDATE</command> has been converted to a
+    <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index cff59ed..1408cd6 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1442,7 +1442,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
 
 /*
  * map_partition_varattnos - maps varattno of any Vars in expr from the
- * parent attno to partition attno.
+ * attno's of 'from_rel' partition to the attno's of 'to_rel' partition.
+ * The rels can be both leaf partition or a partitioned table.
  *
  * We must allow for cases where physical attnos of a partition can be
  * different from the parent's.
@@ -1455,8 +1456,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
  * are working on Lists, so it's less messy to do the casts internally.
  */
 List *
-map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row)
 {
 	bool		my_found_whole_row = false;
@@ -1465,14 +1466,14 @@ map_partition_varattnos(List *expr, int target_varno,
 	{
 		AttrNumber *part_attnos;
 
-		part_attnos = convert_tuples_by_name_map(RelationGetDescr(partrel),
-												 RelationGetDescr(parent),
+		part_attnos = convert_tuples_by_name_map(RelationGetDescr(to_rel),
+												 RelationGetDescr(from_rel),
 												 gettext_noop("could not convert row type"));
 		expr = (List *) map_variable_attnos((Node *) expr,
-											target_varno, 0,
+											fromrel_varno, 0,
 											part_attnos,
-											RelationGetDescr(parent)->natts,
-											RelationGetForm(partrel)->reltype,
+											RelationGetDescr(from_rel)->natts,
+											RelationGetForm(to_rel)->reltype,
 											&my_found_whole_row);
 	}
 
@@ -2873,6 +2874,79 @@ error_exit:
 }
 
 /*
+ * pull_child_partition_columns
+ *
+ * For each column of rel which is in the partition key or which appears
+ * in an expression which is in the partition key, translate the attribute
+ * number of that column according to the given parent, and add the resulting
+ * column number to the 'partcols' bitmapset, offset as we frequently do by
+ * FirstLowInvalidHeapAttributeNumber.
+ */
+void
+pull_child_partition_columns(Relation rel,
+							 Relation parent,
+							 Bitmapset **partcols)
+{
+	PartitionKey key = RelationGetPartitionKey(rel);
+	int16		partnatts = get_partition_natts(key);
+	List	   *partexprs = get_partition_exprs(key);
+	ListCell   *lc;
+	Bitmapset  *child_keycols = NULL;
+	int			i;
+	AttrNumber *map;
+	int			child_keycol = -1;
+
+	/*
+	 * First, compute the complete set of partition columns for this rel. For
+	 * compatibility with the API exposed by pull_varattnos, we offset the
+	 * column numbers by FirstLowInvalidHeapAttributeNumber.
+	 */
+	for (i = 0; i < partnatts; i++)
+	{
+		AttrNumber	partattno = get_partition_col_attnum(key, i);
+
+		if (partattno != 0)
+			child_keycols =
+				bms_add_member(child_keycols,
+							   partattno - FirstLowInvalidHeapAttributeNumber);
+	}
+	foreach(lc, partexprs)
+	{
+		Node	   *expr = (Node *) lfirst(lc);
+
+		pull_varattnos(expr, 1, &child_keycols);
+	}
+
+	/*
+	 * Next, work out how to convert from the attribute numbers for the child
+	 * to the attribute numbers for the parent.
+	 */
+	map =
+		convert_tuples_by_name_map(RelationGetDescr(parent),
+								   RelationGetDescr(rel),
+								   gettext_noop("could not convert row type"));
+
+	/*
+	 * For each child key column we have identified, translate to the
+	 * corresponding parent key column.  Entry 0 in the map array corresponds
+	 * to attribute number 1, which corresponds to a bitmapset entry for 1 -
+	 * FirstLowInvalidHeapAttributeNumber.
+	 */
+	while ((child_keycol = bms_next_member(child_keycols, child_keycol)) >= 0)
+	{
+		int			kc = child_keycol + FirstLowInvalidHeapAttributeNumber;
+
+		Assert(kc > 0 && kc <= RelationGetNumberOfAttributes(rel));
+		*partcols =
+			bms_add_member(*partcols,
+						   map[kc - 1] - FirstLowInvalidHeapAttributeNumber);
+	}
+
+	/* Release memory. */
+	pfree(map);
+}
+
+/*
  * qsort_partition_hbound_cmp
  *
  * We sort hash bounds by modulus, then by remainder.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 8f1a8ed..547a18b 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2478,11 +2478,14 @@ CopyFrom(CopyState cstate)
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(cstate->rel,
+									   NULL,
+									   0,
 									   1,
 									   estate,
 									   &partition_dispatch_info,
 									   &partitions,
 									   &partition_tupconv_maps,
+									   NULL,
 									   &partition_tuple_slot,
 									   &num_parted, &num_partitions);
 		cstate->partition_dispatch_info = partition_dispatch_info;
@@ -2748,7 +2751,7 @@ CopyFrom(CopyState cstate)
 
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr || check_partition_constr)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 92ae382..319aa6f 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2854,8 +2854,13 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 	{
 		HeapTuple	trigtuple;
 
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
+		/*
+		 * Note: if the UPDATE is converted into a DELETE+INSERT as part of
+		 * update-partition-key operation, then this function is also called
+		 * separately for DELETE and INSERT to capture transition table rows.
+		 * In such a case, either old tuple or new tuple can be NULL.
+		 */
+		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
 			trigtuple = GetTupleForTrigger(estate,
 										   NULL,
 										   relinfo,
@@ -5414,7 +5419,12 @@ AfterTriggerPendingOnRel(Oid relid)
  *	triggers actually need to be queued.  It is also called after each row,
  *	even if there are no triggers for that event, if there are any AFTER
  *	STATEMENT triggers for the statement which use transition tables, so that
- *	the transition tuplestores can be built.
+ *	the transition tuplestores can be built.  Furthermore, if the transition
+ *	capture is happening for UPDATEd rows being moved to another partition due to
+ *	partition-key change, then this function is called once when the row is
+ *	deleted (to capture OLD row), and once when the row is inserted to another
+ *	partition (to capture NEW row).  This is done separately because DELETE and
+ *	INSERT happen on different tables.
  *
  *	Transition tuplestores are built now, rather than when events are pulled
  *	off of the queue because AFTER ROW triggers are allowed to select from the
@@ -5463,12 +5473,27 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		bool		update_new_table = transition_capture->tcs_update_new_table;
 		bool		insert_new_table = transition_capture->tcs_insert_new_table;;
 
-		if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_old_table))
+		/*
+		 * For capturing transition tuples for UPDATE events fired during
+		 * partition row movement, either oldtup or newtup can be NULL,
+		 * depending on whether the event is for row being deleted from old
+		 * partition or it's for row being inserted into the new partition. But
+		 * in any case, oldtup should always be non-NULL for DELETE events, and
+		 * newtup should be non-NULL for INSERT events, because for transition
+		 * capture with partition row movement, INSERT and DELETE events don't
+		 * fire; only UPDATE event is fired.
+		 */
+		Assert(!(event == TRIGGER_EVENT_DELETE && delete_old_table &&
+				 oldtup == NULL));
+		Assert(!(event == TRIGGER_EVENT_INSERT && insert_new_table &&
+				 newtup == NULL));
+
+		if (oldtup != NULL &&
+			((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_old_table)))
 		{
 			Tuplestorestate *old_tuplestore;
 
-			Assert(oldtup != NULL);
 			old_tuplestore = transition_capture->tcs_private->old_tuplestore;
 
 			if (map != NULL)
@@ -5481,12 +5506,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			else
 				tuplestore_puttuple(old_tuplestore, oldtup);
 		}
-		if ((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_new_table))
+		if (newtup != NULL &&
+			((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
+			(event == TRIGGER_EVENT_UPDATE && update_new_table)))
 		{
 			Tuplestorestate *new_tuplestore;
 
-			Assert(newtup != NULL);
 			new_tuplestore = transition_capture->tcs_private->new_tuplestore;
 
 			if (original_insert_tuple != NULL)
@@ -5506,7 +5531,8 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		if (trigdesc == NULL ||
 			(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
 			(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+			(event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
 			return;
 	}
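The effect of the new `(oldtup == NULL) ^ (newtup == NULL)` filter can be sketched in isolation: an UPDATE event raised during row movement carries only one of the two tuples, and such a "half" event must not queue an AFTER ROW UPDATE trigger; it exists only so the transition tuplestores can capture the row.  The function below is a standalone illustration with made-up names, not PostgreSQL code:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Standalone sketch of the AfterTriggerSaveEvent filter above.  An UPDATE
 * event fired during partition row movement has exactly one of oldtup /
 * newtup set, and must be suppressed for AFTER ROW UPDATE trigger purposes.
 * Mirrors: (event == TRIGGER_EVENT_UPDATE &&
 *           ((oldtup == NULL) ^ (newtup == NULL)))
 */
static bool
suppress_after_row_update(bool have_oldtup, bool have_newtup)
{
	return (!have_oldtup) ^ (!have_newtup);
}
```

An ordinary UPDATE (both tuples present) is not suppressed; either half of a row movement (only one tuple present) is.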
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 493ff82..520dfd3 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -104,9 +104,6 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
 									 int maxfieldlen);
 static void EvalPlanQualStart(EPQState *epqstate, EState *parentestate,
 				  Plan *planTree);
-static void ExecPartitionCheck(ResultRelInfo *resultRelInfo,
-				   TupleTableSlot *slot, EState *estate);
-
 /*
  * Note that GetUpdatedColumns() also exists in commands/trigger.c.  There does
  * not appear to be any good header to put it into, given the structures that
@@ -1850,15 +1847,10 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
 /*
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  */
-static void
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
-	Relation	rel = resultRelInfo->ri_RelationDesc;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	Bitmapset  *modifiedCols;
-	Bitmapset  *insertedCols;
-	Bitmapset  *updatedCols;
 	ExprContext *econtext;
 
 	/*
@@ -1886,52 +1878,66 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
 	 */
-	if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+	return ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext);
+}
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
+
+	/* See the comments in ExecConstraints. */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
+									 gettext_noop("could not convert row type"));
+		if (map != NULL)
 		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
-										 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecSetSlotDescriptor(slot, tupdesc);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+			tuple = do_convert_tuple(tuple, map);
+			ExecSetSlotDescriptor(slot, tupdesc);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("new row for relation \"%s\" violates partition constraint",
-						RelationGetRelationName(orig_rel)),
-				 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+			 errmsg("new row for relation \"%s\" violates partition constraint",
+					RelationGetRelationName(orig_rel)),
+			 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if requested,
+ * checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1939,7 +1945,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2055,8 +2062,9 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 }
 
 
@@ -3242,6 +3250,13 @@ EvalPlanQualEnd(EPQState *epqstate)
  * ExecSetupPartitionTupleRouting - set up information needed during
  * tuple routing for partitioned tables
  *
+ * 'update_rri' contains the UPDATE per-subplan result rels.  These are
+ *		re-used while generating the array of all leaf partition result rels,
+ *		instead of allocating new ones.
+ *
+ * 'num_update_rri' is the number of UPDATE per-subplan result rels; for
+ *		INSERT, this is 0.
+ *
  * Output arguments:
  * 'pd' receives an array of PartitionDispatch objects with one entry for
  *		every partitioned table in the partition tree
@@ -3265,11 +3280,14 @@ EvalPlanQualEnd(EPQState *epqstate)
  */
 void
 ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionDispatch **pd,
 							   ResultRelInfo ***partitions,
 							   TupleConversionMap ***tup_conv_maps,
+							   int **subplan_leaf_map,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions)
 {
@@ -3277,7 +3295,8 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
+	ResultRelInfo *leaf_part_arr = NULL;	/* used only for INSERT */
+	int			update_rri_index = 0;
 
 	/*
 	 * Get the information about the partition tree after locking all the
@@ -3286,11 +3305,45 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
 	*pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
 	*num_partitions = list_length(leaf_parts);
+	if (subplan_leaf_map)
+		*subplan_leaf_map = NULL;
 	*partitions = (ResultRelInfo **) palloc(*num_partitions *
 											sizeof(ResultRelInfo *));
 	*tup_conv_maps = (TupleConversionMap **) palloc0(*num_partitions *
 													 sizeof(TupleConversionMap *));
 
+	if (num_update_rri != 0)
+	{
+		/*
+		 * For UPDATEs, if a leaf partition is already present among the
+		 * per-subplan result rels, we re-use it rather than initializing a
+		 * new one.  The per-subplan result rels and the leaf partition
+		 * result rels are both in the same canonical order, so a single
+		 * forward scan over the leaf partition OIDs suffices: set
+		 * update_rri_index to the first per-subplan result rel, and advance
+		 * it each time we find a match.
+		 */
+		update_rri_index = 0;
+
+		/*
+		 * Prepare for generating the mapping from subplan result rels to leaf
+		 * partition position.
+		 */
+		*subplan_leaf_map = palloc(num_update_rri * sizeof(int));
+	}
+	else
+	{
+		/*
+		 * For inserts, we need to create all new result rels, so avoid
+		 * repeated pallocs by allocating memory for all the result rels in
+		 * bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(*num_partitions *
+												  sizeof(ResultRelInfo));
+	}
+
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
 	 * given partition's rowtype.  It is attached to the caller-specified node
@@ -3299,20 +3352,66 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	 */
 	*partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = (ResultRelInfo *) palloc0(*num_partitions *
-											  sizeof(ResultRelInfo));
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (num_update_rri != 0)
+		{
+			/* Is this leaf partition present among the UPDATE result rels? */
+			if (update_rri_index < num_update_rri &&
+				RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
+			{
+				leaf_part_rri = &update_rri[update_rri_index];
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required for converting the tuple as per the root
+				 * partition's tuple descriptor; it was not set when the
+				 * UPDATE plans were generated.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+
+				/*
+				 * Save the position of this update rel in the leaf partitions
+				 * array
+				 */
+				(*subplan_leaf_map)[update_rri_index] = i;
+
+				update_rri_index++;
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = leaf_part_arr + i;
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in *partitions are eventually
-		 * closed by the caller.
+		 * If we didn't open the partition rel above, it means we haven't
+		 * initialized its result rel either.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above including the leaf
+			 * partitions. Note that each of the newly opened relations in
+			 * *partitions are eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri,
+							  partrel,
+							  resultRTindex,
+							  rel,
+							  estate->es_instrument);
+		}
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
@@ -3322,14 +3421,10 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
 													 gettext_noop("could not convert row type"));
 
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  resultRTindex,
-						  rel,
-						  estate->es_instrument);
-
 		/*
-		 * Verify result relation is a valid target for INSERT.
+		 * Verify that the result relation is a valid target for INSERT.  Even
+		 * for an UPDATE, tuple routing performs an INSERT into the chosen
+		 * partition, so the same check applies.
 		 */
 		CheckValidResultRel(leaf_part_rri, CMD_INSERT);
 
@@ -3345,9 +3440,15 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		(*partitions)[i] = leaf_part_rri++;
+		(*partitions)[i] = leaf_part_rri;
 		i++;
 	}
+
+	/*
+	 * For UPDATE, we should have found all the per-subplan resultrels in the
+	 * leaf partitions.
+	 */
+	Assert(num_update_rri == 0 || update_rri_index == num_update_rri);
 }
 
 /*
@@ -3373,8 +3474,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 	result = get_partition_for_tuple(pd, slot, estate,
 									 &failed_at, &failed_slot);
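The per-subplan reuse logic above relies on both lists being in the same canonical order, so one forward scan matches them up.  A standalone sketch of that single-pass matching (illustrative names and plain unsigned OIDs, not PostgreSQL code) looks like this:

```c
#include <assert.h>

/*
 * Standalone sketch of the reuse logic in ExecSetupPartitionTupleRouting
 * for UPDATE: walk the leaf partition OIDs in order, and whenever the next
 * pending per-subplan OID matches, record that subplan's position in the
 * leaf array.  Returns the number of subplans matched, which should equal
 * num_subplans when every subplan rel is a leaf partition.
 */
static int
map_subplans_to_leaves(const unsigned int *leaf_oids, int num_leaves,
					   const unsigned int *subplan_oids, int num_subplans,
					   int *subplan_leaf_map)
{
	int			update_rri_index = 0;
	int			i;

	for (i = 0; i < num_leaves; i++)
	{
		if (update_rri_index < num_subplans &&
			subplan_oids[update_rri_index] == leaf_oids[i])
		{
			/* re-use this subplan's result rel for leaf partition i */
			subplan_leaf_map[update_rri_index] = i;
			update_rri_index++;
		}
		/* else: a fresh ResultRelInfo would be initialized for leaf i */
	}
	return update_rri_index;
}
```

With leaf OIDs {10, 20, 30, 40} and subplan OIDs {20, 40}, the map comes out as {1, 3}: subplan 0 sits at leaf position 1, subplan 1 at position 3.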
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index fb538c0..e11f7cb 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -401,7 +401,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -466,7 +466,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 0027d21..750b0f7 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -45,6 +45,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -62,7 +63,16 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 EState *estate,
 					 bool canSetTag,
 					 TupleTableSlot **returning);
-
+static void ExecSetupChildParentMap(ModifyTableState *mtstate,
+						ResultRelInfo *rootRelInfo,
+						int numResultRelInfos, bool perleaf);
+static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
+													int whichplan);
+static HeapTuple ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+										   TupleConversionMap *map,
+										   HeapTuple tuple,
+										   TupleTableSlot *new_slot,
+										   TupleTableSlot **p_old_slot);
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
  * target relation's rowtype
@@ -240,6 +250,38 @@ ExecCheckTIDVisible(EState *estate,
 	ReleaseBuffer(buffer);
 }
 
+/*
+ * ConvertPartitionTupleSlot -- convenience function for converting a tuple
+ * and storing it in the tuple slot provided through 'new_slot', which should
+ * typically be one of the dedicated partition tuple slots.  The slot is
+ * passed back through the output parameter p_old_slot; if no conversion map
+ * is present, p_old_slot is left unchanged.
+ *
+ * Returns the converted tuple.
+ */
+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+						  TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot,
+						  TupleTableSlot **p_old_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/* Update the slot's tuple descriptor to match the converted tuple. */
+	*p_old_slot = new_slot;
+	Assert(new_slot != NULL);
+	ExecSetSlotDescriptor(new_slot, map->outdesc);
+	ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
 /* ----------------------------------------------------------------
  *		ExecInsert
  *
@@ -265,6 +307,7 @@ ExecInsert(ModifyTableState *mtstate,
 	Oid			newId;
 	List	   *recheckIndexes = NIL;
 	TupleTableSlot *result = NULL;
+	TransitionCaptureState *transition_capture = mtstate->mt_transition_capture;
 
 	/*
 	 * get the heap tuple out of the tuple table slot, making sure we have a
@@ -281,17 +324,50 @@ ExecInsert(ModifyTableState *mtstate,
 	if (mtstate->mt_partition_dispatch_info)
 	{
 		int			leaf_part_index;
-		TupleConversionMap *map;
+		ResultRelInfo *rootResultRelInfo;
+
+		/*
+		 * If the original operation is UPDATE, the root partitioned table
+		 * needs to be fetched from mtstate->rootResultRelInfo.
+		 */
+		rootResultRelInfo = (mtstate->rootResultRelInfo ?
+							 mtstate->rootResultRelInfo : resultRelInfo);
+
+		/*
+		 * If the resultRelInfo is not the root partitioned table (which
+		 * happens for UPDATE), we should convert the tuple into root's tuple
+		 * descriptor, since ExecFindPartition() starts the search from root.
+		 * The tuple conversion map list is in the order of
+		 * mtstate->resultRelInfo[], so to retrieve the one for this resultRel,
+		 * we need to know the position of the resultRel in
+		 * mtstate->resultRelInfo[].
+		 */
+		if (rootResultRelInfo != resultRelInfo)
+		{
+			int			map_index = resultRelInfo - mtstate->resultRelInfo;
+			TupleConversionMap *tupconv_map;
+
+			/* resultRelInfo must be one of the per-subplan result rels. */
+			Assert(resultRelInfo >= mtstate->resultRelInfo &&
+				   resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1);
+
+			tupconv_map = tupconv_map_for_subplan(mtstate, map_index);
+			tuple = ConvertPartitionTupleSlot(mtstate,
+											  tupconv_map,
+											  tuple,
+											  mtstate->mt_root_tuple_slot,
+											  &slot);
+		}
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
 		 * ExecFindPartition() does not return and errors out instead.
 		 * Otherwise, the returned value is to be used as an index into arrays
-		 * mt_partitions[] and mt_partition_tupconv_maps[] that will get us
+		 * mt_partitions[] and mt_parentchild_tupconv_maps[] that will get us
 		 * the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
-		leaf_part_index = ExecFindPartition(resultRelInfo,
+		leaf_part_index = ExecFindPartition(rootResultRelInfo,
 											mtstate->mt_partition_dispatch_info,
 											slot,
 											estate);
@@ -330,8 +406,10 @@ ExecInsert(ModifyTableState *mtstate,
 				 * back to tuplestore format.
 				 */
 				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+
+				Assert(mtstate->mt_is_tupconv_perpart);
 				mtstate->mt_transition_capture->tcs_map =
-					mtstate->mt_transition_tupconv_maps[leaf_part_index];
+					mtstate->mt_childparent_tupconv_maps[leaf_part_index];
 			}
 			else
 			{
@@ -344,30 +422,21 @@ ExecInsert(ModifyTableState *mtstate,
 			}
 		}
 		if (mtstate->mt_oc_transition_capture != NULL)
+		{
+			Assert(mtstate->mt_is_tupconv_perpart);
 			mtstate->mt_oc_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[leaf_part_index];
+				mtstate->mt_childparent_tupconv_maps[leaf_part_index];
+		}
 
 		/*
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = mtstate->mt_partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = mtstate->mt_partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(mtstate,
+										  mtstate->mt_parentchild_tupconv_maps[leaf_part_index],
+										  tuple,
+										  mtstate->mt_partition_tuple_slot,
+										  &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -485,7 +554,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 		/* Check the constraints of the tuple */
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -621,9 +690,31 @@ ExecInsert(ModifyTableState *mtstate,
 		setLastTid(&(tuple->t_self));
 	}
 
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition NEW TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_new_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo, NULL,
+							 NULL,
+							 tuple,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * Now that the NEW TABLE row has been captured, make sure that the
+		 * AFTER ROW INSERT trigger fired below does not capture it again.
+		 */
+		transition_capture = NULL;
+	}
+
 	/* AFTER ROW INSERT Triggers */
 	ExecARInsertTriggers(estate, resultRelInfo, tuple, recheckIndexes,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
 	list_free(recheckIndexes);
 
@@ -677,6 +768,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool *delete_skipped,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -684,6 +777,10 @@ ExecDelete(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
+	TransitionCaptureState *transition_capture = mtstate->mt_transition_capture;
+
+	if (delete_skipped)
+		*delete_skipped = true;
 
 	/*
 	 * get information on the (current) result relation
@@ -848,12 +945,39 @@ ldelete:;
 	if (canSetTag)
 		(estate->es_processed)++;
 
+	/* The delete has actually happened, so inform the caller. */
+	if (delete_skipped)
+		*delete_skipped = false;
+
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition OLD TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_old_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo,
+							 tupleid,
+							 oldtuple,
+							 NULL,
+							 NULL,
+							 transition_capture);
+
+		/*
+		 * Now that the OLD TABLE row has been captured, make sure that the
+		 * AFTER ROW DELETE trigger fired below does not capture it again.
+		 */
+		transition_capture = NULL;
+	}
+
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -946,6 +1070,7 @@ ExecUpdate(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	TupleConversionMap *saved_tcs_map = NULL;
 
 	/*
 	 * abort the operation if not running transactions
@@ -1042,12 +1167,82 @@ lreplace:;
 								 resultRelInfo, slot, estate);
 
 		/*
+		 * If a partition check fails, try to move the row into the right
+		 * partition.
+		 */
+		if (resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate))
+		{
+			bool		delete_skipped;
+			TupleTableSlot *ret_slot;
+
+			/*
+			 * When an UPDATE is run directly on a leaf partition, partition
+			 * tuple routing is not set up.  In that case, fail with a
+			 * partition constraint violation error.
+			 */
+			if (mtstate->mt_partition_dispatch_info == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for the DELETE; rows should be
+			 * returned from the INSERT instead.
+			 */
+			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
+					   &delete_skipped, false, false);
+
+			/*
+			 * If for some reason the DELETE didn't happen (e.g. a trigger
+			 * prevented it, or it was already deleted by this command, or it
+			 * was concurrently deleted by another transaction), then we
+			 * should skip the INSERT as well; otherwise, one new row would
+			 * effectively be inserted.
+			 *
+			 * For a normal UPDATE, the case where the tuple has been the
+			 * subject of a concurrent UPDATE or DELETE would be handled by
+			 * the EvalPlanQual machinery, but for an UPDATE that we've
+			 * translated into a DELETE from this partition and an INSERT into
+			 * some other partition, that's not available, because CTID chains
+			 * can't span relation boundaries.  We mimic the semantics to a
+			 * limited extent by skipping the INSERT if the DELETE fails to
+			 * find a tuple. This ensures that two concurrent attempts to
+			 * UPDATE the same tuple at the same time can't turn one tuple
+			 * into two, and that an UPDATE of a just-deleted tuple can't
+			 * resurrect it.
+			 */
+			if (delete_skipped)
+				return NULL;
+
+			if (mtstate->mt_transition_capture)
+				saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
+
+			ret_slot = ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+
+			if (mtstate->mt_transition_capture)
+			{
+				/*
+				 * Revert to the transition capture map created for UPDATE;
+				 * otherwise the next UPDATE will incorrectly use the one
+				 * created for INSERT.
+				 */
+				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+				mtstate->mt_transition_capture->tcs_map = saved_tcs_map;
+			}
+			return ret_slot;
+		}
+
+		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We have already checked partition constraints above, so skip
+		 * checking them here.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1475,7 +1670,6 @@ static void
 ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 {
 	ResultRelInfo *targetRelInfo = getASTriggerResultRelInfo(mtstate);
-	int			i;
 
 	/* Check for transition tables on the directly targeted relation. */
 	mtstate->mt_transition_capture =
@@ -1504,55 +1698,113 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 							 mtstate->mt_num_partitions :
 							 mtstate->mt_nplans);
 
+		ExecSetupChildParentMap(mtstate, targetRelInfo, numResultRelInfos,
+								(mtstate->mt_partition_dispatch_info != NULL));
+
 		/*
-		 * Build array of conversion maps from each child's TupleDesc to the
-		 * one used in the tuplestore.  The map pointers may be NULL when no
-		 * conversion is necessary, which is hopefully a common case for
-		 * partitions.
+		 * Install the conversion map for the first plan for UPDATE and DELETE
+		 * operations.  It will be advanced each time we switch to the next
+		 * plan.  (INSERT operations set it every time, so we need not update
+		 * mtstate->mt_oc_transition_capture here.)
 		 */
-		mtstate->mt_transition_tupconv_maps = (TupleConversionMap **)
-			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+		if (mtstate->mt_transition_capture && mtstate->operation != CMD_INSERT)
+			mtstate->mt_transition_capture->tcs_map =
+				tupconv_map_for_subplan(mtstate, 0);
+	}
+}
 
-		/* Choose the right set of partitions */
-		if (mtstate->mt_partition_dispatch_info != NULL)
-		{
-			/*
-			 * For tuple routing among partitions, we need TupleDescs based
-			 * on the partition routing table.
-			 */
-			ResultRelInfo **resultRelInfos = mtstate->mt_partitions;
+/*
+ * Initialize the child-to-root tuple conversion map array.
+ */
+static void
+ExecSetupChildParentMap(ModifyTableState *mtstate,
+						ResultRelInfo *rootRelInfo,
+						int numResultRelInfos, bool perleaf)
+{
+	TupleDesc	outdesc;
+	int			i;
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
-		}
-		else
-		{
-			/* Otherwise we need the ResultRelInfo for each subplan. */
-			ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+	/* First check if there is already one */
+	if (mtstate->mt_childparent_tupconv_maps)
+		return;
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
+	/* Get tuple descriptor of the root partitioned table. */
+	outdesc = RelationGetDescr(rootRelInfo->ri_RelationDesc);
+
+	/*
+	 * Build array of conversion maps from each child's TupleDesc to the
+	 * one used in the tuplestore.  The map pointers may be NULL when no
+	 * conversion is necessary, which is hopefully a common case for
+	 * partitions.
+	 */
+	mtstate->mt_childparent_tupconv_maps = (TupleConversionMap **)
+		palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+
+	/* Choose the right set of partitions */
+	if (perleaf)
+	{
+		/*
+		 * For tuple routing among partitions, we need TupleDescs based
+		 * on the partition routing table.
+		 */
+		ResultRelInfo **resultRelInfos = mtstate->mt_partitions;
+
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_childparent_tupconv_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
+									   outdesc,
+									   gettext_noop("could not convert row type"));
 		}
 
 		/*
-		 * Install the conversion map for the first plan for UPDATE and DELETE
-		 * operations.  It will be advanced each time we switch to the next
-		 * plan.  (INSERT operations set it every time, so we need not update
-		 * mtstate->mt_oc_transition_capture here.)
+		 * Remember that the tuple conversion maps are per-leaf-partition,
+		 * not per-subplan.
 		 */
-		if (mtstate->mt_transition_capture)
-			mtstate->mt_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[0];
+		mtstate->mt_is_tupconv_perpart = true;
+	}
+	else
+	{
+		/* Otherwise we need the ResultRelInfo for each subplan. */
+		ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_childparent_tupconv_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
+									   outdesc,
+									   gettext_noop("could not convert row type"));
+		}
+	}
+
+}
+
+/*
+ * For a given subplan index, get the tuple conversion map.
+ */
+static TupleConversionMap *
+tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
+{
+	Assert(mtstate->mt_childparent_tupconv_maps != NULL);
+
+	/*
+	 * If the tuple conversion map array is per-partition, we need to first get
+	 * the index into the partition array.
+	 */
+	if (mtstate->mt_is_tupconv_perpart)
+	{
+		int leaf_index;
+
+		Assert(mtstate->mt_subplan_partition_offsets != NULL);
+		leaf_index = mtstate->mt_subplan_partition_offsets[whichplan];
+
+		Assert(leaf_index >= 0 && leaf_index < mtstate->mt_num_partitions);
+		return mtstate->mt_childparent_tupconv_maps[leaf_index];
+	}
+	else
+	{
+		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+		return mtstate->mt_childparent_tupconv_maps[whichplan];
 	}
 }
 
@@ -1659,15 +1911,13 @@ ExecModifyTable(PlanState *pstate)
 				/* Prepare to convert transition tuples from this child. */
 				if (node->mt_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
 					node->mt_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						tupconv_map_for_subplan(node, node->mt_whichplan);
 				}
 				if (node->mt_oc_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
 					node->mt_oc_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						tupconv_map_for_subplan(node, node->mt_whichplan);
 				}
 				continue;
 			}
@@ -1783,7 +2033,8 @@ ExecModifyTable(PlanState *pstate)
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1828,9 +2079,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ResultRelInfo *resultRelInfo;
 	TupleDesc	tupDesc;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	bool		update_tuple_routing_needed = node->part_cols_updated;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -1903,6 +2157,15 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(resultRelInfo, mtstate->mt_onconflict != ONCONFLICT_NONE);
 
+		/*
+		 * If this is an UPDATE and a BEFORE UPDATE trigger is present, we may
+		 * need to do update tuple routing.
+		 */
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+			operation == CMD_UPDATE)
+			update_tuple_routing_needed = true;
+
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
 		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
@@ -1940,31 +2203,51 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/* Decide whether we need to perform update tuple routing. */
+	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		update_tuple_routing_needed = false;
+
+	/*
+	 * Build state for tuple routing if it's an INSERT or if it's an UPDATE of
+	 * partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || update_tuple_routing_needed))
 	{
 		PartitionDispatch *partition_dispatch_info;
 		ResultRelInfo **partitions;
 		TupleConversionMap **partition_tupconv_maps;
+		int *subplan_leaf_map;
 		TupleTableSlot *partition_tuple_slot;
 		int			num_parted,
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(rel,
+									   mtstate->resultRelInfo,
+									   (operation == CMD_UPDATE ? nplans : 0),
 									   node->nominalRelation,
 									   estate,
 									   &partition_dispatch_info,
 									   &partitions,
 									   &partition_tupconv_maps,
+									   &subplan_leaf_map,
 									   &partition_tuple_slot,
 									   &num_parted, &num_partitions);
 		mtstate->mt_partition_dispatch_info = partition_dispatch_info;
 		mtstate->mt_num_dispatch = num_parted;
 		mtstate->mt_partitions = partitions;
 		mtstate->mt_num_partitions = num_partitions;
-		mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
+		mtstate->mt_parentchild_tupconv_maps = partition_tupconv_maps;
+		mtstate->mt_subplan_partition_offsets = subplan_leaf_map;
 		mtstate->mt_partition_tuple_slot = partition_tuple_slot;
+		mtstate->mt_root_tuple_slot = MakeTupleTableSlot();
+
+		/*
+		 * These are needed as reference objects for mapping partition
+		 * attno's in expressions such as WCO and RETURNING.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	}
 
 	/*
@@ -1975,6 +2258,18 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		ExecSetupTransitionCaptureState(mtstate, estate);
 
 	/*
+	 * Construct mapping from each of the partition attnos to the root attno.
+	 * This is required during update row movement, when the tuple descriptor
+	 * of a source partition does not match the root partitioned table's.
+	 * In such a case we need to convert tuples to the root tuple descriptor,
+	 * because the search for destination partition starts from the root.  Skip
+	 * this setup if it's not a partition key update.
+	 */
+	if (update_tuple_routing_needed)
+		ExecSetupChildParentMap(mtstate, getASTriggerResultRelInfo(mtstate),
+								mtstate->mt_nplans, false);
+
+	/*
 	 * Initialize any WITH CHECK OPTION constraints if needed.
 	 */
 	resultRelInfo = mtstate->resultRelInfo;
@@ -2004,26 +2299,29 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE row
+	 * movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *first_wcoList;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition. Whereas for UPDATE, there are as many WCOs as there are
+		 * plans. So in either case, use the WCO expression of the first
+		 * resultRelInfo as a reference to calculate attno's for the WCO
+		 * expression of each of the partitions. We make a copy of the WCO
+		 * qual for each partition. Note that, if there are SubPlans in there,
+		 * they all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
-			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
+		Assert(update_tuple_routing_needed ||
+			   (operation == CMD_INSERT &&
+				list_length(node->withCheckOptionLists) == 1 &&
+				mtstate->mt_nplans == 1));
+
+		first_wcoList = linitial(node->withCheckOptionLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
 			Relation	partrel;
@@ -2032,17 +2330,26 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			ListCell   *ll;
 
 			resultRelInfo = mtstate->mt_partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the update
+			 * result rels, that result rel would already have WithCheckOptions
+			 * initialized.
+			 */
+			if (resultRelInfo->ri_WithCheckOptions)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel, NULL);
+			mapped_wcoList = map_partition_varattnos(first_wcoList,
+													firstVarno,
+													partrel, firstResultRel,
+													NULL);
 			foreach(ll, mapped_wcoList)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
@@ -2059,7 +2366,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -2096,22 +2403,35 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *rlist;
 
 			resultRelInfo = mtstate->mt_partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the update
+			 * result rels, that result rel would already have a returningList
+			 * built.
+			 */
+			if (resultRelInfo->ri_projectReturning)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel, NULL);
+			/*
+			 * Use the returning expression of the first resultRelInfo as a
+			 * reference to calculate attno's for the returning expression of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList,
+											firstVarno,
+											partrel, firstResultRel, NULL);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
@@ -2356,6 +2676,7 @@ void
 ExecEndModifyTable(ModifyTableState *node)
 {
 	int			i;
+	CmdType		operation = node->operation;
 
 	/*
 	 * Allow any FDWs to shut down
@@ -2390,11 +2711,23 @@ ExecEndModifyTable(ModifyTableState *node)
 	{
 		ResultRelInfo *resultRelInfo = node->mt_partitions[i];
 
+		/*
+		 * If this result rel is one of the subplan result rels, let
+		 * ExecEndPlan() close it. For INSERTs, this does not apply because
+		 * all leaf partition result rels are newly allocated anyway.
+		 */
+		if (operation == CMD_UPDATE &&
+			resultRelInfo >= node->resultRelInfo &&
+			resultRelInfo < node->resultRelInfo + node->mt_nplans)
+			continue;
+
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
 
-	/* Release the standalone partition tuple descriptor, if any */
+	/* Release the standalone partition tuple descriptors, if any */
+	if (node->mt_root_tuple_slot)
+		ExecDropSingleTupleTableSlot(node->mt_root_tuple_slot);
 	if (node->mt_partition_tuple_slot)
 		ExecDropSingleTupleTableSlot(node->mt_partition_tuple_slot);
 
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index cadd253..35edd66 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -204,6 +204,7 @@ _copyModifyTable(const ModifyTable *from)
 	COPY_SCALAR_FIELD(canSetTag);
 	COPY_SCALAR_FIELD(nominalRelation);
 	COPY_NODE_FIELD(partitioned_rels);
+	COPY_SCALAR_FIELD(part_cols_updated);
 	COPY_NODE_FIELD(resultRelations);
 	COPY_SCALAR_FIELD(resultRelIndex);
 	COPY_SCALAR_FIELD(rootResultRelIndex);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 291d1ee..48099ca 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -372,6 +372,7 @@ _outModifyTable(StringInfo str, const ModifyTable *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_INT_FIELD(resultRelIndex);
 	WRITE_INT_FIELD(rootResultRelIndex);
@@ -2100,6 +2101,7 @@ _outModifyTablePath(StringInfo str, const ModifyTablePath *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_NODE_FIELD(subroots);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 42c595d..7293d8a 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1568,6 +1568,7 @@ _readModifyTable(void)
 	READ_BOOL_FIELD(canSetTag);
 	READ_UINT_FIELD(nominalRelation);
 	READ_NODE_FIELD(partitioned_rels);
+	READ_BOOL_FIELD(part_cols_updated);
 	READ_NODE_FIELD(resultRelations);
 	READ_INT_FIELD(resultRelIndex);
 	READ_INT_FIELD(rootResultRelIndex);
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 9c74e39..524ba00 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -278,6 +278,7 @@ static ProjectSet *make_project_set(List *tlist, Plan *subplan);
 static ModifyTable *make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
@@ -2371,6 +2372,7 @@ create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 							best_path->canSetTag,
 							best_path->nominalRelation,
 							best_path->partitioned_rels,
+							best_path->part_cols_updated,
 							best_path->resultRelations,
 							subplans,
 							best_path->withCheckOptionLists,
@@ -6427,6 +6429,7 @@ static ModifyTable *
 make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool part_cols_updated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam)
@@ -6453,6 +6456,7 @@ make_modifytable(PlannerInfo *root,
 	node->canSetTag = canSetTag;
 	node->nominalRelation = nominalRelation;
 	node->partitioned_rels = partitioned_rels;
+	node->part_cols_updated = part_cols_updated;
 	node->resultRelations = resultRelations;
 	node->resultRelIndex = -1;	/* will be set correctly in setrefs.c */
 	node->rootResultRelIndex = -1;	/* will be set correctly in setrefs.c */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 9b7a8fd..9a8015e 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -111,6 +111,10 @@ typedef struct
 /* Local functions */
 static Node *preprocess_expression(PlannerInfo *root, Node *expr, int kind);
 static void preprocess_qual_conditions(PlannerInfo *root, Node *jtnode);
+static void get_all_partition_cols(List *rtables,
+					   Index root_rti,
+					   List *partitioned_rels,
+					   Bitmapset **all_part_cols);
 static void inheritance_planner(PlannerInfo *root);
 static void grouping_planner(PlannerInfo *root, bool inheritance_update,
 				 double tuple_fraction);
@@ -1048,6 +1052,40 @@ preprocess_phv_expression(PlannerInfo *root, Expr *expr)
 }
 
 /*
+ * get_all_partition_cols
+ *	  Get attribute numbers of all partition key columns of all the partitioned
+ *    tables.
+ *
+ * All the child partition attribute numbers are converted to the root
+ * partitioned table.
+ */
+static void
+get_all_partition_cols(List *rtables,
+					   Index root_rti,
+					   List *partitioned_rels,
+					   Bitmapset **all_part_cols)
+{
+	ListCell   *lc;
+	Oid			root_relid = getrelid(root_rti, rtables);
+	Relation	root_rel;
+
+	/* The caller must have already locked all the partitioned tables. */
+	root_rel = heap_open(root_relid, NoLock);
+	*all_part_cols = NULL;
+	foreach(lc, partitioned_rels)
+	{
+		Index		rti = lfirst_int(lc);
+		Oid			relid = getrelid(rti, rtables);
+		Relation	part_rel = heap_open(relid, NoLock);
+
+		pull_child_partition_columns(part_rel, root_rel, all_part_cols);
+		heap_close(part_rel, NoLock);
+	}
+
+	heap_close(root_rel, NoLock);
+}
+
+/*
  * inheritance_planner
  *	  Generate Paths in the case where the result relation is an
  *	  inheritance set.
@@ -1092,6 +1130,7 @@ inheritance_planner(PlannerInfo *root)
 	Query	   *parent_parse;
 	Bitmapset  *parent_relids = bms_make_singleton(top_parentRTindex);
 	PlannerInfo **parent_roots = NULL;
+	bool		part_cols_updated = false;
 
 	Assert(parse->commandType != CMD_INSERT);
 
@@ -1162,10 +1201,23 @@ inheritance_planner(PlannerInfo *root)
 	parent_rte = rt_fetch(top_parentRTindex, root->parse->rtable);
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
+		Bitmapset	*all_part_cols = NULL;
+
 		nominalRelation = top_parentRTindex;
 		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
+
+		/*
+		 * Retrieve the partition key columns of all the partitioned tables,
+		 * so as to check whether any of the columns being updated is
+		 * a partition key of any of the partitioned tables.
+		 */
+		get_all_partition_cols(root->parse->rtable, top_parentRTindex,
+							   partitioned_rels, &all_part_cols);
+
+		if (bms_overlap(all_part_cols, parent_rte->updatedCols))
+			part_cols_updated = true;
 	}
 
 	/*
@@ -1503,6 +1555,7 @@ inheritance_planner(PlannerInfo *root)
 									 parse->canSetTag,
 									 nominalRelation,
 									 partitioned_rels,
+									 part_cols_updated,
 									 resultRelations,
 									 subpaths,
 									 subroots,
@@ -2120,6 +2173,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 										parse->canSetTag,
 										parse->resultRelation,
 										NIL,
+										false,
 										list_make1_int(parse->resultRelation),
 										list_make1(path),
 										list_make1(root),
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 36ec025..3c93952 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3194,6 +3194,8 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partitioned_rels' is an integer list of RT indexes of non-leaf tables in
  *		the partition tree, if this is an UPDATE/DELETE to a partitioned table.
  *		Otherwise NIL.
+ * 'part_cols_updated' is true if any partitioning columns are being updated,
+ *		either from the named relation or a descendant partitioned table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
  * 'subpaths' is a list of Path(s) producing source data (one per rel)
  * 'subroots' is a list of PlannerInfo structs (one per rel)
@@ -3207,6 +3209,7 @@ ModifyTablePath *
 create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
@@ -3274,6 +3277,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->canSetTag = canSetTag;
 	pathnode->nominalRelation = nominalRelation;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
+	pathnode->part_cols_updated = part_cols_updated;
 	pathnode->resultRelations = resultRelations;
 	pathnode->subpaths = subpaths;
 	pathnode->subroots = subroots;
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 8acc01a..6a18d32 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -85,8 +85,8 @@ extern void check_new_partition_bound(char *relname, Relation parent,
 extern Oid	get_partition_parent(Oid relid);
 extern List *get_qual_from_partbound(Relation rel, Relation parent,
 						PartitionBoundSpec *spec);
-extern List *map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+extern List *map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
@@ -104,6 +104,9 @@ extern int get_partition_for_tuple(PartitionDispatch *pd,
 						EState *estate,
 						PartitionDispatchData **failed_at,
 						TupleTableSlot **failed_slot);
+extern void pull_child_partition_columns(Relation rel,
+							 Relation parent,
+							 Bitmapset **partcols);
 extern Oid	get_default_oid_from_partdesc(PartitionDesc partdesc);
 extern Oid	get_default_partition_oid(Oid parentId);
 extern void update_default_partition_oid(Oid parentId, Oid defaultPartId);
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index c4ecf0d..f39bb8d 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,7 +187,10 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
@@ -207,17 +210,22 @@ extern void EvalPlanQualSetTuple(EPQState *epqstate, Index rti,
 					 HeapTuple tuple);
 extern HeapTuple EvalPlanQualGetTuple(EPQState *epqstate, Index rti);
 extern void ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionDispatch **pd,
 							   ResultRelInfo ***partitions,
 							   TupleConversionMap ***tup_conv_maps,
+							   int **subplan_leaf_map,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions);
 extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot, EState *estate);
 
 #define EvalPlanQualSetSlot(epqstate, slot)  ((epqstate)->origslot = (slot))
 extern void EvalPlanQualFetchRowMarks(EPQState *epqstate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e05bc04..d2e8060 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -982,15 +982,19 @@ typedef struct ModifyTableState
 	int			mt_num_partitions;	/* Number of members in the following
 									 * arrays */
 	ResultRelInfo **mt_partitions;	/* Per partition result relation pointers */
-	TupleConversionMap **mt_partition_tupconv_maps;
-	/* Per partition tuple conversion map */
 	TupleTableSlot *mt_partition_tuple_slot;
+	TupleTableSlot *mt_root_tuple_slot;
 	struct TransitionCaptureState *mt_transition_capture;
 	/* controls transition table population for specified operation */
 	struct TransitionCaptureState *mt_oc_transition_capture;
 	/* controls transition table population for INSERT...ON CONFLICT UPDATE */
-	TupleConversionMap **mt_transition_tupconv_maps;
-	/* Per plan/partition tuple conversion */
+	TupleConversionMap **mt_parentchild_tupconv_maps;
+	/* Per partition map for tuple conversion from root to leaf */
+	TupleConversionMap **mt_childparent_tupconv_maps;
+	/* Per plan/partition map for tuple conversion from child to root */
+	bool		mt_is_tupconv_perpart;	/* Is the above map per-partition? */
+	int		*mt_subplan_partition_offsets;
+	/* Stores position of update result rels in leaf partitions */
 } ModifyTableState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index dd74efa..c414755 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -219,6 +219,7 @@ typedef struct ModifyTable
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	int			resultRelIndex; /* index of first resultRel in plan's list */
 	int			rootResultRelIndex; /* index of the partitioned table root */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 05fc9a3..30d307d 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1667,6 +1667,7 @@ typedef struct ModifyTablePath
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		part_cols_updated;	/* some part col in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	List	   *subpaths;		/* Path(s) producing source data */
 	List	   *subroots;		/* per-target-table PlannerInfos */
@@ -2109,6 +2110,10 @@ typedef struct AppendRelInfo
  * The child_rels list must contain at least one element, because the parent
  * partitioned table is itself counted as a child.
  *
+ * all_part_cols contains all attribute numbers from the parent that are
+ * used as partitioning columns by the parent or some descendant that is
+ * itself partitioned.
+ *
  * These structs are kept in the PlannerInfo node's pcinfo_list.
  */
 typedef struct PartitionedChildRelInfo
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e9ed16a..4b4485f 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -238,6 +238,7 @@ extern ModifyTablePath *create_modifytable_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool part_cols_updated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index a4fe961..50b76cf 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,36 +198,367 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+---------------------------
+-- UPDATE with row movement
+---------------------------
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+create table part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+alter table range_parted attach partition part_b_20_b_30 for values from ('b', 20) to ('b', 30);
+create table part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+alter table range_parted attach partition part_b_10_b_20 for values from ('b', 10) to ('b', 20);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) partition by range (d);
+create table part_d_1_15 partition of part_c_100_200 for values from (1) to (15);
+create table part_d_15_20 partition of part_c_100_200 for values from (15) to (20);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+(6 rows)
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+             QUERY PLAN              
+-------------------------------------
+ Update on range_parted
+   Update on part_a_1_a_10
+   Update on part_a_10_a_20
+   Update on part_b_1_b_10
+   Update on part_c_1_100
+   Update on part_d_1_15
+   Update on part_d_15_20
+   Update on part_b_20_b_30
+   ->  Seq Scan on part_a_1_a_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_a_10_a_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_1_b_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_1_100
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_1_15
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_15_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_20_b_30
+         Filter: (c > '97'::numeric)
+(22 rows)
+
+-- fail (row movement happens only within the partition subtree):
+update part_c_100_200 set c = c - 20, d = c where c = 105;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (null, 85, b, 15, 105).
+-- fail (no partition key update, so no attempt to move the tuple, but "a = 'a'" violates the partition constraint enforced by the root partition)
+update part_b_10_b_20 set a = 'a';
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (null, 1, 96, 12, a).
+-- success; partition key update, no constraint violation
+update range_parted set d = d - 10 where d > 10;
+-- success; no partition key update, no constraint violation
+update range_parted set e = d;
+-- No row found:
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d | e 
+----------------+---+----+-----+---+---
+ part_a_10_a_20 | a | 10 | 200 | 1 | 1
+ part_a_1_a_10  | a |  1 |   1 | 1 | 1
+ part_d_1_15    | b | 12 | 116 | 1 | 1
+ part_d_1_15    | b | 13 | 117 | 2 | 2
+ part_d_1_15    | b | 15 | 125 | 6 | 6
+ part_d_1_15    | b | 17 | 125 | 9 | 9
+(6 rows)
+
+-- fail (row movement happens only within the partition subtree):
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_d_1_15" violates partition constraint
+DETAIL:  Failing row contains (2, 117, b, 7, 2).
+-- ok (row movement, with a subset of rows moved into a different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_a_1_a_10 | a |  4 | 200 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+(6 rows)
+
+-- update the partition key using an updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120, 1, 1).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120, 1, 1).
+-- succeeds, row movement, check option passes
+update upview set a = 'b', b = 15 where b = 4;
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+ part_d_1_15   | b | 15 | 199 | 1 | 1
+(6 rows)
+
+-- cleanup
+drop view upview;
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+ range_parted  | a | b  | c  | d  | e 
+---------------+---+----+----+----+---
+ (b,15,95,16,) | b | 15 | 95 | 16 | 
+ (b,17,95,19,) | b | 17 | 95 | 19 | 
+(2 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  95 | 16 | 
+ part_c_1_100   | b | 17 |  95 | 19 | 
+(6 rows)
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,110,1,), (b,13,98,2,), (b,15,106,16,), (b,17,106,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 13 |  98 |  2 | 
+ part_d_15_20   | b | 15 | 106 | 16 | 
+ part_d_15_20   | b | 17 | 106 | 19 | 
+ part_d_1_15    | b | 12 | 110 |  1 | 
+(6 rows)
+
+:init_range_parted;
+-- Enabling OLD TABLE capture for both DELETE and UPDATE statement triggers
+-- should not cause deleted rows to be captured twice. The same applies to
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,146,1,), (b,13,147,2,), (b,15,155,16,), (b,17,155,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_15_20   | b | 15 | 155 | 16 | 
+ part_d_15_20   | b | 17 | 155 | 19 | 
+ part_d_1_15    | b | 12 | 146 |  1 | 
+ part_d_1_15    | b | 13 | 147 |  2 | 
+(6 rows)
+
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+-- Install BR triggers on the child partitions, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d1_15 before update or insert on part_d_1_15
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d15_20 before update or insert on part_d_15_20
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 15 |  98 |  2 | 
+ part_d_15_20   | b | 17 | 106 | 16 | 
+ part_d_15_20   | b | 19 | 106 | 19 | 
+ part_d_1_15    | b | 15 | 110 |  1 | 
+(6 rows)
+
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_15_20   | b | 17 | 155 | 16 | 
+ part_d_15_20   | b | 19 | 155 | 19 | 
+ part_d_1_15    | b | 15 | 146 |  1 | 
+ part_d_1_15    | b | 16 | 147 |  2 | 
+(6 rows)
+
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_d1_15 ON part_d_1_15;
+drop trigger trig_d15_20 ON part_d_15_20;
+drop function func_parted_mod_b();
+-- statement triggers with update row movement
+---------------------------------------------------
+:init_range_parted;
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_1_15
+create trigger d1_delete_trig
+  after delete on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_update_trig
+  after update on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_insert_trig
+  after insert on part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+create trigger d15_delete_trig
+  after delete on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_update_trig
+  after update on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_insert_trig
+  after insert on part_d_15_20 for each statement execute procedure trigfunc();
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+NOTICE:  trigger = parent_update_trig fired on table range_parted during UPDATE
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 150 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  55 | 16 | 
+ part_c_1_100   | b | 17 |  55 | 19 | 
+(6 rows)
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger d1_delete_trig ON part_d_1_15;
+drop trigger d1_update_trig ON part_d_1_15;
+drop trigger d1_insert_trig ON part_d_1_15;
+drop trigger d15_delete_trig ON part_d_15_20;
+drop trigger d15_update_trig ON part_d_15_20;
+drop trigger d15_insert_trig ON part_d_15_20;
+drop table mintab;
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
-                                  Table "public.part_def"
- Column |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description 
---------+---------+-----------+----------+---------+----------+--------------+-------------
- a      | text    |           |          |         | extended |              | 
- b      | integer |           |          |         | plain    |              | 
+                                       Table "public.part_def"
+ Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description 
+--------+-------------------+-----------+----------+---------+----------+--------------+-------------
+ a      | text              |           |          |         | extended |              | 
+ b      | bigint            |           |          |         | plain    |              | 
+ c      | numeric           |           |          |         | main     |              | 
+ d      | integer           |           |          |         | plain    |              | 
+ e      | character varying |           |          |         | extended |              | 
 Partition of: range_parted DEFAULT
-Partition constraint: (NOT (((a = 'a'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'a'::text) AND (b >= 10) AND (b < 20)) OR ((a = 'b'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'b'::text) AND (b >= 10) AND (b < 20))))
+Partition constraint: (NOT (((a = 'a'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'a'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'b'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '20'::bigint) AND (b < '30'::bigint))))
 
 insert into range_parted values ('c', 9);
 -- ok
@@ -235,7 +566,55 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 ERROR:  new row for relation "part_def" violates partition constraint
-DETAIL:  Failing row contains (a, 9).
+DETAIL:  Failing row contains (a, 9, null, null, null).
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from non-default to default partition.
+-- Fail; the default partition is not under part_a_10_a_20.
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+ERROR:  new row for relation "part_a_10_a_20" violates partition constraint
+DETAIL:  Failing row contains (ad, 10, 200, 1, null).
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+:show_data;
+ partname | a  | b  |  c  | d  | e 
+----------+----+----+-----+----+---
+ part_def | ad |  1 |   1 |  1 | 
+ part_def | ad | 10 | 200 |  1 | 
+ part_def | bd | 12 |  96 |  1 | 
+ part_def | bd | 13 |  97 |  2 | 
+ part_def | bd | 15 | 105 | 16 | 
+ part_def | bd | 17 | 105 | 19 | 
+ part_def | d  |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
 create table list_parted (
 	a text,
 	b int
@@ -250,6 +629,111 @@ ERROR:  new row for relation "list_default" violates partition constraint
 DETAIL:  Failing row contains (a, 10).
 -- ok
 update list_default set a = 'x' where a = 'd';
+drop table list_parted;
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+-- Test partition constraint violation when an intermediate ancestor is used and
+-- the constraint is inherited from the upper root.
+update sub_parted set a = 2 where c = 10;
+ERROR:  new row for relation "sub_part2" violates partition constraint
+DETAIL:  Failing row contains (2, 10, 2).
+-- UPDATE that does not modify the partition key of the partitions chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Triggers can cause UPDATE row movement if they modify the partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This changes the partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+-- This should route the tuple even though there is no explicit
+-- partition-key update, because the BR trigger on sub_part1 modifies the partition key
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+drop trigger parted_mod_b ON sub_part1 ;
+-- If a BR DELETE trigger prevents the DELETE from happening, we should also skip
+-- the INSERT when that DELETE is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+drop trigger trig_skip_delete ON sub_part1 ;
+-- UPDATE the partition key with a FROM clause. If the join produces multiple
+-- output rows for the same row to be modified, we should tuple-route the row
+-- only once; no additional rows should be inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 |  1 | 70
+ list_part1 | 2 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop table non_parted;
+drop function func_parted_mod_b();
 -- create custom operator class and hash function, for the same reason
 -- explained in alter_table.sql
 create or replace function dummy_hashint4(a int4, seed int8) returns int8 as
@@ -271,9 +755,8 @@ insert into hpart4 values (3, 4);
 update hpart1 set a = 3, b=4 where a = 1;
 ERROR:  new row for relation "hpart1" violates partition constraint
 DETAIL:  Failing row contains (3, 4).
+-- ok: row movement
 update hash_parted set b = b - 1 where b = 1;
-ERROR:  new row for relation "hpart1" violates partition constraint
-DETAIL:  Failing row contains (1, 0).
 -- ok
 update hash_parted set b = b + 8 where b = 1;
 -- cleanup
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 0c70d64..a07f113 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,25 +107,229 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+
+---------------------------
+-- UPDATE with row movement
+---------------------------
+
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If the partition key is updated, the row should be moved to the appropriate
+-- partition. Updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+create table part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+alter table range_parted attach partition part_b_20_b_30 for values from ('b', 20) to ('b', 30);
+create table part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+alter table range_parted attach partition part_b_10_b_20 for values from ('b', 10) to ('b', 20);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) partition by range (d);
+create table part_d_1_15 partition of part_c_100_200 for values from (1) to (15);
+create table part_d_15_20 partition of part_c_100_200 for values from (15) to (20);
+
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+
+create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+
+-- fail (row movement happens only within the partition subtree):
+update part_c_100_200 set c = c - 20, d = c where c = 105;
+-- fail (no partition key update, so no attempt to move the tuple, but "a = 'a'" violates the partition constraint enforced by the root partition)
+update part_b_10_b_20 set a = 'a';
+-- success; partition key update, no constraint violation
+update range_parted set d = d - 10 where d > 10;
+-- success; no partition key update, no constraint violation
+update range_parted set e = d;
+-- No row found:
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+:show_data;
+
+-- fail (row movement happens only within the partition subtree):
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with a subset of rows moved into a different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+:show_data;
+
+-- update the partition key using an updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement, check option passes
+update upview set a = 'b', b = 15 where b = 4;
+
+:show_data;
+
+-- cleanup
+drop view upview;
+
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+:show_data;
+
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+
+-- Enabling OLD TABLE capture for both DELETE and UPDATE statement triggers
+-- should not cause deleted rows to be captured twice. The same applies to
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+
+-- Install BR triggers on the child partitions, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d1_15 before update or insert on part_d_1_15
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d15_20 before update or insert on part_d_15_20
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_d1_15 ON part_d_1_15;
+drop trigger trig_d15_20 ON part_d_15_20;
+drop function func_parted_mod_b();
+
+
+-- statement triggers with update row movement
+---------------------------------------------------
+
+:init_range_parted;
+
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_d_1_15
+create trigger d1_delete_trig
+  after delete on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_update_trig
+  after update on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_insert_trig
+  after insert on part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+create trigger d15_delete_trig
+  after delete on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_update_trig
+  after update on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_insert_trig
+  after insert on part_d_15_20 for each statement execute procedure trigfunc();
+
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+:show_data;
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger d1_delete_trig ON part_d_1_15;
+drop trigger d1_update_trig ON part_d_1_15;
+drop trigger d1_insert_trig ON part_d_1_15;
+drop trigger d15_delete_trig ON part_d_15_20;
+drop trigger d15_update_trig ON part_d_15_20;
+drop trigger d15_insert_trig ON part_d_15_20;
+
+drop table mintab;
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
 
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
 insert into range_parted values ('c', 9);
@@ -134,6 +338,21 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 
+:show_data;
+
+-- Update row movement from non-default to default partition.
+-- Fail: the default partition is not under part_a_10_a_20.
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+:show_data;
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+:show_data;
+
 create table list_parted (
 	a text,
 	b int
@@ -148,6 +367,84 @@ update list_default set a = 'a' where a = 'd';
 -- ok
 update list_default set a = 'x' where a = 'd';
 
+drop table list_parted;
+
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+
+
+-----------
+-- Triggers can cause UPDATE row movement if they modify the partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This changes the partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger parted_mod_b ON sub_part1;
+
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger trig_skip_delete ON sub_part1;
+
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+drop table non_parted;
+
+drop function func_parted_mod_b();
+
 -- create custom operator class and hash function, for the same reason
 -- explained in alter_table.sql
 create or replace function dummy_hashint4(a int4, seed int8) returns int8 as
@@ -169,6 +466,7 @@ insert into hpart4 values (3, 4);
 
 -- fail
 update hpart1 set a = 3, b=4 where a = 1;
+-- ok : row movement
 update hash_parted set b = b - 1 where b = 1;
 -- ok
 update hash_parted set b = b + 8 where b = 1;
#192Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Thomas Munro (#190)
Re: UPDATE of partition key

On 9 November 2017 at 09:27, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

On Wed, Nov 8, 2017 at 5:57 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 8 November 2017 at 07:55, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

On Tue, Nov 7, 2017 at 8:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:

The changes to trigger.c still make me super-nervous. Hey THOMAS
MUNRO, any chance you could review that part?

At first, it seemed quite strange to me that row triggers and
statement triggers fire different events for the same modification.
Row triggers see DELETE + INSERT (necessarily because different
tables are involved), but this fact is hidden from the target table's
statement triggers.

The alternative would be for all triggers to see consistent events and
transitions. Instead of having your special case code in ExecInsert
and ExecDelete that creates the two halves of a 'synthetic' UPDATE for
the transition tables, you'd just let the existing ExecInsert and
ExecDelete code do its thing, and you'd need a flag to record that you
should also fire INSERT/DELETE after statement triggers if any rows
moved.

Yeah, I had also thought about that, but felt that change was too
invasive. For example, it would mean letting ExecARInsertTriggers() do
the transition capture even when transition_capture->tcs_update_new_table is set.

I was also thinking of having a separate function that *only* adds the
transition table rows, and calling it in ExecInsert instead of
ExecARUpdateTriggers(). But I realized that the existing
ExecARUpdateTriggers() looks like a better, more robust interface with
all its checks. It's just that calling ExecARUpdateTriggers() sounds
like we are also firing a trigger; we are not firing any trigger or
saving any event, we are just adding the transition row.

After sleeping on this question, I am coming around to the view that
the way you have it is right. The distinction isn't really between
row triggers and statement triggers, it's between triggers at
different levels in the hierarchy. It just so happens that we
currently only fire target table statement triggers and leaf table row
triggers.

Yes. And the rows are present only in leaf partitions, so we have to
make it appear as though the target table has these rows. Like you
mentioned, the user has to get the impression of a normal table, so we
have to do something extra to capture the rows.

Future development ideas that seem consistent with your choice:

1. If we ever allow row triggers with transition tables on child
tables, then I think *their* transition tables should certainly see
the deletes and inserts, otherwise OLD TABLE and NEW TABLE would be
inconsistent with the OLD and NEW variables in a single trigger
invocation. (These were prohibited mainly due to lack of time and
(AFAIK) limited usefulness; I think they would probably need
their own separate tuplestores, or possibly some kind of filtering.)

As we know, row triggers on leaf partitions are treated as triggers on
normal tables, so a trigger written on a leaf partition sees only the
local changes. The trigger is unaware of whether the insert is part of
an UPDATE row movement. Similarly, the transition table referenced by
that row trigger function should see only the NEW table, not the OLD
table.

2. If we ever allow row triggers on partitioned tables (ie that fire
when its children are modified), then I think their UPDATE trigger
should probably fire when a row moves between any two (grand-)*child
tables, just as you have it for target table statement triggers.

Yes I agree.

It doesn't matter that the view from parent tables' triggers is
inconsistent with the view from leaf table triggers: it's a feature
that we 'hide' partitioning from the user to the extent we can so that
you can treat the partitioned table just like a table.
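
A minimal SQL sketch of that visibility split (the table and partition
names here are hypothetical, not taken from the patch's tests):

```sql
-- Hypothetical list-partitioned table with two leaf partitions.
create table tgt (a int, b int) partition by list (a);
create table tgt_p1 partition of tgt for values in (1);
create table tgt_p2 partition of tgt for values in (2);
insert into tgt values (1, 10);

-- This UPDATE moves the row from tgt_p1 to tgt_p2.  Under the behaviour
-- discussed above, a statement-level trigger on tgt fires for UPDATE
-- only, while row-level triggers fire as a DELETE on tgt_p1 and an
-- INSERT on tgt_p2.
update tgt set a = 2 where a = 1;
```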

Any other views?

I think that because there is no provision for a row trigger on a
partitioned table, users who want a common trigger on a partition
subtree have no choice but to create the same trigger individually on
each leaf partition. And that is why we cannot handle update row
movement with triggers without anomalies.

Thanks
-Amit Khandekar

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#193David Rowley
david.rowley@2ndquadrant.com
In reply to: Amit Khandekar (#191)
Re: UPDATE of partition key

On 10 November 2017 at 16:42, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
[ update-partition-key_v23.patch ]

Hi Amit,

Thanks for working on this. I'm looking forward to seeing this go in.

So... I've signed myself up to review the patch, and I've just had a
look at it, (after first reading this entire email thread!).

Overall the patch looks like it's in quite a good shape. I think I do
agree with Robert about the UPDATE anomaly that's been discussed. I
don't think we're painting ourselves into any corner by not having
this working correctly right away. Anyone who's using some trigger
workarounds for the current lack of support for updating the partition
key is already going to have the same issues, so at least this will
save them some troubles implementing triggers and give them much
better performance. I see you've documented this fact too, which is
good.

I'm writing this email now as I've just run out of review time for today.

Here's what I noted down during my first pass:

1. Closing command tags in docs should not be abbreviated

triggers are concerned, <literal>AFTER</> <command>DELETE</command> and

This changed in c29c5789. I think Peter will be happy if you don't
abbreviate the closing tags.

2. "about to do" would read better as "about to perform"

concurrent session, and it is about to do an <command>UPDATE</command>

I think this paragraph could be more clear if we identified the
sessions with a number.

Perhaps:
Suppose, session 1 is performing an <command>UPDATE</command> on a
partition key, meanwhile, session 2 tries to perform an <command>UPDATE
</command> or <command>DELETE</command> operation on the same row.
Session 2 can silently miss the row due to session 1's activity. In
such a case, session 2's <command>UPDATE</command>/<command>DELETE
</command>, being unaware of the row's movement, interprets this as the
row having just been deleted, so there is nothing to be done for this row.
Whereas, in the usual case where the table is not partitioned, or where
there is no row movement, the second session would have identified the
newly updated row and carried out the <command>UPDATE</command>/<command>DELETE
</command> on this new row version.
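
For illustration, the scenario could play out as below (the table and
column names are invented for the example):

```sql
-- Session 1: moves the row to a different partition.
begin;
update parted set pkey = 2 where pkey = 1;

-- Session 2, concurrently: blocks on the row lock taken by session 1.
update parted set val = val + 1 where pkey = 1;

-- Session 1:
commit;
-- Session 2 now finds the old row version deleted (the new version
-- lives in another partition), so its UPDATE silently affects 0 rows.
```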

3. Integer width. get_partition_natts returns int but we assign to int16.

int16 partnatts = get_partition_natts(key);

Confusingly, get_partition_col_attnum() returns int16 instead of AttrNumber,
but that's a pre-existing issue.

4. The following code could be replaced with just pull_varattnos(partexprs, 1, &child_keycols);

foreach(lc, partexprs)
{
Node *expr = (Node *) lfirst(lc);

pull_varattnos(expr, 1, &child_keycols);
}

5. Triggers. Do we need a new "TG_" tag to allow trigger functions to
do something special when the DELETE/INSERT is a partition move? I
have audit tables in mind here; it may appear as though a user
performed a DELETE when they actually performed an UPDATE. Giving
visibility of this to the trigger function will allow the application
to work around this.

6. change "row" to "a row" and "old" to "the old"

* depending on whether the event is for row being deleted from old

But to be honest, I'm having trouble parsing the comment. I think it
would be better to
say explicitly when the row will be NULL rather than "depending on
whether the event"

7. I'm confused about how this change came about. If the old comment
was correct here then the comment you're referring to here should
remain in ExecPartitionCheck(), but you're saying it's in
ExecConstraints().

/* See the comments in ExecConstraints. */

If the comment really is in ExecConstraints(), then you might want to
give an overview of what you mean, then reference ExecConstraints() if
more details are required.

8. I'm having trouble parsing this comment:

* 'update_rri' has the UPDATE per-subplan result rels.

I think "has" should be "contains" ?

9. Also, this should likely be reworded:

* 'num_update_rri' : number of UPDATE per-subplan result rels. For INSERT,
* this is 0.

'num_update_rri' number of elements in 'update_rri' array or zero for INSERT.

10. There should be no space before the '?'

/* Is this leaf partition present in the update resultrel ? */

11. I'm struggling to understand this comment:

* This is required when converting tuple as per root
* partition tuple descriptor.

"tuple" should probably be "the tuple", but not quite sure what you
mean by "as per root".

I may have misunderstood, but maybe it should read:

* This is required when we convert the partition's tuple to
* be compatible with the partitioned table's tuple descriptor.

12. I think "as well" would be better written as "either".

* If we didn't open the partition rel, it means we haven't
* initialized the result rel as well.

13. I'm unsure what is meant by the following comment:

* Verify result relation is a valid target for insert operation. Even
* for updates, we are doing this for tuple-routing, so again, we need
* to check the validity for insert operation.

I'm not quite sure where UPDATE comes in here as we're only checking for INSERT?

14. Use of underscores instead of camelCase.

COPY_SCALAR_FIELD(part_cols_updated);

I know you're not the first one to break this as "partitioned_rels"
does not follow it either, but that's probably not a good enough
reason to break away from camelCase any further.

I'd suggest "partColsUpdated". But after a re-think, maybe cols is
incorrect. All columns are partitioned, it's the key columns that we
care about, so how about "partKeyUpdate"

15. Are you sure that you mean "root" here?

* All the child partition attribute numbers are converted to the root
* partitioned table.

Surely this is just the target relation. "parent" maybe? A
sub-partitioned table might be the target of an UPDATE too.

15. I see get_all_partition_cols() is just used once to check if
parent_rte->updatedCols contains any partition keys.

Would it not be better to reform that function and pass
parent_rte->updatedCols in and abort as soon as you see a single
match?

Maybe the function could return bool and be named
partitioned_key_overlaps(), that way your assignment in
inheritance_planner() would just become:

part_cols_updated = partitioned_key_overlaps(root->parse->rtable,
top_parentRTindex, partitioned_rels, parent_rte->updatedCols);

or something like that anyway.

16. Typo in comment

* 'part_cols_updated' if any partitioning columns are being updated, either
* from the named relation or a descendent partitione table.

"partitione" should be "partitioned". Also, normally for bool
parameters, we might word things like "True if ..." rather than just
"if"

You probably should follow camelCase I mentioned in 14 here too.

17. Comment needs a few changes:

* ConvertPartitionTupleSlot -- convenience function for converting tuple and
* storing it into a tuple slot provided through 'new_slot', which typically
* should be one of the dedicated partition tuple slot. Passes the partition
* tuple slot back into output param p_old_slot. If no mapping present, keeps
* p_old_slot unchanged.
*
* Returns the converted tuple.

There are a few typos here. For example, "tuple" should be "a tuple",
but maybe the comment should just be worded like:

* ConvertPartitionTupleSlot -- convenience function for tuple conversion
* using 'map'. The tuple, if converted, is stored in 'new_slot' and
* 'p_old_slot' is set to the original partition tuple slot. If map is NULL,
* then the original tuple is returned unmodified, otherwise the converted
* tuple is returned.

18. Line goes over 80 chars.

TransitionCaptureState *transition_capture = mtstate->mt_transition_capture;

Better just to split the declaration and assignment.

19. Confusing comment:

/*
* If the original operation is UPDATE, the root partitioned table
* needs to be fetched from mtstate->rootResultRelInfo.
*/

It's not that clear here how you determine this is an UPDATE of a
partitioned key.

20. This code looks convoluted:

rootResultRelInfo = (mtstate->rootResultRelInfo ?
mtstate->rootResultRelInfo : resultRelInfo);

/*
* If the resultRelInfo is not the root partitioned table (which
* happens for UPDATE), we should convert the tuple into root's tuple
* descriptor, since ExecFindPartition() starts the search from root.
* The tuple conversion map list is in the order of
* mtstate->resultRelInfo[], so to retrieve the one for this resultRel,
* we need to know the position of the resultRel in
* mtstate->resultRelInfo[].
*/
if (rootResultRelInfo != resultRelInfo)
{

rootResultRelInfo is assigned via a ternary expression which makes the
subsequent if test seem a little strange.

Would it not be better to test:

if (mtstate->rootResultRelInfo)
{
rootResultRelInfo = mtstate->rootResultRelInfo
... other stuff ...
}
else
rootResultRelInfo = resultRelInfo;

Then above the if test you can explain that rootResultRelInfo is only
set during UPDATE of partition keys, as per #19.

21. How come you renamed mt_partition_tupconv_maps[] to
mt_parentchild_tupconv_maps[]?

22. Comment in ExecInsert() could be worded better.

/*
* In case this is part of update tuple routing, put this row into the
* transition NEW TABLE if we are capturing transition tables. We need to
* do this separately for DELETE and INSERT because they happen on
* different tables.
*/

/*
* This INSERT may be the result of a partition-key-UPDATE. If so,
* and we're required to capture transition tables then we'd better
* record this as a statement level UPDATE on the target relation.
* We're not interested in the statement level DELETE or INSERT as
* these occur on the individual partitions, none of which are the
* target of this UPDATE statement.
*/

A similar comment could use a similar improvement in ExecDelete()

23. Line is longer than 80 chars.

TransitionCaptureState *transition_capture = mtstate->mt_transition_capture;

24. I know from reading the thread this name has changed before, but I
think delete_skipped seems like the wrong name for this variable in:

if (delete_skipped)
*delete_skipped = true;

Skipped is the wrong word here as it indicates that we had some sort
of choice and decided not to. However, that's not the case
when the tuple was concurrently deleted. Would it not be better to
call it "tuple_deleted" or even "success" and reverse the logic? It's
just a bit confusing that you're setting this to skipped before
anything happens. It would be nicer if there was a better way to do
this whole thing as it's a bit of a wart in the code. I understand why
the code exists though.

Also, I wonder if it's better to always pass a boolean here to save
having to test for NULL before setting it. That way you might consider
putting the success = false just before the return NULL, then doing
success = true after the tuple is gone.
Failing that, putting something like success = false; /* not yet! */
where you're doing the if (delete_skipped) test might also be
better.

25. Comment "we should" should be "we must".

/*
* For some reason if DELETE didn't happen (for e.g. trigger
* prevented it, or it was already deleted by self, or it was
* concurrently deleted by another transaction), then we should
* skip INSERT as well, otherwise, there will be effectively one
* new row inserted.

Maybe just:
/* If the DELETE operation was unsuccessful, then we must not
* perform the INSERT into the new partition.

"for e.g." is not really correct in English. "For example, ..." or
just "e.g. ..." is correct. If you de-abbreviate the e.g. then you've
written "For exempli gratia", which translates to "For for example".

26. You're not really explaining what's going on here:

if (mtstate->mt_transition_capture)
saved_tcs_map = mtstate->mt_transition_capture->tcs_map;

You have a comment later to say you're about to "Revert back to the
transition capture map", but I missed the part that explained about
modifying it in the first place.

27. The comment does not explain how we're skipping the partition
constraint check in:

* We have already checked partition constraints above, so skip
* checking them here.

Maybe something like:

* We've already checked the partition constraint above, however, we
* must still ensure the tuple passes all other constraints, so we'll
* call ExecConstraints() and have it validate all remaining checks.

28. For tables WITH OIDS, the OID should probably follow the new tuple
for partition-key-UPDATEs.

CREATE TABLE p (a BOOL NOT NULL, b INT NOT NULL) PARTITION BY LIST (a)
WITH OIDS;
CREATE TABLE P_true PARTITION OF p FOR VALUES IN('t');
CREATE TABLE P_false PARTITION OF p FOR VALUES IN('f');
INSERT INTO p VALUES('t', 10);
SELECT tableoid::regclass,oid,a FROM p;
tableoid | oid | a
----------+-------+---
p_true | 16792 | t
(1 row)

UPDATE p SET a = 'f'; -- partition-key-UPDATE (oid has changed (it
probably shouldn't have))
SELECT tableoid::regclass,oid,a FROM p;
tableoid | oid | a
----------+-------+---
p_false | 16793 | f
(1 row)

UPDATE p SET b = 20; -- non-partition-key-UPDATE (oid remains the same)

SELECT tableoid::regclass,oid,a FROM p;
tableoid | oid | a
----------+-------+---
p_false | 16793 | f
(1 row)

I'll try to continue with the review tomorrow, but I think some other
reviews are looming too.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#194Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Amit Khandekar (#191)
Re: [HACKERS] UPDATE of partition key

On Fri, Nov 10, 2017 at 4:42 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Attached is v23 patch that has just the above changes (and also
rebased on hash-partitioning changes, like update.sql). I am still
doing some sanity testing on this, although regression passes.

The test coverage[1] is 96.62%. Nice work. Here are the bits that
aren't covered:

In partition.c's pull_child_partition_columns(), the following loop is
never run:

+       foreach(lc, partexprs)
+       {
+               Node       *expr = (Node *) lfirst(lc);
+
+               pull_varattnos(expr, 1, &child_keycols);
+       }

In nodeModifyTable.c, the following conditional branches are never run:

                if (mtstate->mt_oc_transition_capture != NULL)
+               {
+                       Assert(mtstate->mt_is_tupconv_perpart == true);
                        mtstate->mt_oc_transition_capture->tcs_map =
-                               mtstate->mt_transition_tupconv_maps[leaf_part_index];
+                               mtstate->mt_childparent_tupconv_maps[leaf_part_index];
+               }

                if (node->mt_oc_transition_capture != NULL)
                {
-                       Assert(node->mt_transition_tupconv_maps != NULL);
                        node->mt_oc_transition_capture->tcs_map =
-                               node->mt_transition_tupconv_maps[node->mt_whichplan];
+                               tupconv_map_for_subplan(node, node->mt_whichplan);
                }

Is there any reason we shouldn't be able to test these paths?

[1]: https://codecov.io/gh/postgresql-cfbot/postgresql/commit/a3beb8d8f598a64d75aa4b3afc143a5d3e3f7826

--
Thomas Munro
http://www.enterprisedb.com

#195David Rowley
david.rowley@2ndquadrant.com
In reply to: David Rowley (#193)
Re: [HACKERS] UPDATE of partition key

On 14 November 2017 at 01:55, David Rowley <david.rowley@2ndquadrant.com> wrote:

I'll try to continue with the review tomorrow, but I think some other
reviews are also looming too.

I started looking at this again today. Here's the remainder of my review.

29. ExecSetupChildParentMap gets called here for non-partitioned relations.
Maybe that's not the best function name? The function only seems to do
that when perleaf is True.

Is a leaf a partition of a partitioned table? The meaning is not that
clear here.

/*
* If we found that we need to collect transition tuples then we may also
* need tuple conversion maps for any children that have TupleDescs that
* aren't compatible with the tuplestores. (We can share these maps
* between the regular and ON CONFLICT cases.)
*/
if (mtstate->mt_transition_capture != NULL ||
mtstate->mt_oc_transition_capture != NULL)
{
int numResultRelInfos;

numResultRelInfos = (mtstate->mt_partition_tuple_slot != NULL ?
mtstate->mt_num_partitions :
mtstate->mt_nplans);

ExecSetupChildParentMap(mtstate, targetRelInfo, numResultRelInfos,
(mtstate->mt_partition_dispatch_info != NULL));

30. The following chunk of code is giving me a headache trying to
verify which arrays are which size:

ExecSetupPartitionTupleRouting(rel,
mtstate->resultRelInfo,
(operation == CMD_UPDATE ? nplans : 0),
node->nominalRelation,
estate,
&partition_dispatch_info,
&partitions,
&partition_tupconv_maps,
&subplan_leaf_map,
&partition_tuple_slot,
&num_parted, &num_partitions);
mtstate->mt_partition_dispatch_info = partition_dispatch_info;
mtstate->mt_num_dispatch = num_parted;
mtstate->mt_partitions = partitions;
mtstate->mt_num_partitions = num_partitions;
mtstate->mt_parentchild_tupconv_maps = partition_tupconv_maps;
mtstate->mt_subplan_partition_offsets = subplan_leaf_map;
mtstate->mt_partition_tuple_slot = partition_tuple_slot;
mtstate->mt_root_tuple_slot = MakeTupleTableSlot();

I know this patch is not completely responsible for it, but you're not
making things any better.

Would it not be better to invent some PartitionTupleRouting struct and
make that struct a member of ModifyTableState and CopyState, then just
pass the pointer to that struct to ExecSetupPartitionTupleRouting()
and have it fill in the required details? I think the complexity of
this is already on the high end, I think you really need to do the
refactor before this gets any worse.

The signature of the function is a bit scary!

extern void ExecSetupPartitionTupleRouting(Relation rel,
ResultRelInfo *update_rri,
int num_update_rri,
Index resultRTindex,
EState *estate,
PartitionDispatch **pd,
ResultRelInfo ***partitions,
TupleConversionMap ***tup_conv_maps,
int **subplan_leaf_map,
TupleTableSlot **partition_tuple_slot,
int *num_parted, int *num_partitions);

What do you think?

31. The following code seems incorrect:

/*
* If this is an UPDATE and a BEFORE UPDATE trigger is present, we may
* need to do update tuple routing.
*/
if (resultRelInfo->ri_TrigDesc &&
resultRelInfo->ri_TrigDesc->trig_update_before_row &&
operation == CMD_UPDATE)
update_tuple_routing_needed = true;

Shouldn't this be setting update_tuple_routing_needed to false if
there are no before row update triggers? Otherwise, you're setting it
to true regardless of whether any partition key columns are being
UPDATEd. That would make the work you're doing in
inheritance_planner() to set part_cols_updated a waste of time.

Also, this bit of code is a bit confused.

/* Decide whether we need to perform update tuple routing. */
if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
update_tuple_routing_needed = false;

/*
* Build state for tuple routing if it's an INSERT or if it's an UPDATE of
* partition key.
*/
if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
(operation == CMD_INSERT || update_tuple_routing_needed))

The first if test would not be required if you fixed the code where
you set update_tuple_routing_needed = true regardless of whether it's
a partitioned table or not.

So basically, you need to take node->part_cols_updated from the
planner; if that's true, perform your test for before row update
triggers, set a bool to false if there are none, then proceed to set up
the partition tuple routing for partitioned table inserts or if your
bool is still true. Right?

32. "WCO" abbreviation is not that common and might need to be expanded.

* Below are required as reference objects for mapping partition
* attno's in expressions such as WCO and RETURNING.

Searching for other comments which mention "WCO", they all appear in
places where it is easy to understand that they mean "With Check Option", e.g.
next to a variable with a more descriptive name. That's not the case
here.

33. "are anyway newly allocated", should "anyway" be "always"?
Otherwise, it does not make sense.

* If this result rel is one of the subplan result rels, let
* ExecEndPlan() close it. For INSERTs, this does not apply because
* all leaf partition result rels are anyway newly allocated.

34. Comment added which mentions a member that does not exist.

* all_part_cols contains all attribute numbers from the parent that are
* used as partitioning columns by the parent or some descendent which is
* itself partitioned.
*

I've not looked at the test coverage as I see Thomas has been looking
at that in some detail.

I'm going to set this patch as waiting for author now.

Thanks again for working on this.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#196Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: David Rowley (#193)
Re: [HACKERS] UPDATE of partition key

David Rowley wrote:

5. Triggers. Do we need a new "TG_" tag to allow trigger functions to
do something special when the DELETE/INSERT is a partition move? I
have audit tables in mind here; it may appear as though a user
performed a DELETE when they actually performed an UPDATE. Giving
visibility of this to the trigger function will allow the application
to work around this.

+1 I think we do need a flag that can be inspected from the user
trigger function.

9. Also, this should likely be reworded:

* 'num_update_rri' : number of UPDATE per-subplan result rels. For INSERT,
* this is 0.

'num_update_rri' number of elements in 'update_rri' array or zero for INSERT.

Also:

/pgsql/source/master/src/backend/executor/execMain.c: In function 'ExecSetupPartitionTupleRouting':
/pgsql/source/master/src/backend/executor/execMain.c:3401:18: warning: 'leaf_part_arr' may be used uninitialized in this function [-Wmaybe-uninitialized]
leaf_part_rri = leaf_part_arr + i;
~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~

I think using num_update_rri==0 as a flag to indicate INSERT is strange.
I suggest passing an additional boolean -- or maybe just split the whole
function in two, one for updates and another for inserts, say
ExecSetupPartitionTupleRoutingForInsert() and
ExecSetupPartitionTupleRoutingForUpdate(). They seem to
share almost no code, and the current flow is hard to read; maybe just
add a common subroutine for the bottom part of the loop.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#197Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: David Rowley (#193)
1 attachment(s)
Re: [HACKERS] UPDATE of partition key

Thanks David Rowley, Alvaro Herrera and Thomas Munro for stepping in
for the reviews!

In the attached patch v24, I have addressed Amit Langote's remaining
review points, and David Rowley's comments up to point #26.

Yet to address :
Robert's few suggestions.
All of Alvaro's comments.
David's points from #27 to #34.
Thomas's point about adding remaining test coverage on transition tables.

Below has the responses for both Amit's and David's comments, starting
with Amit's ....

===============

On 2 November 2017 at 12:40, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/10/24 0:15, Amit Khandekar wrote:

On 16 October 2017 at 08:28, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

+ (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup
== NULL))))

Is there some reason why a bitwise operator is used here?

That exact condition means that the function is called for transition
capture for updated rows being moved to another partition. For this
scenario, either the oldtup or the newtup is NULL. I wanted to exactly
capture that condition there. I think the bitwise operator is more
user-friendly in emphasizing the point that it is indeed an "either a
or b, not both" condition.

I see. In that case, since this patch adds the new condition, a note
about it in the comment just above would be good, because the situation
you describe here seems to arise only during update-tuple-routing, IIUC.

Done. Please check.

+ * 'update_rri' has the UPDATE per-subplan result rels. These are re-used
+ *      instead of allocating new ones while generating the array of all leaf
+ *      partition result rels.

Instead of:

"These are re-used instead of allocating new ones while generating the
array of all leaf partition result rels."

how about:

"There is no need to allocate a new ResultRellInfo entry for leaf
partitions for which one already exists in this array"

Ok. I have made it like this :

+ * 'update_rri' contains the UPDATE per-subplan result rels. For the
output param
+ *             'partitions', we don't allocate new ResultRelInfo objects for
+ *             leaf partitions for which they are already available
in 'update_rri'.

ISTM, ConvertPartitionTupleSlot has a very specific job and a bit complex
interface. I guess it could simply have the following interface:

static HeapTuple ConvertPartitionTuple(ModifyTabelState *mtstate,
HeapTuple tuple, bool is_update);

And figure out, based on the value of is_update, which map to use and
which slot to set *p_new_slot to (what is now "new_slot" argument).
You're getting mtstate here anyway, which contains all the information you
need here. It seems better to make that (selecting which map and which
slot) part of the function's implementation if we're having this function
at all, imho. Maybe I'm missing some details there, but my point still
remains that we should try to put more logic in that function instead of
it just doing the mechanical tuple conversion.

I tried to see how the interface would look if we do that way. Here is
how the code looks :

static TupleTableSlot *
ConvertPartitionTupleSlot(ModifyTableState *mtstate,
bool for_update_tuple_routing,
int map_index,
HeapTuple *tuple,
TupleTableSlot *slot)
{
TupleConversionMap *map;
TupleTableSlot *new_slot;

if (for_update_tuple_routing)
{
map = mtstate->mt_persubplan_childparent_maps[map_index];
new_slot = mtstate->mt_rootpartition_tuple_slot;
}
else
{
map = mtstate->mt_perleaf_parentchild_maps[map_index];
new_slot = mtstate->mt_partition_tuple_slot;
}

if (!map)
return slot;

*tuple = do_convert_tuple(*tuple, map);

/*
* Change the partition tuple slot descriptor, as per converted tuple.
*/
ExecSetSlotDescriptor(new_slot, map->outdesc);
ExecStoreTuple(*tuple, new_slot, InvalidBuffer, true);

return new_slot;
}

It looks like the interface does not get much simpler, and on top of
that, we have more lines in that function. Also, the caller anyway
has to be aware whether the map_index is the index into the leaf
partitions or the update subplans. So it is not like the caller does
not have to be aware about whether the mapping should be
mt_persubplan_childparent_maps or mt_perleaf_parentchild_maps.

Hmm, I think we should try to make it so that the caller doesn't have to
be aware of that. And by caller I guess you mean ExecInsert(), which
should not be a place, IMHO, where to try to introduce a lot of new logic
specific to update tuple routing.

I think, since we have already given ExecInsert() the job of routing
the tuple from the root partitioned table to a partition, it makes
sense to give the function the additional job of routing the tuple
from any partition to any partition. ExecInsert() can be looked at as
doing this job : "insert a tuple into the right partition; the
original tuple can belong to any partition"

With that, now there are no persubplan and perleaf arrays for ExecInsert()
to pick from to select a map to pass to ConvertPartitionTupleSlot(), or
maybe even no need for the separate function. The tuple-routing code
block in ExecInsert would look like below (writing resultRelInfo as just Rel):

rootRel = (mtstate->rootRel != NULL) ? mtstate->rootRel : Rel

if (rootRel != Rel) /* update tuple-routing active */
{
int subplan_off = Rel - mtstate->Rel[0];
int leaf_off = mtstate->mt_subplan_partition_offsets[subplan_off];

if (mt_transition_tupconv_maps[leaf_off])
{
/*
* Convert to root format using
* mt_transition_tupconv_maps[leaf_off]
*/

slot = mt_root_tuple_slot; /* for tuple-routing */

/* Store the converted tuple into slot */
}
}

/* Existing tuple-routing flow follows */
new_leaf = ExecFindPartition(rootRel, slot, ...)

if (mtstate->transition_capture)
{
transition_capture_map = mt_transition_tupconv_maps[new_leaf]
}

if (mt_partition_tupconv_maps[new_leaf])
{
/*
* Convert to leaf format using mt_partition_tupconv_maps[new_leaf]
*/

slot = mt_partition_tuple_slot;

/* Store the converted tuple into slot */
}

After doing the changes for the int[] array map in the previous patch
version, I still feel that ConvertPartitionTupleSlot() should be
retained. It saves some repeated lines of code.

On HEAD, the "parent Plan" refers to
mtstate->mt_plans[0]. Now in the patch, for the parent node in
ExecInitQual(), mtstate->ps is passed rather than mt_plans[0]. So the
parent plan refers to this mtstate node.

Hmm, I'm not really sure if doing that (passing mtstate->ps) would be
accurate. In the update tuple routing case, it seems that it's better to
pass the correct parent PlanState pointer to ExecInitQual(), that is, one
corresponding to the partition's sub-plan. At least I get that feeling by
looking at how parent is used downstream to that ExecInitQual() call, but
there *may* not be anything to worry about there after all. I'm unsure.

BTW, the reason I had changed the parent node to mtstate->ps is :
Other places in that code use mtstate->ps while initializing
expressions :

/*
* Build a projection for each result rel.
*/
resultRelInfo->ri_projectReturning =
ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
resultRelInfo->ri_RelationDesc->rd_att);

...........

/* build DO UPDATE WHERE clause expression */
if (node->onConflictWhere)
{
ExprState *qualexpr;

qualexpr = ExecInitQual((List *) node->onConflictWhere,
&mtstate->ps);
....
}

I think wherever we initialize expressions belonging to a plan, we
should use that plan as the parent. WithCheckOptions are fields of
ModifyTableState.

You may be right, but I see for WithCheckOptions initialization
specifically that the non-tuple-routing code passes the actual sub-plan
when initializing the WCO for a given result rel.

Yes that's true. The problem with WithCheckOptions for newly allocated
partition result rels is : we can't use a subplan for the parent
parameter because there is no subplan for it. But I will still think
on it a bit more (TODO).

Comments on the optimizer changes:

+get_all_partition_cols(List *rtables,

Did you mean rtable?

I did mean rtables. It's a list of rtables.

It's not, AFAIK. rtable (range table) is a list of range table entries,
which is also what seems to get passed to get_all_partition_cols for that
argument (root->parse->rtable, which is not a list of lists).

Moreover, there are no existing instances of this naming within the
planner other than those that this patch introduces:

$ grep rtables src/backend/optimizer/
planner.c:114: static void get_all_partition_cols(List *rtables,
planner.c:1063: get_all_partition_cols(List *rtables,
planner.c:1069: Oid root_relid = getrelid(root_rti, rtables);
planner.c:1078: Oid relid = getrelid(rti, rtables);

OTOH, dependency.c does have rtables, but it's actually a list of range
tables. For example:

dependency.c:1360: context.rtables = list_make1(rtable);

Yes, Ok. To be consistent with naming convention at multiple places, I
have changed it to rtable.

+       if (partattno != 0)
+           child_keycols =
+               bms_add_member(child_keycols,
+                              partattno -
FirstLowInvalidHeapAttributeNumber);
+   }
+   foreach(lc, partexprs)
+   {

Elsewhere (in quite a few places), we don't iterate over partexprs
separately like this, although I'm not saying it is bad, just different
from other places.

I think you are suggesting we do it like how it's done in
is_partition_attr(). Can you please let me know the other places where
we do it the same way? I couldn't find any.

OK, not as many as I thought there would be, but there are following
beside is_partition_attrs():

partition.c: get_range_nulltest()
partition.c: get_qual_for_range()
relcache.c: RelationBuildPartitionKey()

Ok, I think I will first address Robert's suggestion of re-using
is_partition_attrs() for pull_child_partition_columns(). If I do that,
this discussion won't be applicable, so I am deferring this one.
(TODO)

=============

Below are my responses to David's comments upto point #26 :

On 13 November 2017 at 18:25, David Rowley <david.rowley@2ndquadrant.com> wrote:

On 10 November 2017 at 16:42, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
[ update-partition-key_v23.patch ]

Hi Amit,

Thanks for working on this. I'm looking forward to seeing this go in.

So... I've signed myself up to review the patch, and I've just had a
look at it, (after first reading this entire email thread!).

Thanks a lot for your extensive review.

Overall the patch looks like it's in quite a good shape.

Nice to hear that.

I think I do agree with Robert about the UPDATE anomaly that's been discussed.
I don't think we're painting ourselves into any corner by not having
this working correctly right away. Anyone who's using some trigger
workarounds for the current lack of support for updating the partition
key is already going to have the same issues, so at least this will
save them some troubles implementing triggers and give them much
better performance.

I believe you are referring to the concurrency anomaly. Yes I agree on
that. By the way (you may already be aware), there is a separate mail
thread going on to address this anomaly, so that we don't silently
proceed with the UPDATE without error :

/messages/by-id/CAAJ_b95PkwojoYfz0bzXU8OokcTVGzN6vYGCNVUukeUDrnF3dw@mail.gmail.com

1. Closing command tags in docs should not be abbreviated

triggers are concerned, <literal>AFTER</> <command>DELETE</command> and

This changed in c29c5789. I think Peter will be happy if you don't
abbreviate the closing tags.

Added the tag. I had done most of the corrections after I rebased over
this commit, but I think I missed some of those with <literal> tag.

2. "about to do" would read better as "about to perform"

concurrent session, and it is about to do an <command>UPDATE</command>

I think this paragraph could be more clear if we identified the
sessions with a number.

Perhaps:
Suppose, session 1 is performing an <command>UPDATE</command> on a
partition key, meanwhile, session 2 tries to perform an <command>UPDATE
</command> or <command>DELETE</command> operation on the same row.
Session 2 can silently miss the row due to session 1's activity. In
such a case, session 2's <command>UPDATE</command>/<command>DELETE
</command>, being unaware of the row's movement, interprets this as the
row having just been deleted, so there is nothing to be done for this row.
Whereas, in the usual case where the table is not partitioned, or where
there is no row movement, the second session would have identified the
newly updated row and carried <command>UPDATE</command>/<command>DELETE
</command> on this new row version.

Done like above, with slight changes.

3. Integer width. get_partition_natts returns int but we assign to int16.

int16 partnatts = get_partition_natts(key);

Confusingly get_partition_col_attnum() returns int16 instead of AttrNumber
but that's existingly not correct.

4. The following code could just pull_varattnos(partexprs, 1, &child_keycols);

foreach(lc, partexprs)
{
Node *expr = (Node *) lfirst(lc);

pull_varattnos(expr, 1, &child_keycols);
}

I will defer this till I address Robert's request to try and see if we
can have a common code for pull_child_partition_columns() and
is_partition_attr(). (TODO)

5. Triggers. Do we need a new "TG_" tag to allow trigger functions to
do something
special when the DELETE/INSERT is a partition move? I have audit
tables in mind here; it may appear as though a user performed a DELETE
when they actually performed an UPDATE. Giving visibility of this to
the trigger function will allow the application to work around this.

I feel it's too early to add a user-visible variable for such purpose.
Currently we don't support triggers on partitioned tables, and so a
user who wants to have a common trigger for a partition subtree has no
choice but to install the same trigger on all the leaf partitions
under it. And so we have to live with a not-very-obvious behaviour of
firing triggers even for the delete/insert part of the update row
movement.

6. change "row" to "a row" and "old" to "the old"

* depending on whether the event is for row being deleted from old

But to be honest, I'm having trouble parsing the comment. I think it
would be better to
say explicitly when the row will be NULL rather than "depending on
whether the event"

I have put it this way now :

* For INSERT events newtup should be non-NULL, for DELETE events
* oldtup should be non-NULL, whereas for UPDATE events normally both
* oldtup and newtup are non-NULL. But for an UPDATE event fired for
* capturing transition tuples during UPDATE partition-key row
* movement, oldtup is NULL when the event is for the row being inserted,
* whereas newtup is NULL when the event is for the row being deleted.

7. I'm confused with how this change came about. If the old comment
was correct here then the comment you're referring to here should
remain in ExecPartitionCheck(), but you're saying it's in
ExecConstraints().

/* See the comments in ExecConstraints. */

If the comment really is in ExecConstraints(), then you might want to
give an overview of what you mean, then reference ExecConstraints() if
more details are required.

I have put it this way :
* Need to first convert the tuple to the root partitioned table's row
* type. For details, check similar comments in ExecConstraints().

Basically, the comment to be referred in ExecConstraints() is this :
* If the tuple has been routed, it's been converted to the
* partition's rowtype, which might differ from the root
* table's. We must convert it back to the root table's
* rowtype so that val_desc shown error message matches the
* input tuple.

8. I'm having trouble parsing this comment:

* 'update_rri' has the UPDATE per-subplan result rels.

I think "has" should be "contains" ?

Ok, changed it to 'contains'.

9. Also, this should likely be reworded:

* 'num_update_rri' : number of UPDATE per-subplan result rels. For INSERT,
* this is 0.

'num_update_rri' number of elements in 'update_rri' array or zero for INSERT.

Done.

10. There should be no space before the '?'

/* Is this leaf partition present in the update resultrel ? */

Done.

11. I'm struggling to understand this comment:

* This is required when converting tuple as per root
* partition tuple descriptor.

"tuple" should probably be "the tuple", but not quite sure what you
mean by "as per root".

I may have misunderstood, but maybe it should read:

* This is required when we convert the partition's tuple to
* be compatible with the partitioned table's tuple descriptor.

ri_PartitionRoot is set to NULL while creating the result rels for
each of the UPDATE subplans, and it is required to be set to the root
table for leaf partitions created for tuple routing so that the error
message displays the row using the root tuple descriptor. Because we re-use
the same result rels for the per-partition array, we need to set it
for them here.

I have reworded the comment this way :

* This is required when we convert the partition's tuple to be
* compatible with the root partitioned table's tuple
* descriptor. When generating the per-subplan UPDATE result
* rels, this was not set.

Let me know if this is clear enough.

12. I think "as well" would be better written as "either".

* If we didn't open the partition rel, it means we haven't
* initialized the result rel as well.

Done.

13. I'm unsure what is meant by the following comment:

* Verify result relation is a valid target for insert operation. Even
* for updates, we are doing this for tuple-routing, so again, we need
* to check the validity for insert operation.

I'm not quite sure where UPDATE comes in here as we're only checking for INSERT?

Here, "Even for update" means "Even when
ExecSetupPartitionTupleRouting() is called for an UPDATE operation".

14. Use of underscores instead of camelCase.

COPY_SCALAR_FIELD(part_cols_updated);

I know you're not the first one to break this as "partitioned_rels"
does not follow it either, but that's probably not a good enough
reason to break away from camelCase any further.

I'd suggest "partColsUpdated". But after a re-think, maybe cols is
incorrect. All columns are partitioned, it's the key columns that we
care about, so how about "partKeyUpdate"

Sure. I have used partKeyUpdated as against partKeyUpdate.

15. Are you sure that you mean "root" here?

* All the child partition attribute numbers are converted to the root
* partitioned table.

Surely this is just the target relation. "parent" maybe? A
sub-partitioned table might be the target of an UPDATE too.

Here the root means the root of the partition subtree, which is also
the UPDATE target relation. I think in other places we call it the
root even though it may also have ancestors. It is the root of the
subtree in question. This is similar to how we have named the
ModifyTableState->rootResultRelInfo field.

Note that Robert has requested to collect the partition cols at some
other place where we have already the table open. So this function
itself may change.

15. I see get_all_partition_cols() is just used once to check if
parent_rte->updatedCols contains and partition keys.

Would it not be better to reform that function and pass
parent_rte->updatedCols in and abort as soon as you see a single
match?

Maybe the function could return bool and be named
partitioned_key_overlaps(), that way your assignment in
inheritance_planner() would just become:

part_cols_updated = partitioned_key_overlaps(root->parse->rtable,
top_parentRTindex, partitioned_rels, parent_rte->updatedCols);

or something like that anyway.

I am going to think on all of this when I start checking if we can
have some common code for pull_child_partition_columns() and
is_partition_attr(). (TODO)

One thing to note is : usually the user is not going to modify
partition columns, so typically we would need to scan through all the
partitioned tables to check if the partition key is modified. To make
this scan more efficient, we avoid the bitmap-overlap operation for
each of the partitioned tables separately, and instead collect the key
columns first from all partitioned tables and then do a single overlap
operation. This way we make the normal updates a tiny bit faster, at
the expense of tiny-bit-slower partition-key updates, because we don't
abort the scan as soon as we find the partition key updated.

16. Typo in comment

* 'part_cols_updated' if any partitioning columns are being updated, either
* from the named relation or a descendent partitione table.

"partitione" should be "partitioned". Also, normally for bool
parameters, we might word things like "True if ..." rather than just "if"

You probably should follow camelCase I mentioned in 14 here too.

Done. Similar to the other bool param canSetTag, made it :
"'partColsUpdated' is true if any ..."

17. Comment needs a few changes:

* ConvertPartitionTupleSlot -- convenience function for converting tuple and
* storing it into a tuple slot provided through 'new_slot', which typically
* should be one of the dedicated partition tuple slot. Passes the partition
* tuple slot back into output param p_old_slot. If no mapping present, keeps
* p_old_slot unchanged.
*
* Returns the converted tuple.

There are a few typos here. For example, "tuple" should be "a tuple",
but maybe the comment should just be worded like:

* ConvertPartitionTupleSlot -- convenience function for tuple conversion
* using 'map'. The tuple, if converted, is stored in 'new_slot' and
* 'p_old_slot' is set to the original partition tuple slot. If map is NULL,
* then the original tuple is returned unmodified, otherwise the converted
* tuple is returned.

Modified, with some changes. p_old_slot name is a bit confusing. So I
have renamed it to p_my_slot.
Here is how it looks now :

* ConvertPartitionTupleSlot -- convenience function for tuple conversion using
* 'map'. The tuple, if converted, is stored in 'new_slot', and 'p_my_slot' is
* updated with the 'new_slot'. 'new_slot' typically should be one of the
* dedicated partition tuple slots. If map is NULL, keeps p_my_slot unchanged.
*
* Returns the converted tuple, unless map is NULL, in which case original
* tuple is returned unmodified.

18. Line goes over 80 chars.

TransitionCaptureState *transition_capture = mtstate->mt_transition_capture;

Better just to split the declaration and assignment.

Done.

19. Confusing comment:

/*
* If the original operation is UPDATE, the root partitioned table
* needs to be fetched from mtstate->rootResultRelInfo.
*/

It's not that clear here how you determine this is an UPDATE of a
partitioned key.

20. This code looks convoluted:

rootResultRelInfo = (mtstate->rootResultRelInfo ?
mtstate->rootResultRelInfo : resultRelInfo);

/*
* If the resultRelInfo is not the root partitioned table (which
* happens for UPDATE), we should convert the tuple into root's tuple
* descriptor, since ExecFindPartition() starts the search from root.
* The tuple conversion map list is in the order of
* mtstate->resultRelInfo[], so to retrieve the one for this resultRel,
* we need to know the position of the resultRel in
* mtstate->resultRelInfo[].
*/
if (rootResultRelInfo != resultRelInfo)
{

rootResultRelInfo is assigned via a ternary expression which makes the
subsequent if test seem a little strange.

Would it not be better to test:

if (mtstate->rootResultRelInfo)
{
rootResultRelInfo = mtstate->rootResultRelInfo
... other stuff ...
}
else
rootResultRelInfo = resultRelInfo;

Then above the if test you can explain that rootResultRelInfo is only
set during UPDATE of partition keys, as per #19.

Giving more thought on this, I think to avoid confusion to the reader,
we better have an explicit (operation == CMD_UPDATE) condition, and in
that block, assert that mtstate->rootResultRelInfo is non-NULL. I have
accordingly shuffled the if conditions. I think this is simple and
clear. Please check.

21. How come you renamed mt_partition_tupconv_maps[] to
mt_parentchild_tupconv_maps[]?

mt_transition_tupconv_maps must be renamed to a more general map name
because it is not only used for transition capture but also for update
tuple routing. And we have mt_partition_tupconv_maps which is already
a general name. So to distinguish between the two tupconv maps, I
prepended "parent-child" or "child-parent" to "tupconv_maps".

22. Comment in ExecInsert() could be worded better.

/*
* In case this is part of update tuple routing, put this row into the
* transition NEW TABLE if we are capturing transition tables. We need to
* do this separately for DELETE and INSERT because they happen on
* different tables.
*/

/*
* This INSERT may be the result of a partition-key-UPDATE. If so,
* and we're required to capture transition tables then we'd better
* record this as a statement level UPDATE on the target relation.
* We're not interested in the statement level DELETE or INSERT as
* these occur on the individual partitions, none of which are the
* target of the UPDATE statement.
*/

A similar comment could use a similar improvement in ExecDelete()

I want to emphasize the fact that we need to do the OLD and NEW row
separately for DELETE and INSERT. Also, I think we need not mention
statement triggers, though the transition table capture with
partitions is currently supported only for statement triggers. We
should only worry about capturing the row if
mtstate->mt_transition_capture != NULL, without having to know whether
it is for statement trigger or not.

Below is how the comment looks now after I did some changes as per
your suggestion about wording :

* If this INSERT is part of a partition-key-UPDATE and we are capturing
* transition tables, put this row into the transition NEW TABLE.
* (Similarly we need to add the deleted row in OLD TABLE). We need to do
* this separately for DELETE and INSERT because they happen on different
* tables.

23. Line is longer than 80 chars.

TransitionCaptureState *transition_capture = mtstate->mt_transition_capture;

Done.

24. I know from reading the thread this name has changed before, but I
think delete_skipped seems like the wrong name for this variable in:

if (delete_skipped)
*delete_skipped = true;

Skipped is the wrong word here as that indicates like we had some sort
of choice and that we decided not to. However, that's not the case
when the tuple was concurrently deleted. Would it not be better to
call it "tuple_deleted" or even "success" and reverse the logic? It's
just a bit confusing that you're setting this to skipped before
anything happens. It would be nicer if there was a better way to do
this whole thing as it's a bit of a wart in the code. I understand why
the code exists though.

I think "success" sounds like : if it is false, ExecDelete has failed.
So I have chosen "tuple_deleted". "tuple_actually_deleted" might sound
even better, but it is too long.

Also, I wonder if it's better to always pass a boolean here to save
having to test for NULL before setting it, that way you might consider
putting the success = false just before the return NULL, then do
success = true after the tuple is gone.
Failing that, putting: something like: success = false; /* not yet! */
where you're doing the if (deleted_skipped) test is might also be
better.

I didn't really understand this.

25. Comment "we should" should be "we must".

/*
* For some reason if DELETE didn't happen (for e.g. trigger
* prevented it, or it was already deleted by self, or it was
* concurrently deleted by another transaction), then we should
* skip INSERT as well, otherwise, there will be effectively one
* new row inserted.

Maybe just:
/* If the DELETE operation was unsuccessful, then we must not
* perform the INSERT into the new partition.

I think we'd better mention some scenarios of why this can happen,
otherwise it's confusing to the reader why the delete can't happen, or
why we shouldn't error out in that case.

"for e.g." is not really correct in English. "For example, ..." or
just "e.g. ..." is correct. If you de-abbreviate the e.g. then you've
written "For exempli gratia", which translates to "For for example".

I see. Good to know that. Done.

26. You're not really explaining what's going on here:

if (mtstate->mt_transition_capture)
saved_tcs_map = mtstate->mt_transition_capture->tcs_map;

You have a comment later to say you're about to "Revert back to the
transition capture map", but I missed the part that explained about
modifying it in the first place.

I have now added main comments while saving the map, and I refer to
this comment while reverting back the map.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

update-partition-key_v24.patchapplication/octet-stream; name=update-partition-key_v24.patchDownload
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index daba66c..a6e6160 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3005,6 +3005,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition where this row satisfies its partition constraint.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3297,9 +3302,22 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</command> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</command> causes a row to move from one
+       partition to another, there is a chance that another concurrent
+       <command>UPDATE</command> or <command>DELETE</command> misses this row.
+       Suppose, session 1 is performing an <command>UPDATE</command> on a
+       partition key, and meanwhile a concurrent session 2 for which this row
+       is visible, performs an <command>UPDATE</command> or
+       <command>DELETE</command> operation on this row. Session 2 can silently
+       miss the row if the row is deleted from the partition due to session
+       1's activity.  In such a case, session 2's
+       <command>UPDATE</command>/<command>DELETE</command>, being unaware of
+       the row movement, interprets that the row has just been deleted so there
+       is nothing to be done for this row. Whereas, in the usual case where the
+       table is not partitioned, or where there is no row movement, session 2
+       would have identified the newly updated row and carried
+       <command>UPDATE</command>/<command>DELETE</command> on this new row
+       version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 0e99aa9..f525e16 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</command> and
+   <command>INSERT</command> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</command> or <command>DELETE</command> on the
+   same row may miss this row. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index b0e160a..479d4e2 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -154,6 +154,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to move
+    to another partition, it will be performed as a <command>DELETE</command>
+    from the original partition followed by <command>INSERT</command> into the
+    new partition. In this case, all row-level <literal>BEFORE</literal>
+    <command>UPDATE</command> triggers and all row-level
+    <literal>BEFORE</literal> <command>DELETE</command> triggers are fired on
+    the original partition. Then all row-level <literal>BEFORE</literal>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    Surprising outcomes are possible when all of these triggers affect the
+    row being moved. As far as <literal>AFTER ROW</literal>
+    triggers are concerned, <literal>AFTER</literal> <command>DELETE</command>
+    and <literal>AFTER</literal> <command>INSERT</command> triggers are
+    applied; but <literal>AFTER</literal> <command>UPDATE</command> triggers
+    are not applied because the <command>UPDATE</command> has been converted to
+    a <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index ce29ba2..2e9ce81 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1440,7 +1440,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
 
 /*
  * map_partition_varattnos - maps varattno of any Vars in expr from the
- * parent attno to partition attno.
+ * attnos of 'from_rel' to the attnos of 'to_rel'.
+ * Each of the rels can be either a leaf partition or a partitioned table.
  *
  * We must allow for cases where physical attnos of a partition can be
  * different from the parent's.
@@ -1453,8 +1454,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
  * are working on Lists, so it's less messy to do the casts internally.
  */
 List *
-map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row)
 {
 	bool		my_found_whole_row = false;
@@ -1463,14 +1464,14 @@ map_partition_varattnos(List *expr, int target_varno,
 	{
 		AttrNumber *part_attnos;
 
-		part_attnos = convert_tuples_by_name_map(RelationGetDescr(partrel),
-												 RelationGetDescr(parent),
+		part_attnos = convert_tuples_by_name_map(RelationGetDescr(to_rel),
+												 RelationGetDescr(from_rel),
 												 gettext_noop("could not convert row type"));
 		expr = (List *) map_variable_attnos((Node *) expr,
-											target_varno, 0,
+											fromrel_varno, 0,
 											part_attnos,
-											RelationGetDescr(parent)->natts,
-											RelationGetForm(partrel)->reltype,
+											RelationGetDescr(from_rel)->natts,
+											RelationGetForm(to_rel)->reltype,
 											&my_found_whole_row);
 	}
 
@@ -2578,6 +2579,79 @@ get_partition_for_tuple(Relation relation, Datum *values, bool *isnull)
 }
 
 /*
+ * pull_child_partition_columns
+ *
+ * For each column of rel which is in the partition key or which appears
+ * in an expression which is in the partition key, translate the attribute
+ * number of that column according to the given parent, and add the resulting
+ * column number to the 'partcols' bitmapset, offset as we frequently do by
+ * FirstLowInvalidHeapAttributeNumber.
+ */
+void
+pull_child_partition_columns(Relation rel,
+							 Relation parent,
+							 Bitmapset **partcols)
+{
+	PartitionKey key = RelationGetPartitionKey(rel);
+	int			partnatts = get_partition_natts(key);
+	List	   *partexprs = get_partition_exprs(key);
+	ListCell   *lc;
+	Bitmapset  *child_keycols = NULL;
+	int			i;
+	AttrNumber *map;
+	int			child_keycol = -1;
+
+	/*
+	 * First, compute the complete set of partition columns for this rel. For
+	 * compatibility with the API exposed by pull_varattnos, we offset the
+	 * column numbers by FirstLowInvalidHeapAttributeNumber.
+	 */
+	for (i = 0; i < partnatts; i++)
+	{
+		AttrNumber	partattno = get_partition_col_attnum(key, i);
+
+		if (partattno != 0)
+			child_keycols =
+				bms_add_member(child_keycols,
+							   partattno - FirstLowInvalidHeapAttributeNumber);
+	}
+	foreach(lc, partexprs)
+	{
+		Node	   *expr = (Node *) lfirst(lc);
+
+		pull_varattnos(expr, 1, &child_keycols);
+	}
+
+	/*
+	 * Next, work out how to convert from the attribute numbers for the child
+	 * to the attribute numbers for the parent.
+	 */
+	map =
+		convert_tuples_by_name_map(RelationGetDescr(parent),
+								   RelationGetDescr(rel),
+								   gettext_noop("could not convert row type"));
+
+	/*
+	 * For each child key column we have identified, translate to the
+	 * corresponding parent key column.  Entry 0 in the map array corresponds
+	 * to attribute number 1, which corresponds to a bitmapset entry for 1 -
+	 * FirstLowInvalidHeapAttributeNumber.
+	 */
+	while ((child_keycol = bms_next_member(child_keycols, child_keycol)) >= 0)
+	{
+		int			kc = child_keycol + FirstLowInvalidHeapAttributeNumber;
+
+		Assert(kc > 0 && kc <= RelationGetNumberOfAttributes(rel));
+		*partcols =
+			bms_add_member(*partcols,
+						   map[kc - 1] - FirstLowInvalidHeapAttributeNumber);
+	}
+
+	/* Release memory. */
+	pfree(map);
+}
+
+/*
  * qsort_partition_hbound_cmp
  *
  * We sort hash bounds by modulus, then by remainder.
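The translation step in pull_child_partition_columns above can be sketched standalone. The name map is assumed to satisfy map[child_attno - 1] == parent attno, matching how the patch uses the result of convert_tuples_by_name_map(parent, child); the constant is repeated from the headers of that era so the sketch compiles on its own:

```c
#include <assert.h>

/* As in the PG10-era headers; repeated so the sketch is self-contained. */
#define FirstLowInvalidHeapAttributeNumber (-8)

/*
 * For each child partition-key column, look up the corresponding parent
 * attno via the name map and offset it the way pull_varattnos-style
 * bitmapsets expect.
 */
static void
translate_key_cols(const int *child_attnos, int nkeys,
				   const int *map,	/* map[child_attno - 1] == parent attno */
				   int *out_offsets)
{
	for (int i = 0; i < nkeys; i++)
		out_offsets[i] = map[child_attnos[i] - 1]
			- FirstLowInvalidHeapAttributeNumber;
}
```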
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index d6b235c..2854f21 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2479,11 +2479,14 @@ CopyFrom(CopyState cstate)
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(cstate->rel,
+									   NULL,
+									   0,
 									   1,
 									   estate,
 									   &partition_dispatch_info,
 									   &partitions,
 									   &partition_tupconv_maps,
+									   NULL,
 									   &partition_tuple_slot,
 									   &num_parted, &num_partitions);
 		cstate->partition_dispatch_info = partition_dispatch_info;
@@ -2749,7 +2752,7 @@ CopyFrom(CopyState cstate)
 
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr || check_partition_constr)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 92ae382..73ec872 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2854,8 +2854,13 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 	{
 		HeapTuple	trigtuple;
 
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
+		/*
+		 * Note: if the UPDATE is converted into a DELETE+INSERT as part of
+		 * update-partition-key operation, then this function is also called
+		 * separately for DELETE and INSERT to capture transition table rows.
+		 * In that case, either the old tuple or the new tuple can be NULL.
+		 */
+		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
 			trigtuple = GetTupleForTrigger(estate,
 										   NULL,
 										   relinfo,
@@ -5414,7 +5419,12 @@ AfterTriggerPendingOnRel(Oid relid)
  *	triggers actually need to be queued.  It is also called after each row,
  *	even if there are no triggers for that event, if there are any AFTER
  *	STATEMENT triggers for the statement which use transition tables, so that
- *	the transition tuplestores can be built.
+ *	the transition tuplestores can be built.  Furthermore, if the transition
+ *	capture is happening for UPDATEd rows being moved to another partition due
+ *	to a partition-key change, then this function is called once when the row is
+ *	deleted (to capture OLD row), and once when the row is inserted to another
+ *	partition (to capture NEW row).  This is done separately because DELETE and
+ *	INSERT happen on different tables.
  *
  *	Transition tuplestores are built now, rather than when events are pulled
  *	off of the queue because AFTER ROW triggers are allowed to select from the
@@ -5463,12 +5473,25 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		bool		update_new_table = transition_capture->tcs_update_new_table;
+		bool		insert_new_table = transition_capture->tcs_insert_new_table;
 
-		if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_old_table))
+		/*
+		 * For INSERT events newtup should be non-NULL, for DELETE events
+		 * oldtup should be non-NULL, whereas for UPDATE events normally both
+		 * oldtup and newtup are non-NULL.  But for UPDATE event fired for
+		 * capturing transition tuples during UPDATE partition-key row
+		 * movement, oldtup is NULL when the event is for row being inserted,
+		 * whereas newtup is NULL when the event is for row being deleted.
+		 */
+		Assert(!(event == TRIGGER_EVENT_DELETE && delete_old_table &&
+				 oldtup == NULL));
+		Assert(!(event == TRIGGER_EVENT_INSERT && insert_new_table &&
+				 newtup == NULL));
+
+		if (oldtup != NULL &&
+			((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_old_table)))
 		{
 			Tuplestorestate *old_tuplestore;
 
-			Assert(oldtup != NULL);
 			old_tuplestore = transition_capture->tcs_private->old_tuplestore;
 
 			if (map != NULL)
@@ -5481,12 +5504,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			else
 				tuplestore_puttuple(old_tuplestore, oldtup);
 		}
-		if ((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_new_table))
+		if (newtup != NULL &&
+			((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
+			(event == TRIGGER_EVENT_UPDATE && update_new_table)))
 		{
 			Tuplestorestate *new_tuplestore;
 
-			Assert(newtup != NULL);
 			new_tuplestore = transition_capture->tcs_private->new_tuplestore;
 
 			if (original_insert_tuple != NULL)
@@ -5502,11 +5525,17 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				tuplestore_puttuple(new_tuplestore, newtup);
 		}
 
-		/* If transition tables are the only reason we're here, return. */
+		/*
+		 * If transition tables are the only reason we're here, return. As
+		 * mentioned above, we can also be here during update tuple routing in the
+		 * presence of transition tables, in which case this function is called
+		 * separately for oldtup and newtup, so either one can be NULL, but not both.
+		 */
 		if (trigdesc == NULL ||
 			(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
 			(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+			(event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
 			return;
 	}
 
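The XOR condition added at the end of the hunk above is the crux; isolated (with invented names), it reads:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Mirror of the guard added to AfterTriggerSaveEvent: during update row
 * movement the function is invoked twice, once with only the old tuple
 * (the DELETE side) and once with only the new tuple (the INSERT side).
 * Exactly one NULL therefore means "capture transition rows only; do not
 * queue AFTER UPDATE row triggers".
 */
static bool
is_capture_only_update_event(const void *oldtup, const void *newtup)
{
	return (oldtup == NULL) ^ (newtup == NULL);
}
```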
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index dbaa47f..5ec92d5 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1849,16 +1849,12 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  *
  * Exported in executor.h for outside use.
+ * Returns true if it meets the partition constraint, else returns false.
  */
-void
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
-	Relation	rel = resultRelInfo->ri_RelationDesc;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	Bitmapset  *modifiedCols;
-	Bitmapset  *insertedCols;
-	Bitmapset  *updatedCols;
 	ExprContext *econtext;
 
 	/*
@@ -1886,52 +1882,69 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
 	 */
-	if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+	return ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext);
+}
+
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
+	/*
+	 * Need to first convert the tuple to the root partitioned table's row
+	 * type. For details, check similar comments in ExecConstraints().
+	 */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
+									 gettext_noop("could not convert row type"));
+		if (map != NULL)
 		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
-										 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecSetSlotDescriptor(slot, tupdesc);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+			tuple = do_convert_tuple(tuple, map);
+			ExecSetSlotDescriptor(slot, tupdesc);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("new row for relation \"%s\" violates partition constraint",
-						RelationGetRelationName(orig_rel)),
-				 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+			 errmsg("new row for relation \"%s\" violates partition constraint",
+					RelationGetRelationName(orig_rel)),
+			 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if
+ * requested, checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1939,7 +1952,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2055,8 +2069,9 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 }
 
 
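The refactoring above splits the old combined check into a predicate plus a reporter, so that update tuple routing can test the partition constraint without raising an error. A generic standalone sketch of that pattern, with invented names and a plain exit() standing in for ereport(ERROR, ...):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Predicate: stands in for ExecPartitionCheck(). */
static bool
partition_check(int key, int lo, int hi)
{
	return key >= lo && key < hi;
}

/* Reporter: stands in for ExecPartitionCheckEmitError(). */
static void
partition_check_emit_error(int key)
{
	fprintf(stderr, "new row violates partition constraint: key=%d\n", key);
	exit(EXIT_FAILURE);
}

/*
 * A caller that cannot recover checks and reports, as ExecConstraints()
 * now does with check_partition_constraint = true; a caller that can
 * recover (update row movement) just tests the predicate and re-routes
 * the tuple instead of reporting.
 */
static void
constraints_check(int key, int lo, int hi)
{
	if (!partition_check(key, lo, hi))
		partition_check_emit_error(key);
}
```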
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index d275cef..0df3c27 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -41,6 +41,13 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
  * ExecSetupPartitionTupleRouting - set up information needed during
  * tuple routing for partitioned tables
  *
+ * 'update_rri' contains the UPDATE per-subplan result rels. For the output
+ *		param 'partitions', we don't allocate new ResultRelInfo objects for
+ *		leaf partitions for which they are already available in 'update_rri'.
+ *
+ * 'num_update_rri' is the number of elements in the 'update_rri' array, or
+ *		zero for INSERT.
+ *
  * Output arguments:
  * 'pd' receives an array of PartitionDispatch objects with one entry for
  *		every partitioned table in the partition tree
@@ -64,11 +71,14 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
  */
 void
 ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionDispatch **pd,
 							   ResultRelInfo ***partitions,
 							   TupleConversionMap ***tup_conv_maps,
+							   int **subplan_leaf_map,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions)
 {
@@ -76,7 +86,8 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
+	ResultRelInfo *leaf_part_arr;
+	int			update_rri_index = 0;
 
 	/*
 	 * Get the information about the partition tree after locking all the
@@ -85,11 +96,45 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
 	*pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
 	*num_partitions = list_length(leaf_parts);
+	if (subplan_leaf_map)
+		*subplan_leaf_map = NULL;
 	*partitions = (ResultRelInfo **) palloc(*num_partitions *
 											sizeof(ResultRelInfo *));
 	*tup_conv_maps = (TupleConversionMap **) palloc0(*num_partitions *
 													 sizeof(TupleConversionMap *));
 
+	if (num_update_rri != 0)
+	{
+		/*
+		 * For UPDATE, if a leaf partition is already present in the
+		 * per-subplan result rels, we re-use it rather than initialize a new
+		 * result rel. The per-subplan resultrels and the resultrels of the
+		 * leaf partitions are both in the same canonical order, so while
+		 * scanning the leaf partition oids we just keep track of the next
+		 * per-subplan result rel still to be matched: initialize
+		 * update_rri_index to the first per-subplan result rel, and advance
+		 * it each time we find its relation among the leaf partition
+		 * oids.
+		 */
+		update_rri_index = 0;
+
+		/*
+		 * Prepare for generating the mapping from subplan result rels to leaf
+		 * partition position.
+		 */
+		*subplan_leaf_map = palloc(num_update_rri * sizeof(int));
+	}
+	else
+	{
+		/*
+		 * For inserts, we need to create all new result rels, so avoid
+		 * repeated pallocs by allocating memory for all the result rels in
+		 * bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(*num_partitions *
+												  sizeof(ResultRelInfo));
+	}
+
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
 	 * given partition's rowtype.  It is attached to the caller-specified node
@@ -98,20 +143,67 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	 */
 	*partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = (ResultRelInfo *) palloc0(*num_partitions *
-											  sizeof(ResultRelInfo));
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (num_update_rri != 0)
+		{
+			/* Is this leaf partition present in the update resultrel? */
+			if (update_rri_index < num_update_rri &&
+				RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
+			{
+				leaf_part_rri = &update_rri[update_rri_index];
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required when we convert the partition's tuple to be
+				 * compatible with the root partitioned table's tuple
+				 * descriptor.  When generating the per-subplan UPDATE result
+				 * rels, this was not set.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+
+				/*
+				 * Save the position of this update rel in the leaf partitions
+				 * array
+				 */
+				(*subplan_leaf_map)[update_rri_index] = i;
+
+				update_rri_index++;
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = leaf_part_arr + i;
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in *partitions are eventually
-		 * closed by the caller.
+		 * If we didn't open the partition rel, it means we haven't initialized
+		 * the result rel either.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above including the leaf
+			 * partitions. Note that each of the newly opened relations in
+			 * *partitions are eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri,
+							  partrel,
+							  resultRTindex,
+							  rel,
+							  estate->es_instrument);
+		}
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
@@ -121,14 +213,10 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
 													 gettext_noop("could not convert row type"));
 
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  resultRTindex,
-						  rel,
-						  estate->es_instrument);
-
 		/*
-		 * Verify result relation is a valid target for INSERT.
+		 * Verify result relation is a valid target for an insert operation.
+		 * Even for updates, we do this because tuple routing inserts the row
+		 * into the destination partition, so insert validity must be checked.
 		 */
 		CheckValidResultRel(leaf_part_rri, CMD_INSERT);
 
@@ -144,9 +232,15 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		(*partitions)[i] = leaf_part_rri++;
+		(*partitions)[i] = leaf_part_rri;
 		i++;
 	}
+
+	/*
+	 * For UPDATE, we should have found all the per-subplan resultrels in the
+	 * leaf partitions.
+	 */
+	Assert(num_update_rri == 0 || update_rri_index == num_update_rri);
 }
 
 /*
@@ -177,8 +271,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 	/* start with the root partitioned table */
 	parent = pd[0];
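The canonical-order matching described in the ExecSetupPartitionTupleRouting comments above can be sketched standalone (invented names; Oid redefined so it compiles on its own):

```c
#include <assert.h>

typedef unsigned int Oid;	/* stands in for PostgreSQL's Oid */

/*
 * The per-subplan UPDATE result rels and the leaf partition list share
 * the same canonical order, so a single forward pass pairs them up.
 * subplan_leaf_map[i] receives the leaf index of the i-th subplan rel;
 * unmatched leaves would get freshly initialized ResultRelInfos.
 */
static int
build_subplan_leaf_map(const Oid *leaf_oids, int nleaf,
					   const Oid *subplan_oids, int nsubplan,
					   int *subplan_leaf_map)
{
	int			next = 0;	/* next subplan result rel to locate */

	for (int i = 0; i < nleaf; i++)
	{
		if (next < nsubplan && subplan_oids[next] == leaf_oids[i])
			subplan_leaf_map[next++] = i;
	}
	return next;	/* the patch Asserts this equals nsubplan */
}
```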
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index fb538c0..e11f7cb 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -401,7 +401,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -466,7 +466,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 503b89f..ac1dc67 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -46,6 +46,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -63,7 +64,16 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 EState *estate,
 					 bool canSetTag,
 					 TupleTableSlot **returning);
-
+static void ExecSetupChildParentMap(ModifyTableState *mtstate,
+						ResultRelInfo *rootRelInfo,
+						int numResultRelInfos, bool perleaf);
+static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
+													int whichplan);
+static HeapTuple ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+										   TupleConversionMap *map,
+										   HeapTuple tuple,
+										   TupleTableSlot *new_slot,
+										   TupleTableSlot **p_my_slot);
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
  * target relation's rowtype
@@ -241,6 +251,38 @@ ExecCheckTIDVisible(EState *estate,
 	ReleaseBuffer(buffer);
 }
 
+/*
+ * ConvertPartitionTupleSlot -- convenience function for tuple conversion using
+ * 'map'. The tuple, if converted, is stored in 'new_slot', and 'p_my_slot' is
+ * updated with the 'new_slot'. 'new_slot' typically should be one of the
+ * dedicated partition tuple slots. If map is NULL, keeps p_my_slot unchanged.
+ *
+ * Returns the converted tuple, unless map is NULL, in which case original
+ * tuple is returned unmodified.
+ */
+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+						  TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot,
+						  TupleTableSlot **p_my_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor, as per converted tuple.
+	 */
+	*p_my_slot = new_slot;
+	Assert(new_slot != NULL);
+	ExecSetSlotDescriptor(new_slot, map->outdesc);
+	ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
 /* ----------------------------------------------------------------
  *		ExecInsert
  *
@@ -266,6 +308,9 @@ ExecInsert(ModifyTableState *mtstate,
 	Oid			newId;
 	List	   *recheckIndexes = NIL;
 	TupleTableSlot *result = NULL;
+	TransitionCaptureState *transition_capture;
+
+	transition_capture = mtstate->mt_transition_capture;
 
 	/*
 	 * get the heap tuple out of the tuple table slot, making sure we have a
@@ -282,17 +327,47 @@ ExecInsert(ModifyTableState *mtstate,
 	if (mtstate->mt_partition_dispatch_info)
 	{
 		int			leaf_part_index;
-		TupleConversionMap *map;
+		ResultRelInfo *rootResultRelInfo;
+
+		/*
+		 * For UPDATE, the resultRelInfo is not the root partitioned table, so
+		 * we should convert the tuple into root's tuple descriptor, since
+		 * ExecFindPartition() starts the search from root.  The tuple
+		 * conversion map list is in the order of mtstate->resultRelInfo[], so
+		 * to retrieve the one for this resultRel, we need to know the position
+		 * of the resultRel in mtstate->resultRelInfo[].
+		 */
+		if (mtstate->operation == CMD_UPDATE)
+		{
+			int			map_index = resultRelInfo - mtstate->resultRelInfo;
+			TupleConversionMap *tupconv_map;
+
+			Assert(mtstate->rootResultRelInfo != NULL);
+			rootResultRelInfo = mtstate->rootResultRelInfo;
+
+			/* resultRelInfo must be one of the per-subplan result rels. */
+			Assert(resultRelInfo >= mtstate->resultRelInfo &&
+				   resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1);
+
+			tupconv_map = tupconv_map_for_subplan(mtstate, map_index);
+			tuple = ConvertPartitionTupleSlot(mtstate,
+											  tupconv_map,
+											  tuple,
+											  mtstate->mt_root_tuple_slot,
+											  &slot);
+		}
+		else
+			rootResultRelInfo = resultRelInfo;
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
 		 * ExecFindPartition() does not return and errors out instead.
 		 * Otherwise, the returned value is to be used as an index into arrays
-		 * mt_partitions[] and mt_partition_tupconv_maps[] that will get us
+		 * mt_partitions[] and mt_parentchild_tupconv_maps[] that will get us
 		 * the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
-		leaf_part_index = ExecFindPartition(resultRelInfo,
+		leaf_part_index = ExecFindPartition(rootResultRelInfo,
 											mtstate->mt_partition_dispatch_info,
 											slot,
 											estate);
@@ -331,8 +406,10 @@ ExecInsert(ModifyTableState *mtstate,
 				 * back to tuplestore format.
 				 */
 				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+
+				Assert(mtstate->mt_is_tupconv_perpart == true);
 				mtstate->mt_transition_capture->tcs_map =
-					mtstate->mt_transition_tupconv_maps[leaf_part_index];
+					mtstate->mt_childparent_tupconv_maps[leaf_part_index];
 			}
 			else
 			{
@@ -345,30 +422,21 @@ ExecInsert(ModifyTableState *mtstate,
 			}
 		}
 		if (mtstate->mt_oc_transition_capture != NULL)
+		{
+			Assert(mtstate->mt_is_tupconv_perpart == true);
 			mtstate->mt_oc_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[leaf_part_index];
+				mtstate->mt_childparent_tupconv_maps[leaf_part_index];
+		}
 
 		/*
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = mtstate->mt_partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = mtstate->mt_partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(mtstate,
+										  mtstate->mt_parentchild_tupconv_maps[leaf_part_index],
+										  tuple,
+										  mtstate->mt_partition_tuple_slot,
+										  &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -486,7 +554,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 		/* Check the constraints of the tuple */
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -622,9 +690,32 @@ ExecInsert(ModifyTableState *mtstate,
 		setLastTid(&(tuple->t_self));
 	}
 
+	/*
+	 * If this INSERT is part of a partition-key-UPDATE and we are capturing
+	 * transition tables, put this row into the transition NEW TABLE.
+	 * (Similarly, the deleted row is added to the OLD TABLE.)  We need to do
+	 * this separately for DELETE and INSERT because they happen on different
+	 * tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_new_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo, NULL,
+							 NULL,
+							 tuple,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * Now that the NEW TABLE row has been captured, make sure the AR
+		 * INSERT trigger below does not capture it again.
+		 */
+		transition_capture = NULL;
+	}
+
 	/* AFTER ROW INSERT Triggers */
 	ExecARInsertTriggers(estate, resultRelInfo, tuple, recheckIndexes,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
 	list_free(recheckIndexes);
 
@@ -678,6 +769,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool *tuple_deleted,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -685,6 +778,12 @@ ExecDelete(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
+	TransitionCaptureState *transition_capture;
+
+	transition_capture = mtstate->mt_transition_capture;
+
+	if (tuple_deleted)
+		*tuple_deleted = false;
 
 	/*
 	 * get information on the (current) result relation
@@ -849,12 +948,39 @@ ldelete:;
 	if (canSetTag)
 		(estate->es_processed)++;
 
+	/* The delete actually happened, so inform the caller */
+	if (tuple_deleted)
+		*tuple_deleted = true;
+
+	/*
+	 * If this DELETE is part of update tuple routing, put this row into the
+	 * transition OLD TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_old_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo,
+							 tupleid,
+							 oldtuple,
+							 NULL,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * Now that the OLD TABLE row has been captured, make sure the AR
+		 * DELETE trigger below does not capture it again.
+		 */
+		transition_capture = NULL;
+	}
+
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -947,6 +1073,7 @@ ExecUpdate(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	TupleConversionMap *saved_tcs_map = NULL;
 
 	/*
 	 * abort the operation if not running transactions
@@ -1043,12 +1170,87 @@ lreplace:;
 								 resultRelInfo, slot, estate);
 
 		/*
+		 * If a partition check fails, try to move the row into the right
+		 * partition.
+		 */
+		if (resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate))
+		{
+			bool		tuple_deleted;
+			TupleTableSlot *ret_slot;
+
+			/*
+			 * When an UPDATE is run directly on a leaf partition, we do not
+			 * have the partition tuple routing set up. In that case, fail
+			 * with a partition constraint violation error.
+			 */
+			if (mtstate->mt_partition_dispatch_info == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for the DELETE; rows are returned
+			 * from the subsequent INSERT instead.
+			 */
+			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
+					   &tuple_deleted, false, false);
+
+			/*
+			 * If the DELETE did not actually happen (e.g. a trigger prevented
+			 * it, or the row was already deleted by this command, or it was
+			 * concurrently deleted by another transaction), then skip the
+			 * INSERT as well; otherwise we would effectively insert a new row.
+			 *
+			 * For a normal UPDATE, the case where the tuple has been the
+			 * subject of a concurrent UPDATE or DELETE would be handled by
+			 * the EvalPlanQual machinery, but for an UPDATE that we've
+			 * translated into a DELETE from this partition and an INSERT into
+			 * some other partition, that's not available, because CTID chains
+			 * can't span relation boundaries.  We mimic the semantics to a
+			 * limited extent by skipping the INSERT if the DELETE fails to
+			 * find a tuple. This ensures that two concurrent attempts to
+			 * UPDATE the same tuple at the same time can't turn one tuple
+			 * into two, and that an UPDATE of a just-deleted tuple can't
+			 * resurrect it.
+			 */
+			if (!tuple_deleted)
+				return NULL;
+
+			/*
+			 * For UPDATE, the transition capture map is set only when a new
+			 * subplan is chosen, whereas for INSERT it is set for each row.
+			 * So after the INSERT below, we must restore the map created for
+			 * UPDATE; otherwise the next UPDATE would incorrectly use the one
+			 * created for INSERT.  First, save the one created for UPDATE.
+			 */
+			if (mtstate->mt_transition_capture)
+				saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
+
+			ret_slot = ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+
+			if (mtstate->mt_transition_capture)
+			{
+				/*
+				 * Restore the transition capture map saved above; see the
+				 * comment there.
+				 */
+				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+				mtstate->mt_transition_capture->tcs_map = saved_tcs_map;
+			}
+			return ret_slot;
+		}
+
+		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We have already checked partition constraints above, so skip
+		 * checking them here.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1476,7 +1678,6 @@ static void
 ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 {
 	ResultRelInfo *targetRelInfo = getASTriggerResultRelInfo(mtstate);
-	int			i;
 
 	/* Check for transition tables on the directly targeted relation. */
 	mtstate->mt_transition_capture =
@@ -1505,55 +1706,113 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 							 mtstate->mt_num_partitions :
 							 mtstate->mt_nplans);
 
+		ExecSetupChildParentMap(mtstate, targetRelInfo, numResultRelInfos,
+								(mtstate->mt_partition_dispatch_info != NULL));
+
 		/*
-		 * Build array of conversion maps from each child's TupleDesc to the
-		 * one used in the tuplestore.  The map pointers may be NULL when no
-		 * conversion is necessary, which is hopefully a common case for
-		 * partitions.
+		 * Install the conversion map for the first plan for UPDATE and DELETE
+		 * operations.  It will be advanced each time we switch to the next
+		 * plan.  (INSERT operations set it every time, so we need not update
+		 * mtstate->mt_oc_transition_capture here.)
 		 */
-		mtstate->mt_transition_tupconv_maps = (TupleConversionMap **)
-			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+		if (mtstate->mt_transition_capture && mtstate->operation != CMD_INSERT)
+			mtstate->mt_transition_capture->tcs_map =
+				tupconv_map_for_subplan(mtstate, 0);
+	}
+}
 
-		/* Choose the right set of partitions */
-		if (mtstate->mt_partition_dispatch_info != NULL)
-		{
-			/*
-			 * For tuple routing among partitions, we need TupleDescs based
-			 * on the partition routing table.
-			 */
-			ResultRelInfo **resultRelInfos = mtstate->mt_partitions;
+/*
+ * Initialize the child-to-root tuple conversion map array.
+ */
+static void
+ExecSetupChildParentMap(ModifyTableState *mtstate,
+						ResultRelInfo *rootRelInfo,
+						int numResultRelInfos, bool perleaf)
+{
+	TupleDesc	outdesc;
+	int			i;
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
-		}
-		else
-		{
-			/* Otherwise we need the ResultRelInfo for each subplan. */
-			ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+	/* First check if there is already one */
+	if (mtstate->mt_childparent_tupconv_maps)
+		return;
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
+	/* Get tuple descriptor of the root partitioned table. */
+	outdesc = RelationGetDescr(rootRelInfo->ri_RelationDesc);
+
+	/*
+	 * Build array of conversion maps from each child's TupleDesc to the
+	 * one used in the tuplestore.  The map pointers may be NULL when no
+	 * conversion is necessary, which is hopefully a common case for
+	 * partitions.
+	 */
+	mtstate->mt_childparent_tupconv_maps = (TupleConversionMap **)
+		palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+
+	/* Choose the right set of partitions */
+	if (perleaf)
+	{
+		/*
+		 * For tuple routing among partitions, we need TupleDescs based
+		 * on the partition routing table.
+		 */
+		ResultRelInfo **resultRelInfos = mtstate->mt_partitions;
+
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_childparent_tupconv_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
+									   outdesc,
+									   gettext_noop("could not convert row type"));
 		}
 
 		/*
-		 * Install the conversion map for the first plan for UPDATE and DELETE
-		 * operations.  It will be advanced each time we switch to the next
-		 * plan.  (INSERT operations set it every time, so we need not update
-		 * mtstate->mt_oc_transition_capture here.)
+		 * Record that the tuple conversion maps are per-leaf-partition,
+		 * not per-subplan.
 		 */
-		if (mtstate->mt_transition_capture)
-			mtstate->mt_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[0];
+		mtstate->mt_is_tupconv_perpart = true;
+	}
+	else
+	{
+		/* Otherwise we need the ResultRelInfo for each subplan. */
+		ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_childparent_tupconv_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
+									   outdesc,
+									   gettext_noop("could not convert row type"));
+		}
+	}
+
+}
+
+/*
+ * For a given subplan index, get the tuple conversion map.
+ */
+static TupleConversionMap *
+tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
+{
+	Assert(mtstate->mt_childparent_tupconv_maps != NULL);
+
+	/*
+	 * If the tuple conversion map array is per-partition, we first need to
+	 * get the index into the partition array.
+	 */
+	if (mtstate->mt_is_tupconv_perpart)
+	{
+		int leaf_index;
+
+		Assert(mtstate->mt_subplan_partition_offsets != NULL);
+		leaf_index = mtstate->mt_subplan_partition_offsets[whichplan];
+
+		Assert(leaf_index >= 0 && leaf_index < mtstate->mt_num_partitions);
+		return mtstate->mt_childparent_tupconv_maps[leaf_index];
+	}
+	else
+	{
+		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+		return mtstate->mt_childparent_tupconv_maps[whichplan];
 	}
 }
 
@@ -1660,15 +1919,13 @@ ExecModifyTable(PlanState *pstate)
 				/* Prepare to convert transition tuples from this child. */
 				if (node->mt_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
 					node->mt_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						tupconv_map_for_subplan(node, node->mt_whichplan);
 				}
 				if (node->mt_oc_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
 					node->mt_oc_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						tupconv_map_for_subplan(node, node->mt_whichplan);
 				}
 				continue;
 			}
@@ -1784,7 +2041,8 @@ ExecModifyTable(PlanState *pstate)
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1829,9 +2087,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ResultRelInfo *resultRelInfo;
 	TupleDesc	tupDesc;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	bool		update_tuple_routing_needed = node->partKeyUpdated;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -1904,6 +2165,15 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(resultRelInfo, mtstate->mt_onconflict != ONCONFLICT_NONE);
 
+		/*
+		 * If this is an UPDATE and a BEFORE UPDATE trigger is present, the
+		 * trigger might change the partition key, requiring tuple routing.
+		 */
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+			operation == CMD_UPDATE)
+			update_tuple_routing_needed = true;
+
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
 		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
@@ -1941,31 +2211,51 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/* Decide whether we need to perform update tuple routing. */
+	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		update_tuple_routing_needed = false;
+
+	/*
+	 * Build state for tuple routing if it's an INSERT or if it's an UPDATE of
+	 * partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || update_tuple_routing_needed))
 	{
 		PartitionDispatch *partition_dispatch_info;
 		ResultRelInfo **partitions;
 		TupleConversionMap **partition_tupconv_maps;
+		int *subplan_leaf_map;
 		TupleTableSlot *partition_tuple_slot;
 		int			num_parted,
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(rel,
+									   mtstate->resultRelInfo,
+									   (operation == CMD_UPDATE ? nplans : 0),
 									   node->nominalRelation,
 									   estate,
 									   &partition_dispatch_info,
 									   &partitions,
 									   &partition_tupconv_maps,
+									   &subplan_leaf_map,
 									   &partition_tuple_slot,
 									   &num_parted, &num_partitions);
 		mtstate->mt_partition_dispatch_info = partition_dispatch_info;
 		mtstate->mt_num_dispatch = num_parted;
 		mtstate->mt_partitions = partitions;
 		mtstate->mt_num_partitions = num_partitions;
-		mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
+		mtstate->mt_parentchild_tupconv_maps = partition_tupconv_maps;
+		mtstate->mt_subplan_partition_offsets = subplan_leaf_map;
 		mtstate->mt_partition_tuple_slot = partition_tuple_slot;
+		mtstate->mt_root_tuple_slot = MakeTupleTableSlot();
+
+		/*
+		 * These are needed as reference objects for mapping partition
+		 * attnos in expressions such as WCO and RETURNING.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	}
 
 	/*
@@ -1976,6 +2266,18 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		ExecSetupTransitionCaptureState(mtstate, estate);
 
 	/*
+	 * Construct a mapping from each partition's attnos to the root attnos.
+	 * This is required during update row movement, when the tuple descriptor
+	 * of a source partition does not match that of the root partitioned
+	 * table.  In that case we must convert tuples to the root tuple
+	 * descriptor, because the search for the destination partition starts
+	 * from the root.  Skip this setup if it's not a partition key update.
+	 */
+	if (update_tuple_routing_needed)
+		ExecSetupChildParentMap(mtstate, getASTriggerResultRelInfo(mtstate),
+								mtstate->mt_nplans, false);
+
+	/*
 	 * Initialize any WITH CHECK OPTION constraints if needed.
 	 */
 	resultRelInfo = mtstate->resultRelInfo;
@@ -2005,26 +2307,29 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE row
+	 * movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *first_wcoList;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition.  For UPDATE, however, there are as many WCO lists as
+		 * there are plans.  In either case, use the WCO expression of the
+		 * first resultRelInfo as a reference to compute attnos for the WCO
+		 * expression of each partition.  We make a copy of the WCO qual for
+		 * each partition; note that, if there are SubPlans in there, they
+		 * all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
-			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
+		Assert(update_tuple_routing_needed ||
+			   (operation == CMD_INSERT &&
+				list_length(node->withCheckOptionLists) == 1 &&
+				mtstate->mt_nplans == 1));
+
+		first_wcoList = linitial(node->withCheckOptionLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
 			Relation	partrel;
@@ -2033,17 +2338,26 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			ListCell   *ll;
 
 			resultRelInfo = mtstate->mt_partitions[i];
+
+			/*
+			 * If this resultRelInfo is one of the UPDATE subplan result rels,
+			 * its WithCheckOptions will already have been initialized, so
+			 * skip it.
+			 */
+			if (resultRelInfo->ri_WithCheckOptions)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel, NULL);
+			mapped_wcoList = map_partition_varattnos(first_wcoList,
+													firstVarno,
+													partrel, firstResultRel,
+													NULL);
 			foreach(ll, mapped_wcoList)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
@@ -2060,7 +2374,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -2097,22 +2411,35 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *rlist;
 
 			resultRelInfo = mtstate->mt_partitions[i];
+
+			/*
+			 * If this resultRelInfo is one of the UPDATE subplan result rels,
+			 * its RETURNING projection will already have been built, so
+			 * skip it.
+			 */
+			if (resultRelInfo->ri_projectReturning)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel, NULL);
+			/*
+			 * Use the returning expression of the first resultRelInfo as a
+			 * reference to calculate attno's for the returning expression of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList,
+											firstVarno,
+											partrel, firstResultRel, NULL);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
@@ -2357,6 +2684,7 @@ void
 ExecEndModifyTable(ModifyTableState *node)
 {
 	int			i;
+	CmdType		operation = node->operation;
 
 	/*
 	 * Allow any FDWs to shut down
@@ -2391,11 +2719,23 @@ ExecEndModifyTable(ModifyTableState *node)
 	{
 		ResultRelInfo *resultRelInfo = node->mt_partitions[i];
 
+		/*
+		 * If this result rel is one of the subplan result rels, let
+		 * ExecEndPlan() close it.  This does not apply to INSERT, where all
+		 * leaf partition result rels are newly allocated anyway.
+		 */
+		if (operation == CMD_UPDATE &&
+			resultRelInfo >= node->resultRelInfo &&
+			resultRelInfo < node->resultRelInfo + node->mt_nplans)
+			continue;
+
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
 
-	/* Release the standalone partition tuple descriptor, if any */
+	/* Release the standalone partition tuple descriptors, if any */
+	if (node->mt_root_tuple_slot)
+		ExecDropSingleTupleTableSlot(node->mt_root_tuple_slot);
 	if (node->mt_partition_tuple_slot)
 		ExecDropSingleTupleTableSlot(node->mt_partition_tuple_slot);
 
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 76e7545..f86a140 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -204,6 +204,7 @@ _copyModifyTable(const ModifyTable *from)
 	COPY_SCALAR_FIELD(canSetTag);
 	COPY_SCALAR_FIELD(nominalRelation);
 	COPY_NODE_FIELD(partitioned_rels);
+	COPY_SCALAR_FIELD(partKeyUpdated);
 	COPY_NODE_FIELD(resultRelations);
 	COPY_SCALAR_FIELD(resultRelIndex);
 	COPY_SCALAR_FIELD(rootResultRelIndex);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index dc35df9..90a512e 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -372,6 +372,7 @@ _outModifyTable(StringInfo str, const ModifyTable *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(partKeyUpdated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_INT_FIELD(resultRelIndex);
 	WRITE_INT_FIELD(rootResultRelIndex);
@@ -2100,6 +2101,7 @@ _outModifyTablePath(StringInfo str, const ModifyTablePath *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(partKeyUpdated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_NODE_FIELD(subroots);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 593658d..40b3a90 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1568,6 +1568,7 @@ _readModifyTable(void)
 	READ_BOOL_FIELD(canSetTag);
 	READ_UINT_FIELD(nominalRelation);
 	READ_NODE_FIELD(partitioned_rels);
+	READ_BOOL_FIELD(partKeyUpdated);
 	READ_NODE_FIELD(resultRelations);
 	READ_INT_FIELD(resultRelIndex);
 	READ_INT_FIELD(rootResultRelIndex);
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 9c74e39..4bbf192 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -278,6 +278,7 @@ static ProjectSet *make_project_set(List *tlist, Plan *subplan);
 static ModifyTable *make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool partKeyUpdated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
@@ -2371,6 +2372,7 @@ create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 							best_path->canSetTag,
 							best_path->nominalRelation,
 							best_path->partitioned_rels,
+							best_path->partKeyUpdated,
 							best_path->resultRelations,
 							subplans,
 							best_path->withCheckOptionLists,
@@ -6427,6 +6429,7 @@ static ModifyTable *
 make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool partKeyUpdated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam)
@@ -6453,6 +6456,7 @@ make_modifytable(PlannerInfo *root,
 	node->canSetTag = canSetTag;
 	node->nominalRelation = nominalRelation;
 	node->partitioned_rels = partitioned_rels;
+	node->partKeyUpdated = partKeyUpdated;
 	node->resultRelations = resultRelations;
 	node->resultRelIndex = -1;	/* will be set correctly in setrefs.c */
 	node->rootResultRelIndex = -1;	/* will be set correctly in setrefs.c */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 4c00a14..9ef2d24 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -112,6 +112,10 @@ typedef struct
 /* Local functions */
 static Node *preprocess_expression(PlannerInfo *root, Node *expr, int kind);
 static void preprocess_qual_conditions(PlannerInfo *root, Node *jtnode);
+static void get_all_partition_cols(List *rtable,
+					   Index root_rti,
+					   List *partitioned_rels,
+					   Bitmapset **all_part_cols);
 static void inheritance_planner(PlannerInfo *root);
 static void grouping_planner(PlannerInfo *root, bool inheritance_update,
 				 double tuple_fraction);
@@ -1049,6 +1053,40 @@ preprocess_phv_expression(PlannerInfo *root, Expr *expr)
 }
 
 /*
+ * get_all_partition_cols
+ *	  Get attribute numbers of all partition key columns of all the partitioned
+ *    tables.
+ *
+ * Child partition attribute numbers are mapped to the corresponding
+ * attributes of the root partitioned table.
+ */
+static void
+get_all_partition_cols(List *rtable,
+					   Index root_rti,
+					   List *partitioned_rels,
+					   Bitmapset **all_part_cols)
+{
+	ListCell   *lc;
+	Oid			root_relid = getrelid(root_rti, rtable);
+	Relation	root_rel;
+
+	/* The caller must have already locked all the partitioned tables. */
+	root_rel = heap_open(root_relid, NoLock);
+	*all_part_cols = NULL;
+	foreach(lc, partitioned_rels)
+	{
+		Index		rti = lfirst_int(lc);
+		Oid			relid = getrelid(rti, rtable);
+		Relation	part_rel = heap_open(relid, NoLock);
+
+		pull_child_partition_columns(part_rel, root_rel, all_part_cols);
+		heap_close(part_rel, NoLock);
+	}
+
+	heap_close(root_rel, NoLock);
+}
+
+/*
  * inheritance_planner
  *	  Generate Paths in the case where the result relation is an
  *	  inheritance set.
@@ -1093,6 +1131,7 @@ inheritance_planner(PlannerInfo *root)
 	Query	   *parent_parse;
 	Bitmapset  *parent_relids = bms_make_singleton(top_parentRTindex);
 	PlannerInfo **parent_roots = NULL;
+	bool		partColsUpdated = false;
 
 	Assert(parse->commandType != CMD_INSERT);
 
@@ -1163,10 +1202,23 @@ inheritance_planner(PlannerInfo *root)
 	parent_rte = rt_fetch(top_parentRTindex, root->parse->rtable);
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
+		Bitmapset	*all_part_cols = NULL;
+
 		nominalRelation = top_parentRTindex;
 		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
+
+		/*
+		 * Retrieve the partition key columns of all the partitioned tables
+		 * in the tree, so we can check whether any of the columns being
+		 * updated is a partition key column.
+		 */
+		get_all_partition_cols(root->parse->rtable, top_parentRTindex,
+							   partitioned_rels, &all_part_cols);
+
+		if (bms_overlap(all_part_cols, parent_rte->updatedCols))
+			partColsUpdated = true;
 	}
 
 	/*
@@ -1504,6 +1556,7 @@ inheritance_planner(PlannerInfo *root)
 									 parse->canSetTag,
 									 nominalRelation,
 									 partitioned_rels,
+									 partColsUpdated,
 									 resultRelations,
 									 subpaths,
 									 subroots,
@@ -2121,6 +2174,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 										parse->canSetTag,
 										parse->resultRelation,
 										NIL,
+										false,
 										list_make1_int(parse->resultRelation),
 										list_make1(path),
 										list_make1(root),
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 68dee0f..0ce5339 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3207,6 +3207,8 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partitioned_rels' is an integer list of RT indexes of non-leaf tables in
  *		the partition tree, if this is an UPDATE/DELETE to a partitioned table.
  *		Otherwise NIL.
+ * 'partKeyUpdated' is true if any partitioning columns are being updated,
+ *		either from the named relation or a descendant partitioned table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
  * 'subpaths' is a list of Path(s) producing source data (one per rel)
  * 'subroots' is a list of PlannerInfo structs (one per rel)
@@ -3220,6 +3222,7 @@ ModifyTablePath *
 create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool partKeyUpdated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
@@ -3287,6 +3290,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->canSetTag = canSetTag;
 	pathnode->nominalRelation = nominalRelation;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
+	pathnode->partKeyUpdated = partKeyUpdated;
 	pathnode->resultRelations = resultRelations;
 	pathnode->subpaths = subpaths;
 	pathnode->subroots = subroots;
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 295e9d2..0e5922d 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -54,12 +54,14 @@ extern void check_new_partition_bound(char *relname, Relation parent,
 extern Oid	get_partition_parent(Oid relid);
 extern List *get_qual_from_partbound(Relation rel, Relation parent,
 						PartitionBoundSpec *spec);
-extern List *map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+extern List *map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
-
+extern void pull_child_partition_columns(Relation rel,
+							 Relation parent,
+							 Bitmapset **partcols);
 extern Oid	get_default_oid_from_partdesc(PartitionDesc partdesc);
 extern Oid	get_default_partition_oid(Oid parentId);
 extern void update_default_partition_oid(Oid parentId, Oid defaultPartId);
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 64e5aab..41be2cf 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -50,11 +50,14 @@ typedef struct PartitionDispatchData
 typedef struct PartitionDispatchData *PartitionDispatch;
 
 extern void ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionDispatch **pd,
 							   ResultRelInfo ***partitions,
 							   TupleConversionMap ***tup_conv_maps,
+							   int **subplan_leaf_map,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions);
 extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index bee4ebf..0a2e76e 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,9 +187,12 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
-extern void ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
 				   TupleTableSlot *slot, EState *estate);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e05bc04..d2e8060 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -982,15 +982,19 @@ typedef struct ModifyTableState
 	int			mt_num_partitions;	/* Number of members in the following
 									 * arrays */
 	ResultRelInfo **mt_partitions;	/* Per partition result relation pointers */
-	TupleConversionMap **mt_partition_tupconv_maps;
-	/* Per partition tuple conversion map */
 	TupleTableSlot *mt_partition_tuple_slot;
+	TupleTableSlot *mt_root_tuple_slot;
 	struct TransitionCaptureState *mt_transition_capture;
 	/* controls transition table population for specified operation */
 	struct TransitionCaptureState *mt_oc_transition_capture;
 	/* controls transition table population for INSERT...ON CONFLICT UPDATE */
-	TupleConversionMap **mt_transition_tupconv_maps;
-	/* Per plan/partition tuple conversion */
+	TupleConversionMap **mt_parentchild_tupconv_maps;
+	/* Per partition map for tuple conversion from root to leaf */
+	TupleConversionMap **mt_childparent_tupconv_maps;
+	/* Per plan/partition map for tuple conversion from child to root */
+	bool		mt_is_tupconv_perpart;	/* Is the above map per-partition? */
+	int		*mt_subplan_partition_offsets;
+	/* Stores position of update result rels in leaf partitions */
 } ModifyTableState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index a127682..e80fef2 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -219,6 +219,7 @@ typedef struct ModifyTable
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		partKeyUpdated;	/* some part key in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	int			resultRelIndex; /* index of first resultRel in plan's list */
 	int			rootResultRelIndex; /* index of the partitioned table root */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 9e68e65..ee6ceb0 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1667,6 +1667,7 @@ typedef struct ModifyTablePath
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		partKeyUpdated;	/* some part key in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	List	   *subpaths;		/* Path(s) producing source data */
 	List	   *subroots;		/* per-target-table PlannerInfos */
@@ -2109,6 +2110,10 @@ typedef struct AppendRelInfo
  * The child_rels list must contain at least one element, because the parent
  * partitioned table is itself counted as a child.
  *
+ * all_part_cols contains all attribute numbers from the parent that are
+ * used as partitioning columns by the parent or some descendent which is
+ * itself partitioned.
+ *
  * These structs are kept in the PlannerInfo node's pcinfo_list.
  */
 typedef struct PartitionedChildRelInfo
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e9ed16a..39ce47d 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -238,6 +238,7 @@ extern ModifyTablePath *create_modifytable_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool partKeyUpdated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index a4fe961..50b76cf 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,36 +198,367 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+---------------------------
+-- UPDATE with row movement
+---------------------------
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If the partition key is updated, the row should be moved to the appropriate
+-- partition. Updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+create table part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+alter table range_parted attach partition part_b_20_b_30 for values from ('b', 20) to ('b', 30);
+create table part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+alter table range_parted attach partition part_b_10_b_20 for values from ('b', 10) to ('b', 20);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) partition by range (d);
+create table part_d_1_15 partition of part_c_100_200 for values from (1) to (15);
+create table part_d_15_20 partition of part_c_100_200 for values from (15) to (20);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+(6 rows)
+
+-- The subplans should be in partition bound order.
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+             QUERY PLAN              
+-------------------------------------
+ Update on range_parted
+   Update on part_a_1_a_10
+   Update on part_a_10_a_20
+   Update on part_b_1_b_10
+   Update on part_c_1_100
+   Update on part_d_1_15
+   Update on part_d_15_20
+   Update on part_b_20_b_30
+   ->  Seq Scan on part_a_1_a_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_a_10_a_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_1_b_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_1_100
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_1_15
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_15_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_20_b_30
+         Filter: (c > '97'::numeric)
+(22 rows)
+
+-- fail (row movement happens only within the partition subtree):
+update part_c_100_200 set c = c - 20, d = c where c = 105;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (null, 85, b, 15, 105).
+-- fail (no partition key update, so no attempt to move tuple, but "a = 'a'" violates partition constraint enforced by root partition)
+update part_b_10_b_20 set a = 'a';
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (null, 1, 96, 12, a).
+-- success; partition key update, no constraint violation
+update range_parted set d = d - 10 where d > 10;
+-- success; no partition key update, no constraint violation
+update range_parted set e = d;
+-- No row found:
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d | e 
+----------------+---+----+-----+---+---
+ part_a_10_a_20 | a | 10 | 200 | 1 | 1
+ part_a_1_a_10  | a |  1 |   1 | 1 | 1
+ part_d_1_15    | b | 12 | 116 | 1 | 1
+ part_d_1_15    | b | 13 | 117 | 2 | 2
+ part_d_1_15    | b | 15 | 125 | 6 | 6
+ part_d_1_15    | b | 17 | 125 | 9 | 9
+(6 rows)
+
+-- fail (row movement happens only within the partition subtree):
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_d_1_15" violates partition constraint
+DETAIL:  Failing row contains (2, 117, b, 7, 2).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_a_1_a_10 | a |  4 | 200 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+(6 rows)
+
+-- update the partition key using an updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120, 1, 1).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120, 1, 1).
+-- succeeds, row movement, check option passes
+update upview set a = 'b', b = 15 where b = 4;
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+ part_d_1_15   | b | 15 | 199 | 1 | 1
+(6 rows)
+
+-- cleanup
+drop view upview;
+-- RETURNING with whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+ range_parted  | a | b  | c  | d  | e 
+---------------+---+----+----+----+---
+ (b,15,95,16,) | b | 15 | 95 | 16 | 
+ (b,17,95,19,) | b | 17 | 95 | 19 | 
+(2 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  95 | 16 | 
+ part_c_1_100   | b | 17 |  95 | 19 | 
+(6 rows)
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,110,1,), (b,13,98,2,), (b,15,106,16,), (b,17,106,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 13 |  98 |  2 | 
+ part_d_15_20   | b | 15 | 106 | 16 | 
+ part_d_15_20   | b | 17 | 106 | 19 | 
+ part_d_1_15    | b | 12 | 110 |  1 | 
+(6 rows)
+
+:init_range_parted;
+-- Enabling OLD TABLE capture for both DELETE and UPDATE statement triggers
+-- should not cause deleted rows to be captured twice. Similarly for
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,146,1,), (b,13,147,2,), (b,15,155,16,), (b,17,155,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_15_20   | b | 15 | 155 | 16 | 
+ part_d_15_20   | b | 17 | 155 | 19 | 
+ part_d_1_15    | b | 12 | 146 |  1 | 
+ part_d_1_15    | b | 13 | 147 |  2 | 
+(6 rows)
+
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+-- Install BR triggers on child partitions, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d1_15 before update or insert on part_d_1_15
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d15_20 before update or insert on part_d_15_20
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 15 |  98 |  2 | 
+ part_d_15_20   | b | 17 | 106 | 16 | 
+ part_d_15_20   | b | 19 | 106 | 19 | 
+ part_d_1_15    | b | 15 | 110 |  1 | 
+(6 rows)
+
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_15_20   | b | 17 | 155 | 16 | 
+ part_d_15_20   | b | 19 | 155 | 19 | 
+ part_d_1_15    | b | 15 | 146 |  1 | 
+ part_d_1_15    | b | 16 | 147 |  2 | 
+(6 rows)
+
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_d1_15 ON part_d_1_15;
+drop trigger trig_d15_20 ON part_d_15_20;
+drop function func_parted_mod_b();
+-- statement triggers with update row movement
+---------------------------------------------------
+:init_range_parted;
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_1_15
+create trigger d1_delete_trig
+  after delete on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_update_trig
+  after update on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_insert_trig
+  after insert on part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+create trigger d15_delete_trig
+  after delete on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_update_trig
+  after update on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_insert_trig
+  after insert on part_d_15_20 for each statement execute procedure trigfunc();
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+NOTICE:  trigger = parent_update_trig fired on table range_parted during UPDATE
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 150 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  55 | 16 | 
+ part_c_1_100   | b | 17 |  55 | 19 | 
+(6 rows)
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger d1_delete_trig ON part_d_1_15;
+drop trigger d1_update_trig ON part_d_1_15;
+drop trigger d1_insert_trig ON part_d_1_15;
+drop trigger d15_delete_trig ON part_d_15_20;
+drop trigger d15_update_trig ON part_d_15_20;
+drop trigger d15_insert_trig ON part_d_15_20;
+drop table mintab;
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
-                                  Table "public.part_def"
- Column |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description 
---------+---------+-----------+----------+---------+----------+--------------+-------------
- a      | text    |           |          |         | extended |              | 
- b      | integer |           |          |         | plain    |              | 
+                                       Table "public.part_def"
+ Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description 
+--------+-------------------+-----------+----------+---------+----------+--------------+-------------
+ a      | text              |           |          |         | extended |              | 
+ b      | bigint            |           |          |         | plain    |              | 
+ c      | numeric           |           |          |         | main     |              | 
+ d      | integer           |           |          |         | plain    |              | 
+ e      | character varying |           |          |         | extended |              | 
 Partition of: range_parted DEFAULT
-Partition constraint: (NOT (((a = 'a'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'a'::text) AND (b >= 10) AND (b < 20)) OR ((a = 'b'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'b'::text) AND (b >= 10) AND (b < 20))))
+Partition constraint: (NOT (((a = 'a'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'a'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'b'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '20'::bigint) AND (b < '30'::bigint))))
 
 insert into range_parted values ('c', 9);
 -- ok
@@ -235,7 +566,55 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 ERROR:  new row for relation "part_def" violates partition constraint
-DETAIL:  Failing row contains (a, 9).
+DETAIL:  Failing row contains (a, 9, null, null, null).
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from non-default to default partition.
+-- Fail, the default partition is not under part_a_10_a_20.
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+ERROR:  new row for relation "part_a_10_a_20" violates partition constraint
+DETAIL:  Failing row contains (ad, 10, 200, 1, null).
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+:show_data;
+ partname | a  | b  |  c  | d  | e 
+----------+----+----+-----+----+---
+ part_def | ad |  1 |   1 |  1 | 
+ part_def | ad | 10 | 200 |  1 | 
+ part_def | bd | 12 |  96 |  1 | 
+ part_def | bd | 13 |  97 |  2 | 
+ part_def | bd | 15 | 105 | 16 | 
+ part_def | bd | 17 | 105 | 19 | 
+ part_def | d  |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
 create table list_parted (
 	a text,
 	b int
@@ -250,6 +629,111 @@ ERROR:  new row for relation "list_default" violates partition constraint
 DETAIL:  Failing row contains (a, 10).
 -- ok
 update list_default set a = 'x' where a = 'd';
+drop table list_parted;
+--------------
+-- UPDATE of partition key or non-partition columns, with different column
+-- ordering and triggers.
+--------------
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+ERROR:  new row for relation "sub_part2" violates partition constraint
+DETAIL:  Failing row contains (2, 10, 2).
+-- UPDATE that does not modify the partition key of the partitions chosen for the update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Triggers can cause UPDATE row movement if they modify the partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This changes a partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1.
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+drop trigger parted_mod_b ON sub_part1 ;
+-- If a BR DELETE trigger prevents the DELETE from happening, we should also
+-- skip the INSERT when that DELETE is part of UPDATE => DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+drop trigger trig_skip_delete ON sub_part1 ;
+-- UPDATE of the partition key with a FROM clause. If the join produces
+-- multiple output rows for the same row to be modified, we should
+-- tuple-route the row only once. There should not be any extra rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 |  1 | 70
+ list_part1 | 2 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop table non_parted;
+drop function func_parted_mod_b();
 -- create custom operator class and hash function, for the same reason
 -- explained in alter_table.sql
 create or replace function dummy_hashint4(a int4, seed int8) returns int8 as
@@ -271,9 +755,8 @@ insert into hpart4 values (3, 4);
 update hpart1 set a = 3, b=4 where a = 1;
 ERROR:  new row for relation "hpart1" violates partition constraint
 DETAIL:  Failing row contains (3, 4).
+-- ok: row movement
 update hash_parted set b = b - 1 where b = 1;
-ERROR:  new row for relation "hpart1" violates partition constraint
-DETAIL:  Failing row contains (1, 0).
 -- ok
 update hash_parted set b = b + 8 where b = 1;
 -- cleanup
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 0c70d64..a07f113 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,25 +107,229 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+
+---------------------------
+-- UPDATE with row movement
+---------------------------
+
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+create table part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+alter table range_parted attach partition part_b_20_b_30 for values from ('b', 20) to ('b', 30);
+create table part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+alter table range_parted attach partition part_b_10_b_20 for values from ('b', 10) to ('b', 20);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) partition by range (d);
+create table part_d_1_15 partition of part_c_100_200 for values from (1) to (15);
+create table part_d_15_20 partition of part_c_100_200 for values from (15) to (20);
+
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+
+create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_100_200 set c = c - 20, d = c where c = 105;
+-- fail (no partition key update, so no attempt to move tuple, but "a = 'a'" violates partition constraint enforced by root partition)
+update part_b_10_b_20 set a = 'a';
+-- success; partition key update, no constraint violation
+update range_parted set d = d - 10 where d > 10;
+-- success; no partition key update, no constraint violation
+update range_parted set e = d;
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+:show_data;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+:show_data;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+
+:show_data;
+
+-- cleanup
+drop view upview;
+
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+:show_data;
+
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+
+-- Enabling OLD TABLE capture for both DELETE as well as UPDATE stmt triggers
+-- should not cause DELETEd rows to be captured twice. Similar thing for
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d1_15 before update or insert on part_d_1_15
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d15_20 before update or insert on part_d_15_20
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_d1_15 ON part_d_1_15;
+drop trigger trig_d15_20 ON part_d_15_20;
+drop function func_parted_mod_b();
+
+
+-- statement triggers with update row movement
+---------------------------------------------------
+
+:init_range_parted;
+
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_d_1_15
+create trigger d1_delete_trig
+  after delete on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_update_trig
+  after update on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_insert_trig
+  after insert on part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+create trigger d15_delete_trig
+  after delete on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_update_trig
+  after update on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_insert_trig
+  after insert on part_d_15_20 for each statement execute procedure trigfunc();
+
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+:show_data;
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger d1_delete_trig ON part_d_1_15;
+drop trigger d1_update_trig ON part_d_1_15;
+drop trigger d1_insert_trig ON part_d_1_15;
+drop trigger d15_delete_trig ON part_d_15_20;
+drop trigger d15_update_trig ON part_d_15_20;
+drop trigger d15_insert_trig ON part_d_15_20;
+
+drop table mintab;
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
 
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
 insert into range_parted values ('c', 9);
@@ -134,6 +338,21 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 
+:show_data;
+
+-- Update row movement from non-default to default partition.
+-- Fail, default partition is not under part_a_10_a_20;
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+:show_data;
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+:show_data;
+
 create table list_parted (
 	a text,
 	b int
@@ -148,6 +367,84 @@ update list_default set a = 'a' where a = 'd';
 -- ok
 update list_default set a = 'x' where a = 'd';
 
+drop table list_parted;
+
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+
+
+-----------
+-- Triggers can cause UPDATE row movement if it modified partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger parted_mod_b ON sub_part1 ;
+
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger trig_skip_delete ON sub_part1 ;
+
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+drop table non_parted;
+
+drop function func_parted_mod_b();
+
 -- create custom operator class and hash function, for the same reason
 -- explained in alter_table.sql
 create or replace function dummy_hashint4(a int4, seed int8) returns int8 as
@@ -169,6 +466,7 @@ insert into hpart4 values (3, 4);
 
 -- fail
 update hpart1 set a = 3, b=4 where a = 1;
+-- ok : row movement
 update hash_parted set b = b - 1 where b = 1;
 -- ok
 update hash_parted set b = b + 8 where b = 1;
#198Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: David Rowley (#195)
1 attachment(s)
Re: [HACKERS] UPDATE of partition key

The following contains replies to David's remaining comments, i.e.
from #27 onwards, followed by replies to Alvaro's review comments.

Attached is the revised patch v25.

=====================

On 13 November 2017 at 18:25, David Rowley <david.rowley@2ndquadrant.com> wrote:

27. Comment does not explain how we're skipping checking the partition
constraint check in:

* We have already checked partition constraints above, so skip
* checking them here.

Maybe something like:

* We've already checked the partition constraint above, however, we
* must still ensure the tuple passes all other constraints, so we'll
* call ExecConstraints() and have it validate all remaining checks.

Done.

28. For table WITH OIDs, the OID should probably follow the new tuple
for partition-key-UPDATEs.

I understand that, as far as possible, we want to simulate the UPDATE
as if it were a normal table update. But for system columns, I think
we should avoid that, and instead let the system handle it the way it
already does (i.e. a new row in a table always gets a new OID).

29. ExecSetupChildParentMap gets called here for non-partitioned relations.
Maybe that's not the best function name? The function only seems to do
that when perleaf is True.

I didn't clearly understand this; particularly, which task were you
referring to when you said "the function only seems to do that"? The
function sets up the child-parent map even when perleaf=false. The
function name is chosen that way because the map is always a
child-to-root map, but the map array elements may be arranged in the
order of the per-partition array 'mtstate->mt_partitions[]', or in the
order of the per-subplan result rels 'mtstate->resultRelInfo[]'.

Is a leaf a partition of a partitioned table? It's not that clear the
meaning here.

A leaf partition is a child of a partitioned table that is not itself
partitioned any further.

I have added more comments for the function ExecSetupChildParentMap()
(both at the function header and inside). Please check and let me
know if you still have questions.
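To illustrate the kind of mapping ExecSetupChildParentMap() builds, here is a standalone sketch of the name-based attno translation that convert_tuples_by_name_map() performs underneath. This is a toy stand-in, not the actual PostgreSQL function (which operates on TupleDesc structures); the column orderings mirror the regression test's sub_part1(b, c, a) versus list_parted(a, b, c):

```c
#include <assert.h>
#include <string.h>

/*
 * Toy stand-in for convert_tuples_by_name_map(): for each child column,
 * find the root column with the same name.  Attnos are 1-based; 0 marks
 * "no match" (analogous to InvalidAttrNumber).
 */
static void
build_child_root_map(const char **child, int child_natts,
					 const char **root, int root_natts,
					 int *map)			/* map[i] = root attno of child col i+1 */
{
	for (int i = 0; i < child_natts; i++)
	{
		map[i] = 0;
		for (int j = 0; j < root_natts; j++)
		{
			if (strcmp(child[i], root[j]) == 0)
			{
				map[i] = j + 1;
				break;
			}
		}
	}
}
```

Whether such maps are stored in per-partition order or per-subplan order is then just a question of which array the caller indexes them by, which is what the perleaf flag distinguishes.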

30. The following chunk of code is giving me a headache trying to
verify which arrays are which size:

ExecSetupPartitionTupleRouting(rel,
mtstate->resultRelInfo,
(operation == CMD_UPDATE ? nplans : 0),
node->nominalRelation,
estate,
&partition_dispatch_info,
&partitions,
&partition_tupconv_maps,
&subplan_leaf_map,
&partition_tuple_slot,
&num_parted, &num_partitions);
mtstate->mt_partition_dispatch_info = partition_dispatch_info;
mtstate->mt_num_dispatch = num_parted;
mtstate->mt_partitions = partitions;
mtstate->mt_num_partitions = num_partitions;
mtstate->mt_parentchild_tupconv_maps = partition_tupconv_maps;
mtstate->mt_subplan_partition_offsets = subplan_leaf_map;
mtstate->mt_partition_tuple_slot = partition_tuple_slot;
mtstate->mt_root_tuple_slot = MakeTupleTableSlot();

I know this patch is not completely responsible for it, but you're not
making things any better.

Would it not be better to invent some PartitionTupleRouting struct and
make that struct a member of ModifyTableState and CopyState, then just
pass the pointer to that struct to ExecSetupPartitionTupleRouting()
and have it fill in the required details? I think the complexity of
this is already on the high end, I think you really need to do the
refactor before this gets any worse.

Ok. I am currently working on doing this change. So not yet included
in the attached patch. Will send yet another revised patch for this
change. (TODO)

31. The following code seems incorrect:

/*
* If this is an UPDATE and a BEFORE UPDATE trigger is present, we may
* need to do update tuple routing.
*/
if (resultRelInfo->ri_TrigDesc &&
resultRelInfo->ri_TrigDesc->trig_update_before_row &&
operation == CMD_UPDATE)
update_tuple_routing_needed = true;

Shouldn't this be setting update_tuple_routing_needed to false if
there are no before row update triggers? Otherwise, you're setting it
to true regardless of if there are any partition key columns being
UPDATEd. That would make the work you're doing in
inheritance_planner() to set part_cols_updated a waste of time.

The point of setting it to true regardless of whether the partition
key is updated is: even if the partition key is not explicitly
modified by the UPDATE, a before-row trigger can still update it
later. We can never know in advance whether the trigger actually
will. So if there are BR UPDATE triggers on the result rels of any of
the subplans, we *always* set up the tuple routing. This approach was
settled in the earlier discussions about trigger handling.

Also, this bit of code is a bit confused.

/* Decide whether we need to perform update tuple routing. */
if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
update_tuple_routing_needed = false;

/*
* Build state for tuple routing if it's an INSERT or if it's an UPDATE of
* partition key.
*/
if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
(operation == CMD_INSERT || update_tuple_routing_needed))

The first if test would not be required if you fixed the code where
you set update_tuple_routing_needed = true regardless if its a
partitioned table or not.

At the place where I set update_tuple_routing_needed to true
unconditionally, we don't yet have the relation open, so we don't know
whether it is a partitioned table. Hence we set it anyway, and later
revert it to false if the table turns out not to be partitioned after all.

So basically, you need to take the node->part_cols_updated from the
planner, if that's true then perform your test for before row update
triggers, set a bool to false if there are none, then proceed to setup
the partition tuple routing for partition table inserts or if your
bool is still true. Right?

I think if we look at "update_tuple_routing_needed" as meaning that
update tuple routing *may be* required, then the logic as-is makes
sense: set the variable whenever we see that update routing may be
required. The conditions for that are: either node->partKeyUpdated
is true, or there is a BR UPDATE trigger and the operation is UPDATE.
So set this variable under those conditions, and revert it back to
false later if it is found that the table is not a partitioned table
after all.

So I have retained the existing logic in the patch, but with some
additional comments to make it clear to the reader.
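The two-step decision described above can be condensed into a standalone sketch. The parameter names mirror the patch's variables (part_cols_updated, the BR UPDATE trigger check, the relkind test), but this is a simplification for illustration, not the executor code itself:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified model of the logic: routing *may* be needed if the planner
 * saw a partition-key update, or if a BR UPDATE trigger exists (it could
 * modify the key at run time).  Once the relation is open and its relkind
 * is known, the tentative answer is reverted to false for non-partitioned
 * tables.
 */
static bool
update_tuple_routing_needed(bool part_cols_updated,
							bool has_br_update_trigger,
							bool is_partitioned_table)
{
	bool		maybe_needed = part_cols_updated || has_br_update_trigger;

	return maybe_needed && is_partitioned_table;
}
```

The asymmetry is deliberate: the first two conditions are evaluated before the relation is opened, so the relkind check has to come afterwards as a correction rather than as part of the initial test.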

32. "WCO" abbreviation is not that common and might need to be expanded.

* Below are required as reference objects for mapping partition
* attno's in expressions such as WCO and RETURNING.

Searching for other comments which mention "WCO" they're all around
places that is easy to understand they mean "With Check Option", e.g.
next to a variable with a more descriptive name. That's not the case
here.

Ok. Changed WCO to WithCheckOptions.

33. "are anyway newly allocated", should "anyway" be "always"?
Otherwise, it does not make sense.

OK. Changed this :
* because all leaf partition result rels are anyway newly allocated.
to this (also removed 'all') :
* because leaf partition result rels are always newly allocated.

34. Comment added which mentions a member that does not exist.

* all_part_cols contains all attribute numbers from the parent that are
* used as partitioning columns by the parent or some descendent which is
* itself partitioned.
*

Oops. Left-overs from earlier patch. Removed the comment.

=====================

Below are replies to Alvaro's review comments:

On 14 November 2017 at 22:22, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

David Rowley wrote:

5. Triggers. Do we need a new "TG_" tag to allow trigger functions to
do something special when the DELETE/INSERT is a partition move? I
have audit tables in mind here it may appear as though a user
performed a DELETE when they actually performed an UPDATE giving
visibility of this to the trigger function will allow the application
to work around this.

+1 I think we do need a flag that can be inspected from the user
trigger function.

What I feel is: it's too early for such changes. I think we should
first get the core patch in, and then consider this request and any
further enhancements.

9. Also, this should likely be reworded:

* 'num_update_rri' : number of UPDATE per-subplan result rels. For INSERT,
* this is 0.

'num_update_rri' number of elements in 'update_rri' array or zero for INSERT.

Also:

/pgsql/source/master/src/backend/executor/execMain.c: In function 'ExecSetupPartitionTupleRouting':
/pgsql/source/master/src/backend/executor/execMain.c:3401:18: warning: 'leaf_part_arr' may be used uninitialized in this function [-Wmaybe-uninitialized]
leaf_part_rri = leaf_part_arr + i;
~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~

Right. I have now initialized "leaf_part_arr = NULL" at its declaration.
Actually leaf_part_arr is used only for inserts, but for the compiler's
sake we should add this initialization.

I think using num_update_rri==0 as a flag to indicate INSERT is strange.
I suggest passing an additional boolean --

I think adding another param would be redundant. To make the condition
more readable, I have introduced a new local variable:
bool is_update = (num_update_rri > 0);

or maybe just split the whole
function in two, one for updates and another for inserts, say
ExecSetupPartitionTupleRoutingForInsert() and
ExecSetupPartitionTupleRoutingForUpdate(). They seem to
share almost no code, and the current flow is hard to read; maybe just
add a common subroutine for the lower bottom of the loop.

So there are two common code sections. One is the initial code which
initializes various arrays and output params. The second is the latter
half of the for loop, which includes calls to heap_open(),
InitResultRelInfo(), convert_tuples_by_name(), CheckValidResultRel()
and others. So there is a lot of common code. We would need two
functions: one containing the initialization code, and another running
the latter half of the loop. Also, heap_open() and InitResultRelInfo()
need to be called only if partrel (which would need to be passed as a
function param) is NULL. Rather than that, I think this condition is
better placed inline in ExecSetupPartitionTupleRouting() for clarity.
I don't feel it's worth the shuffling: we would be extracting the code
into two functions only to avoid the "if num_update_rri" conditions.

That's why I feel having an "is_update" variable serves the purpose.
The hard-to-understand code, I presume, is the update part which tries
to reuse already-existing result rels, and that part would remain in
any case, albeit in a separate function.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

update-partition-key_v25.patch (application/octet-stream)
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index daba66c..a6e6160 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3005,6 +3005,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition where this row satisfies its partition constraint.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3297,9 +3302,22 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</command> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</command> causes a row to move from one
+       partition to another, there is a chance that another concurrent
+       <command>UPDATE</command> or <command>DELETE</command> misses this row.
+       Suppose session 1 is performing an <command>UPDATE</command> on a
+       partition key, and meanwhile a concurrent session 2, for which this
+       row is visible, performs an <command>UPDATE</command> or
+       <command>DELETE</command> operation on this row. Session 2 can
+       silently miss the row if the row is deleted from the partition due to
+       session 1's activity.  In such a case, session 2's
+       <command>UPDATE</command>/<command>DELETE</command>, being unaware of
+       the row movement, interprets that the row has just been deleted and
+       there is nothing to be done for this row. Whereas, in the usual case
+       where the table is not partitioned, or where there is no row
+       movement, session 2 would have identified the newly updated row and
+       carried out the <command>UPDATE</command>/<command>DELETE</command>
+       on this new row version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 0e99aa9..f525e16 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</command> and
+   <command>INSERT</command> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</command> or <command>DELETE</command> on the
+   same row may miss this row. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index b0e160a..479d4e2 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -154,6 +154,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to move
+    to another partition, it will be performed as a <command>DELETE</command>
+    from the original partition followed by <command>INSERT</command> into the
+    new partition. In this case, all row-level <literal>BEFORE</literal>
+    <command>UPDATE</command> triggers and all row-level
+    <literal>BEFORE</literal> <command>DELETE</command> triggers are fired on
+    the original partition. Then all row-level <literal>BEFORE</literal>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</literal>
+    triggers are concerned, <literal>AFTER</literal> <command>DELETE</command>
+    and <literal>AFTER</literal> <command>INSERT</command> triggers are
+    applied; but <literal>AFTER</literal> <command>UPDATE</command> triggers
+    are not applied because the <command>UPDATE</command> has been converted to
+    a <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 67d4c2a..da98106 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1440,7 +1440,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
 
 /*
  * map_partition_varattnos - maps varattno of any Vars in expr from the
- * parent attno to partition attno.
+ * attno's of 'from_rel' partition to the attno's of 'to_rel' partition.
+ * Either rel can be a leaf partition or a partitioned table.
  *
  * We must allow for cases where physical attnos of a partition can be
  * different from the parent's.
@@ -1453,8 +1454,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
  * are working on Lists, so it's less messy to do the casts internally.
  */
 List *
-map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row)
 {
 	bool		my_found_whole_row = false;
@@ -1463,14 +1464,14 @@ map_partition_varattnos(List *expr, int target_varno,
 	{
 		AttrNumber *part_attnos;
 
-		part_attnos = convert_tuples_by_name_map(RelationGetDescr(partrel),
-												 RelationGetDescr(parent),
+		part_attnos = convert_tuples_by_name_map(RelationGetDescr(to_rel),
+												 RelationGetDescr(from_rel),
 												 gettext_noop("could not convert row type"));
 		expr = (List *) map_variable_attnos((Node *) expr,
-											target_varno, 0,
+											fromrel_varno, 0,
 											part_attnos,
-											RelationGetDescr(parent)->natts,
-											RelationGetForm(partrel)->reltype,
+											RelationGetDescr(from_rel)->natts,
+											RelationGetForm(to_rel)->reltype,
 											&my_found_whole_row);
 	}
 
@@ -2578,6 +2579,79 @@ get_partition_for_tuple(Relation relation, Datum *values, bool *isnull)
 }
 
 /*
+ * pull_child_partition_columns
+ *
+ * For each column of rel which is in the partition key or which appears
+ * in an expression which is in the partition key, translate the attribute
+ * number of that column according to the given parent, and add the resulting
+ * column number to the 'partcols' bitmapset, offset as we frequently do by
+ * FirstLowInvalidHeapAttributeNumber.
+ */
+void
+pull_child_partition_columns(Relation rel,
+							 Relation parent,
+							 Bitmapset **partcols)
+{
+	PartitionKey key = RelationGetPartitionKey(rel);
+	int			partnatts = get_partition_natts(key);
+	List	   *partexprs = get_partition_exprs(key);
+	ListCell   *lc;
+	Bitmapset  *child_keycols = NULL;
+	int			i;
+	AttrNumber *map;
+	int			child_keycol = -1;
+
+	/*
+	 * First, compute the complete set of partition columns for this rel. For
+	 * compatibility with the API exposed by pull_varattnos, we offset the
+	 * column numbers by FirstLowInvalidHeapAttributeNumber.
+	 */
+	for (i = 0; i < partnatts; i++)
+	{
+		AttrNumber	partattno = get_partition_col_attnum(key, i);
+
+		if (partattno != 0)
+			child_keycols =
+				bms_add_member(child_keycols,
+							   partattno - FirstLowInvalidHeapAttributeNumber);
+	}
+	foreach(lc, partexprs)
+	{
+		Node	   *expr = (Node *) lfirst(lc);
+
+		pull_varattnos(expr, 1, &child_keycols);
+	}
+
+	/*
+	 * Next, work out how to convert from the attribute numbers for the child
+	 * to the attribute numbers for the parent.
+	 */
+	map =
+		convert_tuples_by_name_map(RelationGetDescr(parent),
+								   RelationGetDescr(rel),
+								   gettext_noop("could not convert row type"));
+
+	/*
+	 * For each child key column we have identified, translate to the
+	 * corresponding parent key column.  Entry 0 in the map array corresponds
+	 * to attribute number 1, which corresponds to a bitmapset entry for 1 -
+	 * FirstLowInvalidHeapAttributeNumber.
+	 */
+	while ((child_keycol = bms_next_member(child_keycols, child_keycol)) >= 0)
+	{
+		int			kc = child_keycol + FirstLowInvalidHeapAttributeNumber;
+
+		Assert(kc > 0 && kc <= RelationGetNumberOfAttributes(rel));
+		*partcols =
+			bms_add_member(*partcols,
+						   map[kc - 1] - FirstLowInvalidHeapAttributeNumber);
+	}
+
+	/* Release memory. */
+	pfree(map);
+}
+
+/*
  * qsort_partition_hbound_cmp
  *
  * We sort hash bounds by modulus, then by remainder.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index d6b235c..2854f21 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2479,11 +2479,14 @@ CopyFrom(CopyState cstate)
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(cstate->rel,
+									   NULL,
+									   0,
 									   1,
 									   estate,
 									   &partition_dispatch_info,
 									   &partitions,
 									   &partition_tupconv_maps,
+									   NULL,
 									   &partition_tuple_slot,
 									   &num_parted, &num_partitions);
 		cstate->partition_dispatch_info = partition_dispatch_info;
@@ -2749,7 +2752,7 @@ CopyFrom(CopyState cstate)
 
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr || check_partition_constr)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 92ae382..73ec872 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2854,8 +2854,13 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 	{
 		HeapTuple	trigtuple;
 
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
+		/*
+		 * Note: if the UPDATE is converted into a DELETE+INSERT as part of
+		 * update-partition-key operation, then this function is also called
+		 * separately for DELETE and INSERT to capture transition table rows.
+		 * In that case, either the old tuple or the new tuple can be NULL.
+		 */
+		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
 			trigtuple = GetTupleForTrigger(estate,
 										   NULL,
 										   relinfo,
@@ -5414,7 +5419,12 @@ AfterTriggerPendingOnRel(Oid relid)
  *	triggers actually need to be queued.  It is also called after each row,
  *	even if there are no triggers for that event, if there are any AFTER
  *	STATEMENT triggers for the statement which use transition tables, so that
- *	the transition tuplestores can be built.
+ *	the transition tuplestores can be built.  Furthermore, if transition
+ *	capture is happening for UPDATEd rows being moved to another partition
+ *	due to a partition-key change, then this function is called once when the
+ *	row is deleted (to capture the OLD row), and once when the row is
+ *	inserted into another partition (to capture the NEW row).  This is done
+ *	separately because the DELETE and INSERT happen on different tables.
  *
  *	Transition tuplestores are built now, rather than when events are pulled
  *	off of the queue because AFTER ROW triggers are allowed to select from the
@@ -5463,12 +5473,25 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		bool		update_new_table = transition_capture->tcs_update_new_table;
		bool		insert_new_table = transition_capture->tcs_insert_new_table;
 
-		if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_old_table))
+		/*
+		 * For INSERT events newtup should be non-NULL, for DELETE events
+		 * oldtup should be non-NULL, whereas for UPDATE events normally both
+		 * oldtup and newtup are non-NULL.  But for an UPDATE event fired to
+		 * capture transition tuples during UPDATE partition-key row movement,
+		 * oldtup is NULL when the event is for the row being inserted, and
+		 * newtup is NULL when the event is for the row being deleted.
+		 */
+		Assert(!(event == TRIGGER_EVENT_DELETE && delete_old_table &&
+				 oldtup == NULL));
+		Assert(!(event == TRIGGER_EVENT_INSERT && insert_new_table &&
+				 newtup == NULL));
+
+		if (oldtup != NULL &&
+			((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_old_table)))
 		{
 			Tuplestorestate *old_tuplestore;
 
-			Assert(oldtup != NULL);
 			old_tuplestore = transition_capture->tcs_private->old_tuplestore;
 
 			if (map != NULL)
@@ -5481,12 +5504,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			else
 				tuplestore_puttuple(old_tuplestore, oldtup);
 		}
-		if ((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_new_table))
+		if (newtup != NULL &&
+			((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_new_table)))
 		{
 			Tuplestorestate *new_tuplestore;
 
-			Assert(newtup != NULL);
 			new_tuplestore = transition_capture->tcs_private->new_tuplestore;
 
 			if (original_insert_tuple != NULL)
@@ -5502,11 +5525,17 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				tuplestore_puttuple(new_tuplestore, newtup);
 		}
 
-		/* If transition tables are the only reason we're here, return. */
+		/*
+		 * If transition tables are the only reason we're here, return.  As
+		 * mentioned above, we can also be here during update tuple routing in
+		 * the presence of transition tables, in which case this function is
+		 * called separately for oldtup and newtup, so either one can be NULL,
+		 * but not both.
+		 */
 		if (trigdesc == NULL ||
 			(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
 			(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+			(event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
 			return;
 	}
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index dbaa47f..5ec92d5 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1849,16 +1849,12 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  *
  * Exported in executor.h for outside use.
+ * Returns true if the tuple satisfies the partition constraint, else false.
  */
-void
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
-	Relation	rel = resultRelInfo->ri_RelationDesc;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	Bitmapset  *modifiedCols;
-	Bitmapset  *insertedCols;
-	Bitmapset  *updatedCols;
 	ExprContext *econtext;
 
 	/*
@@ -1886,52 +1882,69 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
 	 */
-	if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+	return ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext);
+}
+
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
+	/*
+	 * Need to first convert the tuple to the root partitioned table's row
+	 * type. For details, check similar comments in ExecConstraints().
+	 */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
+									 gettext_noop("could not convert row type"));
+		if (map != NULL)
 		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
-										 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecSetSlotDescriptor(slot, tupdesc);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+			tuple = do_convert_tuple(tuple, map);
+			ExecSetSlotDescriptor(slot, tupdesc);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("new row for relation \"%s\" violates partition constraint",
-						RelationGetRelationName(orig_rel)),
-				 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+			 errmsg("new row for relation \"%s\" violates partition constraint",
+					RelationGetRelationName(orig_rel)),
+			 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if
+ * requested, checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1939,7 +1952,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2055,8 +2069,9 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 }
 
 
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index d275cef..2ac7484 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -41,6 +41,13 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
  * ExecSetupPartitionTupleRouting - set up information needed during
  * tuple routing for partitioned tables
  *
+ * 'update_rri' contains the UPDATE per-subplan result rels. For the output
+ *		param 'partitions', we don't allocate new ResultRelInfo objects for
+ *		leaf partitions that are already present in 'update_rri'; those are
+ *		reused instead.
+ *
+ * 'num_update_rri' is the number of elements in the 'update_rri' array, or
+ *		zero in case of INSERT.
+ *
  * Output arguments:
  * 'pd' receives an array of PartitionDispatch objects with one entry for
  *		every partitioned table in the partition tree
@@ -64,11 +71,14 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
  */
 void
 ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionDispatch **pd,
 							   ResultRelInfo ***partitions,
 							   TupleConversionMap ***tup_conv_maps,
+							   int **subplan_leaf_map,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions)
 {
@@ -76,7 +86,9 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
+	ResultRelInfo *leaf_part_arr = NULL;
+	int			update_rri_index = 0;
+	bool		is_update = (num_update_rri > 0);
 
 	/*
 	 * Get the information about the partition tree after locking all the
@@ -85,11 +97,45 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
 	*pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
 	*num_partitions = list_length(leaf_parts);
+	if (subplan_leaf_map)
+		*subplan_leaf_map = NULL;
 	*partitions = (ResultRelInfo **) palloc(*num_partitions *
 											sizeof(ResultRelInfo *));
 	*tup_conv_maps = (TupleConversionMap **) palloc0(*num_partitions *
 													 sizeof(TupleConversionMap *));
 
+	if (is_update)
+	{
+		/*
+		 * For Updates, if the leaf partition is already present in the
+		 * per-subplan result rels, we re-use that rather than initialize a new
+		 * result rel. The per-subplan resultrels and the resultrels of the
+		 * leaf partitions are both in the same canonical order. So while going
+		 * through the leaf partition oids, we need to keep track of the next
+		 * per-subplan result rel to be looked for in the leaf partition
+		 * resultrels. So, set update_rri_index to the first per-subplan result
+		 * rel, and then shift it as we find them one by one while scanning the
+		 * leaf partition oids.
+		 */
+		update_rri_index = 0;
+
+		/*
+		 * Prepare for generating the mapping from subplan result rels to leaf
+		 * partition position.
+		 */
+		*subplan_leaf_map = palloc(num_update_rri * sizeof(int));
+	}
+	else
+	{
+		/*
+		 * For inserts, we need to create all new result rels, so avoid
+		 * repeated pallocs by allocating memory for all the result rels in
+		 * bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(*num_partitions *
+												  sizeof(ResultRelInfo));
+	}
+
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
 	 * given partition's rowtype.  It is attached to the caller-specified node
@@ -98,20 +144,67 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	 */
 	*partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = (ResultRelInfo *) palloc0(*num_partitions *
-											  sizeof(ResultRelInfo));
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (is_update)
+		{
+			/* Is this leaf partition present in the update resultrel? */
+			if (update_rri_index < num_update_rri &&
+				RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
+			{
+				leaf_part_rri = &update_rri[update_rri_index];
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required when we convert the partition's tuple to be
+				 * compatible with the root partitioned table's tuple
+				 * descriptor.  When generating the per-subplan UPDATE result
+				 * rels, this was not set.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+
+				/*
+				 * Save the position of this update rel in the leaf partitions
+				 * array
+				 */
+				(*subplan_leaf_map)[update_rri_index] = i;
+
+				update_rri_index++;
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = leaf_part_arr + i;
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in *partitions are eventually
-		 * closed by the caller.
+		 * If we didn't open the partition rel, it means we haven't initialized
+		 * the result rel either.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above, including the leaf
+			 * partitions.  Note that each of the newly opened relations in
+			 * *partitions is eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri,
+							  partrel,
+							  resultRTindex,
+							  rel,
+							  estate->es_instrument);
+		}
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
@@ -121,14 +214,10 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
 													 gettext_noop("could not convert row type"));
 
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  resultRTindex,
-						  rel,
-						  estate->es_instrument);
-
 		/*
-		 * Verify result relation is a valid target for INSERT.
+		 * Verify result relation is a valid target for insert operation. Even
+		 * for updates, we are doing this for tuple-routing, so again, we need
+		 * to check the validity for insert operation.
 		 */
 		CheckValidResultRel(leaf_part_rri, CMD_INSERT);
 
@@ -144,9 +233,15 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		(*partitions)[i] = leaf_part_rri++;
+		(*partitions)[i] = leaf_part_rri;
 		i++;
 	}
+
+	/*
+	 * For UPDATE, we should have found all the per-subplan resultrels in the
+	 * leaf partitions.
+	 */
+	Assert(!is_update || update_rri_index == num_update_rri);
 }
 
 /*
@@ -177,8 +272,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 	/* start with the root partitioned table */
 	parent = pd[0];
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index fb538c0..e11f7cb 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -401,7 +401,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -466,7 +466,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 503b89f..a0d8259 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -46,6 +46,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -63,7 +64,16 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 EState *estate,
 					 bool canSetTag,
 					 TupleTableSlot **returning);
-
+static void ExecSetupChildParentMap(ModifyTableState *mtstate,
+						ResultRelInfo *rootRelInfo,
+						int numResultRelInfos, bool perleaf);
+static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
+													int whichplan);
+static HeapTuple ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+										   TupleConversionMap *map,
+										   HeapTuple tuple,
+										   TupleTableSlot *new_slot,
+										   TupleTableSlot **p_my_slot);
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
  * target relation's rowtype
@@ -241,6 +251,38 @@ ExecCheckTIDVisible(EState *estate,
 	ReleaseBuffer(buffer);
 }
 
+/*
+ * ConvertPartitionTupleSlot -- convenience function for tuple conversion
+ * using 'map'.  The converted tuple is stored in 'new_slot', and *p_my_slot
+ * is updated to point to 'new_slot', which should typically be one of the
+ * dedicated partition tuple slots.
+ *
+ * Returns the converted tuple; if map is NULL, *p_my_slot is left unchanged
+ * and the original tuple is returned unmodified.
+ */
+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+						  TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot,
+						  TupleTableSlot **p_my_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor, as per converted tuple.
+	 */
+	*p_my_slot = new_slot;
+	Assert(new_slot != NULL);
+	ExecSetSlotDescriptor(new_slot, map->outdesc);
+	ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
 /* ----------------------------------------------------------------
  *		ExecInsert
  *
@@ -266,6 +308,9 @@ ExecInsert(ModifyTableState *mtstate,
 	Oid			newId;
 	List	   *recheckIndexes = NIL;
 	TupleTableSlot *result = NULL;
+	TransitionCaptureState *transition_capture;
+
+	transition_capture = mtstate->mt_transition_capture;
 
 	/*
 	 * get the heap tuple out of the tuple table slot, making sure we have a
@@ -282,17 +327,47 @@ ExecInsert(ModifyTableState *mtstate,
 	if (mtstate->mt_partition_dispatch_info)
 	{
 		int			leaf_part_index;
-		TupleConversionMap *map;
+		ResultRelInfo *rootResultRelInfo;
+
+		/*
+		 * For UPDATE, the resultRelInfo is not the root partitioned table, so
+		 * we should convert the tuple into the root's tuple descriptor, since
+		 * ExecFindPartition() starts the search from the root.  The tuple
+		 * conversion map list is in the order of mtstate->resultRelInfo[], so
+		 * to retrieve the one for this resultRel, we need to know the position
+		 * of the resultRel in mtstate->resultRelInfo[].
+		 */
+		if (mtstate->operation == CMD_UPDATE)
+		{
+			int			map_index = resultRelInfo - mtstate->resultRelInfo;
+			TupleConversionMap *tupconv_map;
+
+			Assert(mtstate->rootResultRelInfo != NULL);
+			rootResultRelInfo = mtstate->rootResultRelInfo;
+
+			/* resultRelInfo must be one of the per-subplan result rels. */
+			Assert(resultRelInfo >= mtstate->resultRelInfo &&
+				   resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1);
+
+			tupconv_map = tupconv_map_for_subplan(mtstate, map_index);
+			tuple = ConvertPartitionTupleSlot(mtstate,
+											  tupconv_map,
+											  tuple,
+											  mtstate->mt_root_tuple_slot,
+											  &slot);
+		}
+		else
+			rootResultRelInfo = resultRelInfo;
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
 		 * ExecFindPartition() does not return and errors out instead.
 		 * Otherwise, the returned value is to be used as an index into arrays
-		 * mt_partitions[] and mt_partition_tupconv_maps[] that will get us
+		 * mt_partitions[] and mt_parentchild_tupconv_maps[] that will get us
 		 * the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
-		leaf_part_index = ExecFindPartition(resultRelInfo,
+		leaf_part_index = ExecFindPartition(rootResultRelInfo,
 											mtstate->mt_partition_dispatch_info,
 											slot,
 											estate);
@@ -331,8 +406,10 @@ ExecInsert(ModifyTableState *mtstate,
 				 * back to tuplestore format.
 				 */
 				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+
+				Assert(mtstate->mt_is_tupconv_perpart == true);
 				mtstate->mt_transition_capture->tcs_map =
-					mtstate->mt_transition_tupconv_maps[leaf_part_index];
+					mtstate->mt_childparent_tupconv_maps[leaf_part_index];
 			}
 			else
 			{
@@ -345,30 +422,21 @@ ExecInsert(ModifyTableState *mtstate,
 			}
 		}
 		if (mtstate->mt_oc_transition_capture != NULL)
+		{
+			Assert(mtstate->mt_is_tupconv_perpart == true);
 			mtstate->mt_oc_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[leaf_part_index];
+				mtstate->mt_childparent_tupconv_maps[leaf_part_index];
+		}
 
 		/*
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = mtstate->mt_partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = mtstate->mt_partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(mtstate,
+										  mtstate->mt_parentchild_tupconv_maps[leaf_part_index],
+										  tuple,
+										  mtstate->mt_partition_tuple_slot,
+										  &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -486,7 +554,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 		/* Check the constraints of the tuple */
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -622,9 +690,32 @@ ExecInsert(ModifyTableState *mtstate,
 		setLastTid(&(tuple->t_self));
 	}
 
+	/*
+	 * If this INSERT is part of a partition-key-UPDATE and we are capturing
+	 * transition tables, put this row into the transition NEW TABLE.
+	 * (Similarly, the deleted row is added to the OLD TABLE.)  We need to do
+	 * this separately for DELETE and INSERT because they happen on different
+	 * tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_new_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo, NULL,
+							 NULL,
+							 tuple,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * Now that we have captured the NEW TABLE row, prevent the AR INSERT
+		 * trigger below from capturing it again.
+		 */
+		transition_capture = NULL;
+	}
+
 	/* AFTER ROW INSERT Triggers */
 	ExecARInsertTriggers(estate, resultRelInfo, tuple, recheckIndexes,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
 	list_free(recheckIndexes);
 
@@ -678,6 +769,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool *tuple_deleted,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -685,6 +778,12 @@ ExecDelete(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
+	TransitionCaptureState *transition_capture;
+
+	transition_capture = mtstate->mt_transition_capture;
+
+	if (tuple_deleted)
+		*tuple_deleted = false;
 
 	/*
 	 * get information on the (current) result relation
@@ -849,12 +948,39 @@ ldelete:;
 	if (canSetTag)
 		(estate->es_processed)++;
 
+	/* The delete actually happened, so let the caller know. */
+	if (tuple_deleted)
+		*tuple_deleted = true;
+
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition OLD TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_old_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo,
+							 tupleid,
+							 oldtuple,
+							 NULL,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * Now that we have captured the OLD TABLE row, prevent the AR DELETE
+		 * trigger below from capturing it again.
+		 */
+		transition_capture = NULL;
+	}
+
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -947,6 +1073,7 @@ ExecUpdate(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	TupleConversionMap *saved_tcs_map = NULL;
 
 	/*
 	 * abort the operation if not running transactions
@@ -1043,12 +1170,88 @@ lreplace:;
 								 resultRelInfo, slot, estate);
 
 		/*
+		 * If a partition check fails, try to move the row into the right
+		 * partition.
+		 */
+		if (resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate))
+		{
+			bool		tuple_deleted;
+			TupleTableSlot *ret_slot;
+
+			/*
+			 * When an UPDATE is run directly on a leaf partition, there is no
+			 * partition tuple routing set up. In that case, fail with the
+			 * partition constraint violation error.
+			 */
+			if (mtstate->mt_partition_dispatch_info == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for DELETE. We want to return rows
+			 * from INSERT.
+			 */
+			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
+					   &tuple_deleted, false, false);
+
+			/*
+			 * If the DELETE didn't happen for some reason (e.g. a trigger
+			 * prevented it, or it was already deleted by this command, or it
+			 * was concurrently deleted by another transaction), then we must
+			 * skip the INSERT as well; otherwise we would effectively insert
+			 * one extra new row.
+			 *
+			 * For a normal UPDATE, the case where the tuple has been the
+			 * subject of a concurrent UPDATE or DELETE would be handled by
+			 * the EvalPlanQual machinery, but for an UPDATE that we've
+			 * translated into a DELETE from this partition and an INSERT into
+			 * some other partition, that's not available, because CTID chains
+			 * can't span relation boundaries.  We mimic the semantics to a
+			 * limited extent by skipping the INSERT if the DELETE fails to
+			 * find a tuple. This ensures that two concurrent attempts to
+			 * UPDATE the same tuple at the same time can't turn one tuple
+			 * into two, and that an UPDATE of a just-deleted tuple can't
+			 * resurrect it.
+			 */
+			if (!tuple_deleted)
+				return NULL;
+
+			/*
+			 * UPDATEs set the transition capture map only when a new subplan
+			 * is chosen.  But for INSERTs, it is set for each row. So after
+			 * INSERT, we need to revert to the map created for UPDATE;
+			 * otherwise the next UPDATE will incorrectly use the one created
+			 * for INSERT.  So first save the one created for UPDATE.
+			 */
+			if (mtstate->mt_transition_capture)
+				saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
+
+			ret_slot = ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+
+			if (mtstate->mt_transition_capture)
+			{
+				/*
+				 * Now revert the transition capture map. See the above
+				 * comments.
+				 */
+				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+				mtstate->mt_transition_capture->tcs_map = saved_tcs_map;
+			}
+			return ret_slot;
+		}
+
+		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We've already checked the partition constraint above; however, we
+		 * must still ensure the tuple passes all other constraints, so we will
+		 * call ExecConstraints() and have it validate all remaining checks.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1476,7 +1679,6 @@ static void
 ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 {
 	ResultRelInfo *targetRelInfo = getASTriggerResultRelInfo(mtstate);
-	int			i;
 
 	/* Check for transition tables on the directly targeted relation. */
 	mtstate->mt_transition_capture =
@@ -1505,55 +1707,138 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 							 mtstate->mt_num_partitions :
 							 mtstate->mt_nplans);
 
+		ExecSetupChildParentMap(mtstate, targetRelInfo, numResultRelInfos,
+								(mtstate->mt_partition_dispatch_info != NULL));
+
 		/*
-		 * Build array of conversion maps from each child's TupleDesc to the
-		 * one used in the tuplestore.  The map pointers may be NULL when no
-		 * conversion is necessary, which is hopefully a common case for
-		 * partitions.
+		 * Install the conversion map for the first plan for UPDATE and DELETE
+		 * operations.  It will be advanced each time we switch to the next
+		 * plan.  (INSERT operations set it every time, so we need not update
+		 * mtstate->mt_oc_transition_capture here.)
 		 */
-		mtstate->mt_transition_tupconv_maps = (TupleConversionMap **)
-			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+		if (mtstate->mt_transition_capture && mtstate->operation != CMD_INSERT)
+			mtstate->mt_transition_capture->tcs_map =
+				tupconv_map_for_subplan(mtstate, 0);
+	}
+}
 
-		/* Choose the right set of partitions */
-		if (mtstate->mt_partition_dispatch_info != NULL)
-		{
-			/*
-			 * For tuple routing among partitions, we need TupleDescs based
-			 * on the partition routing table.
-			 */
-			ResultRelInfo **resultRelInfos = mtstate->mt_partitions;
+/*
+ * Initialize the child-to-root tuple conversion map array.
+ *
+ * This map array is required for two purposes:
+ * 1. For update-tuple-routing. We need to convert the tuple from the subplan
+ * result rel to the root partitioned table descriptor.
+ * 2. For capturing transition tuples from child partitions. For UPDATEs, we
+ * need to convert the tuple from the subplan result rel to the target table
+ * descriptor, and for INSERTs, we need to convert the inserted tuple from the
+ * leaf partition to the target table descriptor.
+ *
+ * The caller can request either a per-subplan map or per-leaf-partition map.
+ */
+static void
+ExecSetupChildParentMap(ModifyTableState *mtstate,
+						ResultRelInfo *rootRelInfo,
+						int numResultRelInfos, bool perleaf)
+{
+	TupleDesc	outdesc;
+	int			i;
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
-		}
-		else
-		{
-			/* Otherwise we need the ResultRelInfo for each subplan. */
-			ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+	/* First check if there is already one */
+	if (mtstate->mt_childparent_tupconv_maps)
+	{
+		/*
+		 * If a per-leaf map is requested, any existing map must already be
+		 * per-leaf. A per-subplan map cannot be accessed leaf-partition-wise,
+		 * whereas a per-leaf map can still be accessed subplan-wise through
+		 * mt_subplan_partition_offsets via tupconv_map_for_subplan().  So
+		 * callers that may need both kinds of access must ensure that the
+		 * first call to this function passes perleaf=true, so that the map
+		 * built is per-leaf rather than per-subplan.
+		 */
+		Assert(!(perleaf && !mtstate->mt_is_tupconv_perpart));
+		return;
+	}
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
+	/* Get tuple descriptor of the root partitioned table. */
+	outdesc = RelationGetDescr(rootRelInfo->ri_RelationDesc);
+
+	/*
+	 * Build array of conversion maps from each child's TupleDesc to the
+	 * one used in the tuplestore.  The map pointers may be NULL when no
+	 * conversion is necessary, which is hopefully a common case for
+	 * partitions.
+	 */
+	mtstate->mt_childparent_tupconv_maps = (TupleConversionMap **)
+		palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+
+	/* Choose the right set of partitions */
+	if (perleaf)
+	{
+		/*
+		 * For tuple routing among partitions, we need TupleDescs based
+		 * on the partition routing table.
+		 */
+		ResultRelInfo **resultRelInfos = mtstate->mt_partitions;
+
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_childparent_tupconv_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
+									   outdesc,
+									   gettext_noop("could not convert row type"));
 		}
 
 		/*
-		 * Install the conversion map for the first plan for UPDATE and DELETE
-		 * operations.  It will be advanced each time we switch to the next
-		 * plan.  (INSERT operations set it every time, so we need not update
-		 * mtstate->mt_oc_transition_capture here.)
+		 * Save the info that the tuple conversion map is per-leaf, not
+		 * per-subplan.
 		 */
-		if (mtstate->mt_transition_capture)
-			mtstate->mt_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[0];
+		mtstate->mt_is_tupconv_perpart = true;
+	}
+	else
+	{
+		/* Otherwise we need the ResultRelInfo for each subplan. */
+		ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_childparent_tupconv_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
+									   outdesc,
+									   gettext_noop("could not convert row type"));
+		}
+	}
+}
+
+/*
+ * For a given subplan index, get the tuple conversion map.
+ */
+static TupleConversionMap *
+tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
+{
+	Assert(mtstate->mt_childparent_tupconv_maps != NULL);
+
+	/*
+	 * If the tuple conversion map array is per-partition, we need to first get
+	 * the index into the partition array.
+	 */
+	if (mtstate->mt_is_tupconv_perpart)
+	{
+		int			leaf_index;
+
+		Assert(mtstate->mt_subplan_partition_offsets != NULL);
+		leaf_index = mtstate->mt_subplan_partition_offsets[whichplan];
+
+		Assert(leaf_index >= 0 && leaf_index < mtstate->mt_num_partitions);
+		return mtstate->mt_childparent_tupconv_maps[leaf_index];
+	}
+	else
+	{
+		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+		return mtstate->mt_childparent_tupconv_maps[whichplan];
 	}
 }
 
@@ -1660,15 +1945,13 @@ ExecModifyTable(PlanState *pstate)
 				/* Prepare to convert transition tuples from this child. */
 				if (node->mt_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
 					node->mt_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						tupconv_map_for_subplan(node, node->mt_whichplan);
 				}
 				if (node->mt_oc_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
 					node->mt_oc_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						tupconv_map_for_subplan(node, node->mt_whichplan);
 				}
 				continue;
 			}
@@ -1784,7 +2067,8 @@ ExecModifyTable(PlanState *pstate)
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1829,9 +2113,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ResultRelInfo *resultRelInfo;
 	TupleDesc	tupDesc;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	bool		update_tuple_routing_needed = node->partKeyUpdated;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -1904,6 +2191,16 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(resultRelInfo, mtstate->mt_onconflict != ONCONFLICT_NONE);
 
+		/*
+		 * If this is an UPDATE and a BEFORE UPDATE trigger is present, the
+		 * trigger itself might modify the partition-key values. So arrange for
+		 * tuple routing.
+		 */
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+			operation == CMD_UPDATE)
+			update_tuple_routing_needed = true;
+
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
 		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
@@ -1941,31 +2238,54 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/*
+	 * If it's not a partitioned table after all, UPDATE tuple routing should
+	 * not be attempted.
+	 */
+	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		update_tuple_routing_needed = false;
+
+	/*
+	 * Build state for tuple routing if it's an INSERT or if it's an UPDATE of
+	 * the partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || update_tuple_routing_needed))
 	{
 		PartitionDispatch *partition_dispatch_info;
 		ResultRelInfo **partitions;
 		TupleConversionMap **partition_tupconv_maps;
+		int *subplan_leaf_map;
 		TupleTableSlot *partition_tuple_slot;
 		int			num_parted,
 					num_partitions;
 
 		ExecSetupPartitionTupleRouting(rel,
+									   mtstate->resultRelInfo,
+									   (operation == CMD_UPDATE ? nplans : 0),
 									   node->nominalRelation,
 									   estate,
 									   &partition_dispatch_info,
 									   &partitions,
 									   &partition_tupconv_maps,
+									   &subplan_leaf_map,
 									   &partition_tuple_slot,
 									   &num_parted, &num_partitions);
 		mtstate->mt_partition_dispatch_info = partition_dispatch_info;
 		mtstate->mt_num_dispatch = num_parted;
 		mtstate->mt_partitions = partitions;
 		mtstate->mt_num_partitions = num_partitions;
-		mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
+		mtstate->mt_parentchild_tupconv_maps = partition_tupconv_maps;
+		mtstate->mt_subplan_partition_offsets = subplan_leaf_map;
 		mtstate->mt_partition_tuple_slot = partition_tuple_slot;
+		mtstate->mt_root_tuple_slot = MakeTupleTableSlot();
+
+		/*
+		 * These are needed as reference objects for mapping partition
+		 * attno's in expressions such as WithCheckOptions and RETURNING.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	}
 
 	/*
@@ -1976,6 +2296,18 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		ExecSetupTransitionCaptureState(mtstate, estate);
 
 	/*
+	 * Construct mapping from each of the per-subplan partition attnos to the
+	 * root attno.  This is required during update row movement, when the tuple
+	 * descriptor of a source partition does not match the root partitioned
+	 * table descriptor.  In such a case we need to convert tuples to the root
+	 * tuple descriptor, because the search for destination partition starts
+	 * from the root.  Skip this setup if it's not a partition key update.
+	 */
+	if (update_tuple_routing_needed)
+		ExecSetupChildParentMap(mtstate, getASTriggerResultRelInfo(mtstate),
+								mtstate->mt_nplans, false);
+
+	/*
 	 * Initialize any WITH CHECK OPTION constraints if needed.
 	 */
 	resultRelInfo = mtstate->resultRelInfo;
@@ -2005,26 +2337,29 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE row
+	 * movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *first_wcoList;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition. Whereas for UPDATE, there are as many WCOs as there are
+		 * plans. So in either case, use the WCO expression of the first
+		 * resultRelInfo as a reference to calculate attno's for the WCO
+		 * expression of each of the partitions. We make a copy of the WCO
+		 * qual for each partition. Note that, if there are SubPlans in there,
+		 * they all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
-			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
+		Assert(update_tuple_routing_needed ||
+			   (operation == CMD_INSERT &&
+				list_length(node->withCheckOptionLists) == 1 &&
+				mtstate->mt_nplans == 1));
+
+		first_wcoList = linitial(node->withCheckOptionLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
 			Relation	partrel;
@@ -2033,17 +2368,26 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			ListCell   *ll;
 
 			resultRelInfo = mtstate->mt_partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the update
+			 * result rels, that result rel would already have WithCheckOptions
+			 * initialized.
+			 */
+			if (resultRelInfo->ri_WithCheckOptions)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel, NULL);
+			mapped_wcoList = map_partition_varattnos(first_wcoList,
+													firstVarno,
+													partrel, firstResultRel,
+													NULL);
 			foreach(ll, mapped_wcoList)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
@@ -2060,7 +2404,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -2097,22 +2441,35 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < mtstate->mt_num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *rlist;
 
 			resultRelInfo = mtstate->mt_partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the update
+			 * result rels, that result rel would already have a returningList
+			 * built.
+			 */
+			if (resultRelInfo->ri_projectReturning)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel, NULL);
+			/*
+			 * Use the returning expression of the first resultRelInfo as a
+			 * reference to calculate attno's for the returning expression of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList,
+											firstVarno,
+											partrel, firstResultRel, NULL);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
@@ -2357,6 +2714,7 @@ void
 ExecEndModifyTable(ModifyTableState *node)
 {
 	int			i;
+	CmdType		operation = node->operation;
 
 	/*
 	 * Allow any FDWs to shut down
@@ -2391,11 +2749,23 @@ ExecEndModifyTable(ModifyTableState *node)
 	{
 		ResultRelInfo *resultRelInfo = node->mt_partitions[i];
 
+		/*
+		 * If this result rel is one of the subplan result rels, let
+		 * ExecEndPlan() close it. For INSERTs, this does not apply because
+		 * leaf partition result rels are always newly allocated.
+		 */
+		if (operation == CMD_UPDATE &&
+			resultRelInfo >= node->resultRelInfo &&
+			resultRelInfo < node->resultRelInfo + node->mt_nplans)
+			continue;
+
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
 
-	/* Release the standalone partition tuple descriptor, if any */
+	/* Release the standalone partition tuple descriptors, if any */
+	if (node->mt_root_tuple_slot)
+		ExecDropSingleTupleTableSlot(node->mt_root_tuple_slot);
 	if (node->mt_partition_tuple_slot)
 		ExecDropSingleTupleTableSlot(node->mt_partition_tuple_slot);
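The tupconv_map_for_subplan() indirection in the hunk above can be illustrated with a minimal standalone sketch (hypothetical names; the real code uses TupleConversionMap and the mtstate fields): when the child-to-parent map array has been built per leaf partition, a subplan index must first be translated through the subplan-to-leaf offset array before indexing the maps.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for TupleConversionMap; only identity matters for the sketch. */
typedef struct DummyMap { int id; } DummyMap;

/*
 * Sketch of tupconv_map_for_subplan(): if the map array is per leaf
 * partition, translate the subplan index through the subplan-to-leaf
 * offsets first; otherwise index the array directly.
 */
static DummyMap *
map_for_subplan(DummyMap **maps, const int *subplan_leaf_offsets,
				int is_per_partition, int whichplan)
{
	if (is_per_partition)
		return maps[subplan_leaf_offsets[whichplan]];
	return maps[whichplan];
}
```

This mirrors why the patch insists the first ExecSetupChildParentMap() call be per-leaf when both access patterns are needed: a per-leaf array can serve subplan lookups via the offsets, but not vice versa.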
 
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index d9ff8a7..3e24d42 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -204,6 +204,7 @@ _copyModifyTable(const ModifyTable *from)
 	COPY_SCALAR_FIELD(canSetTag);
 	COPY_SCALAR_FIELD(nominalRelation);
 	COPY_NODE_FIELD(partitioned_rels);
+	COPY_SCALAR_FIELD(partKeyUpdated);
 	COPY_NODE_FIELD(resultRelations);
 	COPY_SCALAR_FIELD(resultRelIndex);
 	COPY_SCALAR_FIELD(rootResultRelIndex);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index c97ee24..932c1e7 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -372,6 +372,7 @@ _outModifyTable(StringInfo str, const ModifyTable *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(partKeyUpdated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_INT_FIELD(resultRelIndex);
 	WRITE_INT_FIELD(rootResultRelIndex);
@@ -2103,6 +2104,7 @@ _outModifyTablePath(StringInfo str, const ModifyTablePath *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(partKeyUpdated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_NODE_FIELD(subroots);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 7eb67fc0..9542b94 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1568,6 +1568,7 @@ _readModifyTable(void)
 	READ_BOOL_FIELD(canSetTag);
 	READ_UINT_FIELD(nominalRelation);
 	READ_NODE_FIELD(partitioned_rels);
+	READ_BOOL_FIELD(partKeyUpdated);
 	READ_NODE_FIELD(resultRelations);
 	READ_INT_FIELD(resultRelIndex);
 	READ_INT_FIELD(rootResultRelIndex);
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index d445477..549821e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -278,6 +278,7 @@ static ProjectSet *make_project_set(List *tlist, Plan *subplan);
 static ModifyTable *make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool partKeyUpdated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
@@ -2371,6 +2372,7 @@ create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 							best_path->canSetTag,
 							best_path->nominalRelation,
 							best_path->partitioned_rels,
+							best_path->partKeyUpdated,
 							best_path->resultRelations,
 							subplans,
 							best_path->withCheckOptionLists,
@@ -6428,6 +6430,7 @@ static ModifyTable *
 make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool partKeyUpdated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam)
@@ -6454,6 +6457,7 @@ make_modifytable(PlannerInfo *root,
 	node->canSetTag = canSetTag;
 	node->nominalRelation = nominalRelation;
 	node->partitioned_rels = partitioned_rels;
+	node->partKeyUpdated = partKeyUpdated;
 	node->resultRelations = resultRelations;
 	node->resultRelIndex = -1;	/* will be set correctly in setrefs.c */
 	node->rootResultRelIndex = -1;	/* will be set correctly in setrefs.c */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index f6b8bbf..d0fa9ed 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -112,6 +112,10 @@ typedef struct
 /* Local functions */
 static Node *preprocess_expression(PlannerInfo *root, Node *expr, int kind);
 static void preprocess_qual_conditions(PlannerInfo *root, Node *jtnode);
+static void get_all_partition_cols(List *rtable,
+					   Index root_rti,
+					   List *partitioned_rels,
+					   Bitmapset **all_part_cols);
 static void inheritance_planner(PlannerInfo *root);
 static void grouping_planner(PlannerInfo *root, bool inheritance_update,
 				 double tuple_fraction);
@@ -1057,6 +1061,40 @@ preprocess_phv_expression(PlannerInfo *root, Expr *expr)
 }
 
 /*
+ * get_all_partition_cols
+ *	  Get attribute numbers of all partition key columns of all the partitioned
+ *    tables.
+ *
+ * The attribute numbers of the child partitions are converted to those of the
+ * root partitioned table.
+ */
+static void
+get_all_partition_cols(List *rtable,
+					   Index root_rti,
+					   List *partitioned_rels,
+					   Bitmapset **all_part_cols)
+{
+	ListCell   *lc;
+	Oid			root_relid = getrelid(root_rti, rtable);
+	Relation	root_rel;
+
+	/* The caller must have already locked all the partitioned tables. */
+	root_rel = heap_open(root_relid, NoLock);
+	*all_part_cols = NULL;
+	foreach(lc, partitioned_rels)
+	{
+		Index		rti = lfirst_int(lc);
+		Oid			relid = getrelid(rti, rtable);
+		Relation	part_rel = heap_open(relid, NoLock);
+
+		pull_child_partition_columns(part_rel, root_rel, all_part_cols);
+		heap_close(part_rel, NoLock);
+	}
+
+	heap_close(root_rel, NoLock);
+}
+
+/*
  * inheritance_planner
  *	  Generate Paths in the case where the result relation is an
  *	  inheritance set.
@@ -1101,6 +1139,7 @@ inheritance_planner(PlannerInfo *root)
 	Query	   *parent_parse;
 	Bitmapset  *parent_relids = bms_make_singleton(top_parentRTindex);
 	PlannerInfo **parent_roots = NULL;
+	bool		partColsUpdated = false;
 
 	Assert(parse->commandType != CMD_INSERT);
 
@@ -1171,10 +1210,23 @@ inheritance_planner(PlannerInfo *root)
 	parent_rte = rt_fetch(top_parentRTindex, root->parse->rtable);
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
+		Bitmapset	*all_part_cols = NULL;
+
 		nominalRelation = top_parentRTindex;
 		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
+
+		/*
+		 * Retrieve the partition key columns of all the partitioned tables,
+		 * so as to check whether any of the columns being updated is a
+		 * partition key of any of the partitioned tables.
+		 */
+		get_all_partition_cols(root->parse->rtable, top_parentRTindex,
+							   partitioned_rels, &all_part_cols);
+
+		if (bms_overlap(all_part_cols, parent_rte->updatedCols))
+			partColsUpdated = true;
 	}
 
 	/*
@@ -1512,6 +1564,7 @@ inheritance_planner(PlannerInfo *root)
 									 parse->canSetTag,
 									 nominalRelation,
 									 partitioned_rels,
+									 partColsUpdated,
 									 resultRelations,
 									 subpaths,
 									 subroots,
@@ -2129,6 +2182,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 										parse->canSetTag,
 										parse->resultRelation,
 										NIL,
+										false,
 										list_make1_int(parse->resultRelation),
 										list_make1(path),
 										list_make1(root),
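The planner-side decision above reduces to a set-overlap test: union the partition key columns of every partitioned table in the tree (translated to root attnos), then check whether the UPDATE's target-column set intersects that union. A hypothetical standalone sketch, with plain bitmasks standing in for Bitmapset:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Union the per-table partition key column sets, as get_all_partition_cols()
 * does via pull_child_partition_columns().
 */
static unsigned
all_partition_cols(const unsigned *per_table_key_cols, int ntables)
{
	unsigned	all = 0;
	int			i;

	for (i = 0; i < ntables; i++)
		all |= per_table_key_cols[i];
	return all;
}

/* The bms_overlap() test that sets partColsUpdated. */
static bool
part_cols_updated(unsigned all_part_cols, unsigned updated_cols)
{
	return (all_part_cols & updated_cols) != 0;
}
```

The point of unioning across the whole tree is that an intermediate partitioned table may be keyed on a column the root is not, so overlap with any level's key forces update tuple routing.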
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 68dee0f..0ce5339 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3207,6 +3207,8 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partitioned_rels' is an integer list of RT indexes of non-leaf tables in
  *		the partition tree, if this is an UPDATE/DELETE to a partitioned table.
  *		Otherwise NIL.
+ * 'partKeyUpdated' is true if any partitioning columns are being updated,
+ *		either from the named relation or a descendant partitioned table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
  * 'subpaths' is a list of Path(s) producing source data (one per rel)
  * 'subroots' is a list of PlannerInfo structs (one per rel)
@@ -3220,6 +3222,7 @@ ModifyTablePath *
 create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool partKeyUpdated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
@@ -3287,6 +3290,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->canSetTag = canSetTag;
 	pathnode->nominalRelation = nominalRelation;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
+	pathnode->partKeyUpdated = partKeyUpdated;
 	pathnode->resultRelations = resultRelations;
 	pathnode->subpaths = subpaths;
 	pathnode->subroots = subroots;
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 295e9d2..0e5922d 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -54,12 +54,14 @@ extern void check_new_partition_bound(char *relname, Relation parent,
 extern Oid	get_partition_parent(Oid relid);
 extern List *get_qual_from_partbound(Relation rel, Relation parent,
 						PartitionBoundSpec *spec);
-extern List *map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+extern List *map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
-
+extern void pull_child_partition_columns(Relation rel,
+							 Relation parent,
+							 Bitmapset **partcols);
 extern Oid	get_default_oid_from_partdesc(PartitionDesc partdesc);
 extern Oid	get_default_partition_oid(Oid parentId);
 extern void update_default_partition_oid(Oid parentId, Oid defaultPartId);
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 64e5aab..41be2cf 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -50,11 +50,14 @@ typedef struct PartitionDispatchData
 typedef struct PartitionDispatchData *PartitionDispatch;
 
 extern void ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionDispatch **pd,
 							   ResultRelInfo ***partitions,
 							   TupleConversionMap ***tup_conv_maps,
+							   int **subplan_leaf_map,
 							   TupleTableSlot **partition_tuple_slot,
 							   int *num_parted, int *num_partitions);
 extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index bee4ebf..0a2e76e 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,9 +187,12 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
-extern void ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
 				   TupleTableSlot *slot, EState *estate);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e05bc04..d2e8060 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -982,15 +982,19 @@ typedef struct ModifyTableState
 	int			mt_num_partitions;	/* Number of members in the following
 									 * arrays */
 	ResultRelInfo **mt_partitions;	/* Per partition result relation pointers */
-	TupleConversionMap **mt_partition_tupconv_maps;
-	/* Per partition tuple conversion map */
 	TupleTableSlot *mt_partition_tuple_slot;
+	TupleTableSlot *mt_root_tuple_slot;
 	struct TransitionCaptureState *mt_transition_capture;
 	/* controls transition table population for specified operation */
 	struct TransitionCaptureState *mt_oc_transition_capture;
 	/* controls transition table population for INSERT...ON CONFLICT UPDATE */
-	TupleConversionMap **mt_transition_tupconv_maps;
-	/* Per plan/partition tuple conversion */
+	TupleConversionMap **mt_parentchild_tupconv_maps;
+	/* Per partition map for tuple conversion from root to leaf */
+	TupleConversionMap **mt_childparent_tupconv_maps;
+	/* Per plan/partition map for tuple conversion from child to root */
+	bool		mt_is_tupconv_perpart;	/* Is the above map per-partition? */
+	int		*mt_subplan_partition_offsets;
+	/* Stores position of update result rels in leaf partitions */
 } ModifyTableState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 9b38d44..b36dafc 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -219,6 +219,7 @@ typedef struct ModifyTable
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		partKeyUpdated;	/* some part key in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	int			resultRelIndex; /* index of first resultRel in plan's list */
 	int			rootResultRelIndex; /* index of the partitioned table root */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 9e68e65..d7687d3 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1667,6 +1667,7 @@ typedef struct ModifyTablePath
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		partKeyUpdated;	/* some part key in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	List	   *subpaths;		/* Path(s) producing source data */
 	List	   *subroots;		/* per-target-table PlannerInfos */
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e9ed16a..39ce47d 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -238,6 +238,7 @@ extern ModifyTablePath *create_modifytable_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool partKeyUpdated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index a4fe961..50b76cf 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,36 +198,367 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+---------------------------
+-- UPDATE with row movement
+---------------------------
+-- An UPDATE of a partition should check the partition bound constraint for the
+-- new tuple. If the partition key is updated, the row should be moved to the
+-- appropriate partition. Updatable views using partitions should enforce the
+-- check options for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+create table part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+alter table range_parted attach partition part_b_20_b_30 for values from ('b', 20) to ('b', 30);
+create table part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+alter table range_parted attach partition part_b_10_b_20 for values from ('b', 10) to ('b', 20);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+-- This tests a partition-key UPDATE on a partitioned table that does not yet have any child partitions.
+update part_b_10_b_20 set b = b - 6;
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) partition by range (d);
+create table part_d_1_15 partition of part_c_100_200 for values from (1) to (15);
+create table part_d_15_20 partition of part_c_100_200 for values from (15) to (20);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+(6 rows)
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+             QUERY PLAN              
+-------------------------------------
+ Update on range_parted
+   Update on part_a_1_a_10
+   Update on part_a_10_a_20
+   Update on part_b_1_b_10
+   Update on part_c_1_100
+   Update on part_d_1_15
+   Update on part_d_15_20
+   Update on part_b_20_b_30
+   ->  Seq Scan on part_a_1_a_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_a_10_a_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_1_b_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_1_100
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_1_15
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_15_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_20_b_30
+         Filter: (c > '97'::numeric)
+(22 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_100_200 set c = c - 20, d = c where c = 105;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (null, 85, b, 15, 105).
+-- fail (no partition key update, so no attempt to move tuple, but "a = 'a'" violates partition constraint enforced by root partition)
+update part_b_10_b_20 set a = 'a';
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (null, 1, 96, 12, a).
+-- success; partition key update, no constraint violation
+update range_parted set d = d - 10 where d > 10;
+-- success; no partition key update, no constraint violation
+update range_parted set e = d;
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d | e 
+----------------+---+----+-----+---+---
+ part_a_10_a_20 | a | 10 | 200 | 1 | 1
+ part_a_1_a_10  | a |  1 |   1 | 1 | 1
+ part_d_1_15    | b | 12 | 116 | 1 | 1
+ part_d_1_15    | b | 13 | 117 | 2 | 2
+ part_d_1_15    | b | 15 | 125 | 6 | 6
+ part_d_1_15    | b | 17 | 125 | 9 | 9
+(6 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_d_1_15" violates partition constraint
+DETAIL:  Failing row contains (2, 117, b, 7, 2).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_a_1_a_10 | a |  4 | 200 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+(6 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120, 1, 1).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120, 1, 1).
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+ part_d_1_15   | b | 15 | 199 | 1 | 1
+(6 rows)
+
+-- cleanup
+drop view upview;
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+ range_parted  | a | b  | c  | d  | e 
+---------------+---+----+----+----+---
+ (b,15,95,16,) | b | 15 | 95 | 16 | 
+ (b,17,95,19,) | b | 17 | 95 | 19 | 
+(2 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  95 | 16 | 
+ part_c_1_100   | b | 17 |  95 | 19 | 
+(6 rows)
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,110,1,), (b,13,98,2,), (b,15,106,16,), (b,17,106,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 13 |  98 |  2 | 
+ part_d_15_20   | b | 15 | 106 | 16 | 
+ part_d_15_20   | b | 17 | 106 | 19 | 
+ part_d_1_15    | b | 12 | 110 |  1 | 
+(6 rows)
+
+:init_range_parted;
+-- Enabling OLD TABLE capture for both DELETE as well as UPDATE stmt triggers
+-- should not cause DELETEd rows to be captured twice. Similar thing for
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,146,1,), (b,13,147,2,), (b,15,155,16,), (b,17,155,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_15_20   | b | 15 | 155 | 16 | 
+ part_d_15_20   | b | 17 | 155 | 19 | 
+ part_d_1_15    | b | 12 | 146 |  1 | 
+ part_d_1_15    | b | 13 | 147 |  2 | 
+(6 rows)
+
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d1_15 before update or insert on part_d_1_15
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d15_20 before update or insert on part_d_15_20
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 15 |  98 |  2 | 
+ part_d_15_20   | b | 17 | 106 | 16 | 
+ part_d_15_20   | b | 19 | 106 | 19 | 
+ part_d_1_15    | b | 15 | 110 |  1 | 
+(6 rows)
+
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_15_20   | b | 17 | 155 | 16 | 
+ part_d_15_20   | b | 19 | 155 | 19 | 
+ part_d_1_15    | b | 15 | 146 |  1 | 
+ part_d_1_15    | b | 16 | 147 |  2 | 
+(6 rows)
+
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_d1_15 ON part_d_1_15;
+drop trigger trig_d15_20 ON part_d_15_20;
+drop function func_parted_mod_b();
+-- statement triggers with update row movement
+---------------------------------------------------
+:init_range_parted;
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_1_15
+create trigger d1_delete_trig
+  after delete on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_update_trig
+  after update on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_insert_trig
+  after insert on part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+create trigger d15_delete_trig
+  after delete on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_update_trig
+  after update on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_insert_trig
+  after insert on part_d_15_20 for each statement execute procedure trigfunc();
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+NOTICE:  trigger = parent_update_trig fired on table range_parted during UPDATE
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 150 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  55 | 16 | 
+ part_c_1_100   | b | 17 |  55 | 19 | 
+(6 rows)
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger d1_delete_trig ON part_d_1_15;
+drop trigger d1_update_trig ON part_d_1_15;
+drop trigger d1_insert_trig ON part_d_1_15;
+drop trigger d15_delete_trig ON part_d_15_20;
+drop trigger d15_update_trig ON part_d_15_20;
+drop trigger d15_insert_trig ON part_d_15_20;
+drop table mintab;
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
-                                  Table "public.part_def"
- Column |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description 
---------+---------+-----------+----------+---------+----------+--------------+-------------
- a      | text    |           |          |         | extended |              | 
- b      | integer |           |          |         | plain    |              | 
+                                       Table "public.part_def"
+ Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description 
+--------+-------------------+-----------+----------+---------+----------+--------------+-------------
+ a      | text              |           |          |         | extended |              | 
+ b      | bigint            |           |          |         | plain    |              | 
+ c      | numeric           |           |          |         | main     |              | 
+ d      | integer           |           |          |         | plain    |              | 
+ e      | character varying |           |          |         | extended |              | 
 Partition of: range_parted DEFAULT
-Partition constraint: (NOT (((a = 'a'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'a'::text) AND (b >= 10) AND (b < 20)) OR ((a = 'b'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'b'::text) AND (b >= 10) AND (b < 20))))
+Partition constraint: (NOT (((a = 'a'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'a'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'b'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '20'::bigint) AND (b < '30'::bigint))))
 
 insert into range_parted values ('c', 9);
 -- ok
@@ -235,7 +566,55 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 ERROR:  new row for relation "part_def" violates partition constraint
-DETAIL:  Failing row contains (a, 9).
+DETAIL:  Failing row contains (a, 9, null, null, null).
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from non-default to default partition.
+-- Fail, default partition is not under part_a_10_a_20;
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+ERROR:  new row for relation "part_a_10_a_20" violates partition constraint
+DETAIL:  Failing row contains (ad, 10, 200, 1, null).
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+:show_data;
+ partname | a  | b  |  c  | d  | e 
+----------+----+----+-----+----+---
+ part_def | ad |  1 |   1 |  1 | 
+ part_def | ad | 10 | 200 |  1 | 
+ part_def | bd | 12 |  96 |  1 | 
+ part_def | bd | 13 |  97 |  2 | 
+ part_def | bd | 15 | 105 | 16 | 
+ part_def | bd | 17 | 105 | 19 | 
+ part_def | d  |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
 create table list_parted (
 	a text,
 	b int
@@ -250,6 +629,111 @@ ERROR:  new row for relation "list_default" violates partition constraint
 DETAIL:  Failing row contains (a, 10).
 -- ok
 update list_default set a = 'x' where a = 'd';
+drop table list_parted;
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+ERROR:  new row for relation "sub_part2" violates partition constraint
+DETAIL:  Failing row contains (2, 10, 2).
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Triggers can cause UPDATE row movement if it modified partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+drop trigger parted_mod_b ON sub_part1 ;
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+drop trigger trig_skip_delete ON sub_part1 ;
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 |  1 | 70
+ list_part1 | 2 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop table non_parted;
+drop function func_parted_mod_b();
 -- create custom operator class and hash function, for the same reason
 -- explained in alter_table.sql
 create or replace function dummy_hashint4(a int4, seed int8) returns int8 as
@@ -271,9 +755,8 @@ insert into hpart4 values (3, 4);
 update hpart1 set a = 3, b=4 where a = 1;
 ERROR:  new row for relation "hpart1" violates partition constraint
 DETAIL:  Failing row contains (3, 4).
+-- ok : row movement
 update hash_parted set b = b - 1 where b = 1;
-ERROR:  new row for relation "hpart1" violates partition constraint
-DETAIL:  Failing row contains (1, 0).
 -- ok
 update hash_parted set b = b + 8 where b = 1;
 -- cleanup
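For readers skimming the regression diffs, the behavior under test boils down to a minimal sketch (hypothetical table names `t`, `t1`, `t2`; this only works on a server built with this patch, since an unpatched server raises a partition-constraint error instead of moving the row):

```sql
-- Minimal sketch of update row movement (requires a server with this patch).
create table t (a int) partition by range (a);
create table t1 partition of t for values from (0) to (10);
create table t2 partition of t for values from (10) to (20);

insert into t values (5);
update t set a = 15;  -- new tuple no longer fits t1: deleted from t1, inserted into t2

select tableoid::regclass as partname, a from t;
--  partname | a
-- ----------+----
--  t2       | 15
```

The `tableoid` check is the same idiom the tests' `:show_data` macro uses to make the row's new physical location visible.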
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 0c70d64..a07f113 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,25 +107,229 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+
+---------------------------
+-- UPDATE with row movement
+---------------------------
+
+-- An UPDATE of a partition should check the partition bound constraint for the
+-- new tuple. If the partition key is updated, the row should be moved to the
+-- appropriate partition. Updatable views using partitions should enforce the
+-- check options for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+create table part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+alter table range_parted attach partition part_b_20_b_30 for values from ('b', 20) to ('b', 30);
+create table part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+alter table range_parted attach partition part_b_10_b_20 for values from ('b', 10) to ('b', 20);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) partition by range (d);
+create table part_d_1_15 partition of part_c_100_200 for values from (1) to (15);
+create table part_d_15_20 partition of part_c_100_200 for values from (15) to (20);
+
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+
+create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+
+-- The order of subplans should match the partition bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+
+-- fail (row movement happens only within the partition subtree):
+update part_c_100_200 set c = c - 20, d = c where c = 105;
+-- fail (no partition key update, so no attempt to move tuple, but "a = 'a'" violates partition constraint enforced by root partition)
+update part_b_10_b_20 set a = 'a';
+-- success; partition key update, no constraint violation
+update range_parted set d = d - 10 where d > 10;
+-- success; no partition key update, no constraint violation
+update range_parted set e = d;
+-- No rows found:
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+:show_data;
+
+-- fail (row movement happens only within the partition subtree):
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+:show_data;
+
+-- Update the partition key using an updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement, check option passes
+update upview set a = 'b', b = 15 where b = 4;
+
+:show_data;
+
+-- cleanup
+drop view upview;
+
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+:show_data;
+
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+
+-- Enabling OLD TABLE capture for both DELETE and UPDATE statement triggers
+-- should not cause deleted rows to be captured twice. The same applies to
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d1_15 before update or insert on part_d_1_15
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d15_20 before update or insert on part_d_15_20
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_d1_15 ON part_d_1_15;
+drop trigger trig_d15_20 ON part_d_15_20;
+drop function func_parted_mod_b();
+
+
+-- statement triggers with update row movement
+---------------------------------------------------
+
+:init_range_parted;
+
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_d_1_15
+create trigger d1_delete_trig
+  after delete on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_update_trig
+  after update on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_insert_trig
+  after insert on part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+create trigger d15_delete_trig
+  after delete on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_update_trig
+  after update on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_insert_trig
+  after insert on part_d_15_20 for each statement execute procedure trigfunc();
+
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+:show_data;
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger d1_delete_trig ON part_d_1_15;
+drop trigger d1_update_trig ON part_d_1_15;
+drop trigger d1_insert_trig ON part_d_1_15;
+drop trigger d15_delete_trig ON part_d_15_20;
+drop trigger d15_update_trig ON part_d_15_20;
+drop trigger d15_insert_trig ON part_d_15_20;
+
+drop table mintab;
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
 
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
 insert into range_parted values ('c', 9);
@@ -134,6 +338,21 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 
+:show_data;
+
+-- Update row movement from non-default to default partition.
+-- Fail, default partition is not under part_a_10_a_20.
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+:show_data;
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+:show_data;
+
 create table list_parted (
 	a text,
 	b int
@@ -148,6 +367,84 @@ update list_default set a = 'a' where a = 'd';
 -- ok
 update list_default set a = 'x' where a = 'd';
 
+drop table list_parted;
+
+--------------
+-- UPDATE of partition key or non-partition columns, with different column
+-- orderings and triggers.
+--------------
+
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+
+-- Test partition constraint violation when an intermediate ancestor is used
+-- and the constraint is inherited from the upper root.
+update sub_parted set a = 2 where c = 10;
+
+-- UPDATE that does not modify the partition key of the partitions chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+
+
+-----------
+-- Triggers can cause UPDATE row movement if they modify the partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger parted_mod_b ON sub_part1;
+
+-- If a BR DELETE trigger prevents the DELETE from happening, we should also
+-- skip the INSERT if that DELETE is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger trig_skip_delete ON sub_part1;
+
+-- UPDATE of partition key with a FROM clause. If the join produces multiple
+-- output rows for the same row to be modified, the row should be tuple-routed
+-- only once, and no extra rows should be inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+drop table non_parted;
+
+drop function func_parted_mod_b();
+
 -- create custom operator class and hash function, for the same reason
 -- explained in alter_table.sql
 create or replace function dummy_hashint4(a int4, seed int8) returns int8 as
@@ -169,6 +466,7 @@ insert into hpart4 values (3, 4);
 
 -- fail
 update hpart1 set a = 3, b=4 where a = 1;
+-- ok : row movement
 update hash_parted set b = b - 1 where b = 1;
 -- ok
 update hash_parted set b = b + 8 where b = 1;
#199 Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Khandekar (#197)
1 attachment(s)
Re: [HACKERS] UPDATE of partition key

Thanks Amit.

Looking at the latest v25 patch.

On 2017/11/16 23:50, Amit Khandekar wrote:

Below are the responses to both Amit's and David's comments, starting
with Amit's.
On 2 November 2017 at 12:40, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/10/24 0:15, Amit Khandekar wrote:

On 16 October 2017 at 08:28, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

+ (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))

Is there some reason why a bitwise operator is used here?

That exact condition means the function is being called for transition
capture of updated rows that are being moved to another partition. In
that scenario, exactly one of oldtup and newtup is NULL, and I wanted
to capture precisely that condition there. I think the bitwise operator
is more reader-friendly in emphasizing that it is indeed an "either a
or b, but not both" condition.

I see. In that case, since this patch adds the new condition, a note
about it in the comment just above would be good, because the situation
you describe here seems to arise only during update-tuple-routing, IIUC.

Done. Please check.

Looks fine.

+ * 'update_rri' has the UPDATE per-subplan result rels. These are re-used
+ *      instead of allocating new ones while generating the array of all leaf
+ *      partition result rels.

Instead of:

"These are re-used instead of allocating new ones while generating the
array of all leaf partition result rels."

how about:

"There is no need to allocate a new ResultRellInfo entry for leaf
partitions for which one already exists in this array"

Ok. I have made it like this :

+ * 'update_rri' contains the UPDATE per-subplan result rels. For the output param
+ *             'partitions', we don't allocate new ResultRelInfo objects for
+ *             leaf partitions for which they are already available in 'update_rri'.

Sure.

It looks like the interface does not simplify much, and on top of
that, the function ends up with more lines. Also, the caller anyway
has to be aware of whether map_index is an index into the leaf
partitions or into the update subplans. So it is not as if the caller
does not have to know whether the mapping should be
mt_persubplan_childparent_maps or mt_perleaf_parentchild_maps.

Hmm, I think we should try to make it so that the caller doesn't have to
be aware of that. And by caller I guess you mean ExecInsert(), which,
IMHO, is not the place to introduce a lot of new logic specific to
update tuple routing.

I think that, since we have already given ExecInsert() the job of
routing the tuple from the root partitioned table to a partition, it
makes sense to give the function the additional job of routing the
tuple from any partition to any other partition. ExecInsert() can be
looked at as doing this job: "insert a tuple into the right partition;
the original tuple can belong to any partition".

Yeah, that's one way of looking at it. But I think ExecInsert() as it
is today assumes it has got a *new* tuple, and that's it. The newly
introduced code in it, which detects that this is not so (that the
tuple actually comes from some other partition and that this is really
an update turned into a delete plus insert) and then switches to the
root partitioned table's ResultRelInfo, etc., really belongs outside of
it. Maybe in its caller, which is ExecUpdate(). I mean, why not add
this code to the block in ExecUpdate() that handles update row movement?

Just before calling ExecInsert() to do the re-routing seems like a good
place to do all that. For example, try the attached incremental patch
that applies on top of yours. After applying it, I can see that the
diffs to ExecInsert() are just some refactoring ones, with no
significant additions, so it no longer looks as if supporting update
row movement required substantial changes to how ExecInsert() itself works.

After doing the changes for the int[] array map in the previous patch
version, I still feel that ConvertPartitionTupleSlot() should be
retained; it saves some repeated lines of code.

OK.

You may be right, but I see for WithCheckOptions initialization
specifically that the non-tuple-routing code passes the actual sub-plan
when initializing the WCO for a given result rel.

Yes, that's true. The problem with WithCheckOptions for newly allocated
partition result rels is: we can't use a subplan for the parent
parameter because there is no subplan for it. But I will still think
on it a bit more (TODO).

Alright.

I think you are suggesting we do it the way it's done in
is_partition_attrs(). Can you please let me know the other places where
we do it the same way? I couldn't find any.

OK, not as many as I thought there would be, but there are the following
besides is_partition_attrs():

partition.c: get_range_nulltest()
partition.c: get_qual_for_range()
relcache.c: RelationBuildPartitionKey()

Ok, I think I will first address Robert's suggestion of re-using
is_partition_attrs() for pull_child_partition_columns(). If I do that,
this discussion won't be applicable, so I am deferring this one.
(TODO)

Sure, no problem.

Thanks,
Amit

Attachments:

v25-delta-pass-root-from-ExecUpdate.patch (text/plain; charset=UTF-8)
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index a0d8259663..09d16f4509 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -327,37 +327,6 @@ ExecInsert(ModifyTableState *mtstate,
 	if (mtstate->mt_partition_dispatch_info)
 	{
 		int			leaf_part_index;
-		ResultRelInfo *rootResultRelInfo;
-
-		/*
-		 * For UPDATE, the resultRelInfo is not the root partitioned table, so
-		 * we should convert the tuple into root's tuple descriptor, since
-		 * ExecFindPartition() starts the search from root.  The tuple
-		 * conversion map list is in the order of mtstate->resultRelInfo[], so
-		 * to retrieve the one for this resultRel, we need to know the position
-		 * of the resultRel in mtstate->resultRelInfo[].
-		 */
-		if (mtstate->operation == CMD_UPDATE)
-		{
-			int			map_index = resultRelInfo - mtstate->resultRelInfo;
-			TupleConversionMap *tupconv_map;
-
-			Assert(mtstate->rootResultRelInfo != NULL);
-			rootResultRelInfo = mtstate->rootResultRelInfo;
-
-			/* resultRelInfo must be one of the per-subplan result rels. */
-			Assert(resultRelInfo >= mtstate->resultRelInfo &&
-				   resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1);
-
-			tupconv_map = tupconv_map_for_subplan(mtstate, map_index);
-			tuple = ConvertPartitionTupleSlot(mtstate,
-											  tupconv_map,
-											  tuple,
-											  mtstate->mt_root_tuple_slot,
-											  &slot);
-		}
-		else
-			rootResultRelInfo = resultRelInfo;
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
@@ -367,7 +336,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
-		leaf_part_index = ExecFindPartition(rootResultRelInfo,
+		leaf_part_index = ExecFindPartition(resultRelInfo,
 											mtstate->mt_partition_dispatch_info,
 											slot,
 											estate);
@@ -1178,6 +1147,8 @@ lreplace:;
 		{
 			bool		tuple_deleted;
 			TupleTableSlot *ret_slot;
+			TupleConversionMap *tupconv_map;
+			int			subplan_off;
 
 			/*
 			 * When an UPDATE is run with a leaf partition, we would not have
@@ -1227,8 +1198,30 @@ lreplace:;
 			if (mtstate->mt_transition_capture)
 				saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
 
+
+			/*
+			 * For UPDATE, the resultRelInfo is not the root partitioned
+			 * table, so we should convert the tuple into root's tuple
+			 * descriptor, since ExecInsert() starts the search from root.
+			 */
+			subplan_off = resultRelInfo - mtstate->resultRelInfo;
+			tupconv_map = tupconv_map_for_subplan(mtstate, subplan_off);
+			tuple = ConvertPartitionTupleSlot(mtstate,
+											  tupconv_map,
+											  tuple,
+											  mtstate->mt_root_tuple_slot,
+											  &slot);
+
+			/*
+			 * Make it look like to ExecInsert() that we are inserting the
+			 * tuple into the root table.
+			 */
+			Assert(mtstate->rootResultRelInfo != NULL);
+			estate->es_result_relation_info = mtstate->rootResultRelInfo;
 			ret_slot = ExecInsert(mtstate, slot, planSlot, NULL,
 								  ONCONFLICT_NONE, estate, canSetTag);
+			/* Restore for the next tuple. */
+			estate->es_result_relation_info = resultRelInfo;
 
 			if (mtstate->mt_transition_capture)
 			{
#200 Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#198)
1 attachment(s)
Re: [HACKERS] UPDATE of partition key

On 21 November 2017 at 17:24, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 13 November 2017 at 18:25, David Rowley <david.rowley@2ndquadrant.com> wrote:

30. The following chunk of code is giving me a headache trying to
verify which arrays are which size:

ExecSetupPartitionTupleRouting(rel,
mtstate->resultRelInfo,
(operation == CMD_UPDATE ? nplans : 0),
node->nominalRelation,
estate,
&partition_dispatch_info,
&partitions,
&partition_tupconv_maps,
&subplan_leaf_map,
&partition_tuple_slot,
&num_parted, &num_partitions);
mtstate->mt_partition_dispatch_info = partition_dispatch_info;
mtstate->mt_num_dispatch = num_parted;
mtstate->mt_partitions = partitions;
mtstate->mt_num_partitions = num_partitions;
mtstate->mt_parentchild_tupconv_maps = partition_tupconv_maps;
mtstate->mt_subplan_partition_offsets = subplan_leaf_map;
mtstate->mt_partition_tuple_slot = partition_tuple_slot;
mtstate->mt_root_tuple_slot = MakeTupleTableSlot();

I know this patch is not completely responsible for it, but you're not
making things any better.

Would it not be better to invent some PartitionTupleRouting struct and
make that struct a member of ModifyTableState and CopyState, then just
pass the pointer to that struct to ExecSetupPartitionTupleRouting()
and have it fill in the required details? I think the complexity of
this is already on the high end, I think you really need to do the
refactor before this gets any worse.

OK, I am currently working on this change, so it is not yet included in the attached patch. I will send another revised patch for it.

The attached incremental patch encapsulate_partinfo.patch (to be applied
over the latest v25 patch) contains changes to move all the
partition-related information into a new structure,
PartitionTupleRouting. Further to that, I also moved
PartitionDispatchInfo into this structure. So it looks like this:

typedef struct PartitionTupleRouting
{
PartitionDispatch *partition_dispatch_info;
int num_dispatch;
ResultRelInfo **partitions;
int num_partitions;
TupleConversionMap **parentchild_tupconv_maps;
int *subplan_partition_offsets;
TupleTableSlot *partition_tuple_slot;
TupleTableSlot *root_tuple_slot;
} PartitionTupleRouting;

So this structure now encapsulates *all* the
partition-tuple-routing-related information, and ModifyTableState now
has only one tuple-routing-related field, 'mt_partition_tuple_routing'.
It is changed like this:

@@ -976,24 +976,14 @@ typedef struct ModifyTableState
        TupleTableSlot *mt_existing;    /* slot to store existing target tuple in */
        List       *mt_excludedtlist;   /* the excluded pseudo relation's tlist  */
        TupleTableSlot *mt_conflproj;   /* CONFLICT ... SET ... projection target */
-       struct PartitionDispatchData **mt_partition_dispatch_info;
-       /* Tuple-routing support info */
-       int                     mt_num_dispatch;        /* Number of entries in the above array */
-       int                     mt_num_partitions;      /* Number of members in the following
-                                                        * arrays */
-       ResultRelInfo **mt_partitions;  /* Per partition result relation pointers */
-       TupleTableSlot *mt_partition_tuple_slot;
-       TupleTableSlot *mt_root_tuple_slot;
+       struct PartitionTupleRouting *mt_partition_tuple_routing; /* Tuple-routing support info */
        struct TransitionCaptureState *mt_transition_capture;
        /* controls transition table population for specified operation */
        struct TransitionCaptureState *mt_oc_transition_capture;
        /* controls transition table population for INSERT...ON CONFLICT UPDATE */
-       TupleConversionMap **mt_parentchild_tupconv_maps;
-       /* Per partition map for tuple conversion from root to leaf */
        TupleConversionMap **mt_childparent_tupconv_maps;
        /* Per plan/partition map for tuple conversion from child to root */
        bool            mt_is_tupconv_perpart;  /* Is the above map per-partition ? */
-       int             *mt_subplan_partition_offsets;
-       /* Stores position of update result rels in leaf partitions */
 } ModifyTableState;

The code in nodeModifyTable.c (and similar code in copy.c) is adjusted
accordingly to use mtstate->mt_partition_tuple_routing.

In the places where we used the (mtstate->mt_partition_dispatch_info !=
NULL) condition to run tuple-routing code, I have replaced it with
(mtstate->mt_partition_tuple_routing != NULL).

If you are ok with the incremental patch, I can extract this change
into a separate preparatory patch to be applied on PG master.

Thanks
-Amit Khandekar

Attachments:

encapsulate_partinfo.patch (application/octet-stream)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 2854f21..39c2921 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -165,12 +165,9 @@ typedef struct CopyStateData
 	bool		volatile_defexprs;	/* is any of defexprs volatile? */
 	List	   *range_table;
 
-	PartitionDispatch *partition_dispatch_info;
-	int			num_dispatch;	/* Number of entries in the above array */
-	int			num_partitions; /* Number of members in the following arrays */
-	ResultRelInfo **partitions;	/* Per partition result relation pointers */
-	TupleConversionMap **partition_tupconv_maps;
-	TupleTableSlot *partition_tuple_slot;
+	PartitionTupleRouting *partition_tuple_routing; /* all tuple-routing info
+													 * for partitions.
+													 */
 	TransitionCaptureState *transition_capture;
 	TupleConversionMap **transition_tupconv_maps;
 
@@ -2471,30 +2468,16 @@ CopyFrom(CopyState cstate)
 	 */
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo **partitions;
-		TupleConversionMap **partition_tupconv_maps;
-		TupleTableSlot *partition_tuple_slot;
-		int			num_parted,
-					num_partitions;
+		PartitionTupleRouting *ptr;
 
 		ExecSetupPartitionTupleRouting(cstate->rel,
 									   NULL,
 									   0,
 									   1,
 									   estate,
-									   &partition_dispatch_info,
-									   &partitions,
-									   &partition_tupconv_maps,
-									   NULL,
-									   &partition_tuple_slot,
-									   &num_parted, &num_partitions);
-		cstate->partition_dispatch_info = partition_dispatch_info;
-		cstate->num_dispatch = num_parted;
-		cstate->partitions = partitions;
-		cstate->num_partitions = num_partitions;
-		cstate->partition_tupconv_maps = partition_tupconv_maps;
-		cstate->partition_tuple_slot = partition_tuple_slot;
+									   &cstate->partition_tuple_routing);
+
+		ptr = cstate->partition_tuple_routing;
 
 		/*
 		 * If we are capturing transition tuples, they may need to be
@@ -2507,11 +2490,11 @@ CopyFrom(CopyState cstate)
 			int			i;
 
 			cstate->transition_tupconv_maps = (TupleConversionMap **)
-				palloc0(sizeof(TupleConversionMap *) * cstate->num_partitions);
-			for (i = 0; i < cstate->num_partitions; ++i)
+				palloc0(sizeof(TupleConversionMap *) * ptr->num_partitions);
+			for (i = 0; i < ptr->num_partitions; ++i)
 			{
 				cstate->transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(cstate->partitions[i]->ri_RelationDesc),
+					convert_tuples_by_name(RelationGetDescr(ptr->partitions[i]->ri_RelationDesc),
 										   RelationGetDescr(cstate->rel),
 										   gettext_noop("could not convert row type"));
 			}
@@ -2531,7 +2514,7 @@ CopyFrom(CopyState cstate)
 	if ((resultRelInfo->ri_TrigDesc != NULL &&
 		 (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
 		  resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
-		cstate->partition_dispatch_info != NULL ||
+		cstate->partition_tuple_routing != NULL ||
 		cstate->volatile_defexprs)
 	{
 		useHeapMultiInsert = false;
@@ -2606,10 +2589,11 @@ CopyFrom(CopyState cstate)
 		ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 
 		/* Determine the partition to heap_insert the tuple into */
-		if (cstate->partition_dispatch_info)
+		if (cstate->partition_tuple_routing)
 		{
 			int			leaf_part_index;
 			TupleConversionMap *map;
+			PartitionTupleRouting *ptr = cstate->partition_tuple_routing;
 
 			/*
 			 * Away we go ... If we end up not finding a partition after all,
@@ -2620,11 +2604,11 @@ CopyFrom(CopyState cstate)
 			 * partition, respectively.
 			 */
 			leaf_part_index = ExecFindPartition(resultRelInfo,
-												cstate->partition_dispatch_info,
+												ptr->partition_dispatch_info,
 												slot,
 												estate);
 			Assert(leaf_part_index >= 0 &&
-				   leaf_part_index < cstate->num_partitions);
+				   leaf_part_index < ptr->num_partitions);
 
 			/*
 			 * If this tuple is mapped to a partition that is not same as the
@@ -2642,7 +2626,7 @@ CopyFrom(CopyState cstate)
 			 * to the selected partition.
 			 */
 			saved_resultRelInfo = resultRelInfo;
-			resultRelInfo = cstate->partitions[leaf_part_index];
+			resultRelInfo = ptr->partitions[leaf_part_index];
 
 			/* We do not yet have a way to insert into a foreign partition */
 			if (resultRelInfo->ri_FdwRoutine)
@@ -2689,7 +2673,7 @@ CopyFrom(CopyState cstate)
 			 * We might need to convert from the parent rowtype to the
 			 * partition rowtype.
 			 */
-			map = cstate->partition_tupconv_maps[leaf_part_index];
+			map = ptr->parentchild_tupconv_maps[leaf_part_index];
 			if (map)
 			{
 				Relation	partrel = resultRelInfo->ri_RelationDesc;
@@ -2701,7 +2685,7 @@ CopyFrom(CopyState cstate)
 				 * point on.  Use a dedicated slot from this point on until
 				 * we're finished dealing with the partition.
 				 */
-				slot = cstate->partition_tuple_slot;
+				slot = ptr->partition_tuple_slot;
 				Assert(slot != NULL);
 				ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
 				ExecStoreTuple(tuple, slot, InvalidBuffer, true);
@@ -2853,8 +2837,9 @@ CopyFrom(CopyState cstate)
 	ExecCloseIndices(resultRelInfo);
 
 	/* Close all the partitioned tables, leaf partitions, and their indices */
-	if (cstate->partition_dispatch_info)
+	if (cstate->partition_tuple_routing)
 	{
+		PartitionTupleRouting *ptr = cstate->partition_tuple_routing;
 		int			i;
 
 		/*
@@ -2863,23 +2848,23 @@ CopyFrom(CopyState cstate)
 		 * the main target table of COPY that will be closed eventually by
 		 * DoCopy().  Also, tupslot is NULL for the root partitioned table.
 		 */
-		for (i = 1; i < cstate->num_dispatch; i++)
+		for (i = 1; i < ptr->num_dispatch; i++)
 		{
-			PartitionDispatch pd = cstate->partition_dispatch_info[i];
+			PartitionDispatch pd = ptr->partition_dispatch_info[i];
 
 			heap_close(pd->reldesc, NoLock);
 			ExecDropSingleTupleTableSlot(pd->tupslot);
 		}
-		for (i = 0; i < cstate->num_partitions; i++)
+		for (i = 0; i < ptr->num_partitions; i++)
 		{
-			ResultRelInfo *resultRelInfo = cstate->partitions[i];
+			ResultRelInfo *resultRelInfo = ptr->partitions[i];
 
 			ExecCloseIndices(resultRelInfo);
 			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 		}
 
 		/* Release the standalone partition tuple descriptor */
-		ExecDropSingleTupleTableSlot(cstate->partition_tuple_slot);
+		ExecDropSingleTupleTableSlot(ptr->partition_tuple_slot);
 	}
 
 	/* Close any trigger target relations */
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 2ac7484..3b72547 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -49,22 +49,9 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
  *      INSERT.
  *
  * Output arguments:
- * 'pd' receives an array of PartitionDispatch objects with one entry for
- *		every partitioned table in the partition tree
- * 'partitions' receives an array of ResultRelInfo* objects with one entry for
- *		every leaf partition in the partition tree
- * 'tup_conv_maps' receives an array of TupleConversionMap objects with one
- *		entry for every leaf partition (required to convert input tuple based
- *		on the root table's rowtype to a leaf partition's rowtype after tuple
- *		routing is done)
- * 'partition_tuple_slot' receives a standalone TupleTableSlot to be used
- *		to manipulate any given leaf partition's rowtype after that partition
- *		is chosen by tuple-routing.
- * 'num_parted' receives the number of partitioned tables in the partition
- *		tree (= the number of entries in the 'pd' output array)
- * 'num_partitions' receives the number of leaf partitions in the partition
- *		tree (= the number of entries in the 'partitions' and 'tup_conv_maps'
- *		output arrays
+ *
+ * 'partition_tuple_routing' encapsulates all the partition-related information
+ *		required to do tuple routing.
  *
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
@@ -75,12 +62,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
 							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
-							   PartitionDispatch **pd,
-							   ResultRelInfo ***partitions,
-							   TupleConversionMap ***tup_conv_maps,
-							   int **subplan_leaf_map,
-							   TupleTableSlot **partition_tuple_slot,
-							   int *num_parted, int *num_partitions)
+							   PartitionTupleRouting **partition_tuple_routing)
 {
 	TupleDesc	tupDesc = RelationGetDescr(rel);
 	List	   *leaf_parts;
@@ -89,20 +71,23 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	ResultRelInfo *leaf_part_arr = NULL;
 	int			update_rri_index = 0;
 	bool		is_update = (num_update_rri > 0);
+	PartitionTupleRouting *ptr;
 
 	/*
 	 * Get the information about the partition tree after locking all the
 	 * partitions.
 	 */
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-	*pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
-	*num_partitions = list_length(leaf_parts);
-	if (subplan_leaf_map)
-		*subplan_leaf_map = NULL;
-	*partitions = (ResultRelInfo **) palloc(*num_partitions *
+	ptr = *partition_tuple_routing =
+		(PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
+	ptr->partition_dispatch_info =
+		RelationGetPartitionDispatchInfo(rel, &ptr->num_dispatch, &leaf_parts);
+	ptr->num_partitions = list_length(leaf_parts);
+	ptr->partitions = (ResultRelInfo **) palloc(ptr->num_partitions *
 											sizeof(ResultRelInfo *));
-	*tup_conv_maps = (TupleConversionMap **) palloc0(*num_partitions *
-													 sizeof(TupleConversionMap *));
+	ptr->parentchild_tupconv_maps =
+		(TupleConversionMap **) palloc0(ptr->num_partitions *
+										sizeof(TupleConversionMap *));
 
 	if (is_update)
 	{
@@ -123,7 +108,13 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		 * Prepare for generating the mapping from subplan result rels to leaf
 		 * partition position.
 		 */
-		*subplan_leaf_map = palloc(num_update_rri * sizeof(int));
+		ptr->subplan_partition_offsets = palloc(num_update_rri * sizeof(int));
+
+		/*
+		 * For UPDATEs, we require an additional tuple slot for storing
+		 * transient tuples that are converted to the root table descriptor.
+		 */
+		ptr->root_tuple_slot = MakeTupleTableSlot();
 	}
 	else
 	{
@@ -132,7 +123,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		 * repeated pallocs by allocating memory for all the result rels in
 		 * bulk.
 		 */
-		leaf_part_arr = (ResultRelInfo *) palloc0(*num_partitions *
+		leaf_part_arr = (ResultRelInfo *) palloc0(ptr->num_partitions *
 												  sizeof(ResultRelInfo));
 	}
 
@@ -142,7 +133,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	 * (such as ModifyTableState) and released when the node finishes
 	 * processing.
 	 */
-	*partition_tuple_slot = MakeTupleTableSlot();
+	ptr->partition_tuple_slot = MakeTupleTableSlot();
 
 	i = 0;
 	foreach(cell, leaf_parts)
@@ -173,7 +164,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
 				 * Save the position of this update rel in the leaf partitions
 				 * array
 				 */
-				(*subplan_leaf_map)[update_rri_index] = i;
+				ptr->subplan_partition_offsets[update_rri_index] = i;
 
 				update_rri_index++;
 			}
@@ -211,7 +202,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		 * Save a tuple conversion map to convert a tuple routed to this
 		 * partition from the parent's type to the partition's.
 		 */
-		(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
+		ptr->parentchild_tupconv_maps[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
 													 gettext_noop("could not convert row type"));
 
 		/*
@@ -233,7 +224,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		(*partitions)[i] = leaf_part_rri;
+		ptr->partitions[i] = leaf_part_rri;
 		i++;
 	}
 
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index a0d8259..75269bf 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -324,10 +324,11 @@ ExecInsert(ModifyTableState *mtstate,
 	resultRelInfo = estate->es_result_relation_info;
 
 	/* Determine the partition to heap_insert the tuple into */
-	if (mtstate->mt_partition_dispatch_info)
+	if (mtstate->mt_partition_tuple_routing)
 	{
 		int			leaf_part_index;
 		ResultRelInfo *rootResultRelInfo;
+		PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;
 
 		/*
 		 * For UPDATE, the resultRelInfo is not the root partitioned table, so
@@ -353,7 +354,7 @@ ExecInsert(ModifyTableState *mtstate,
 			tuple = ConvertPartitionTupleSlot(mtstate,
 											  tupconv_map,
 											  tuple,
-											  mtstate->mt_root_tuple_slot,
+											  ptr->root_tuple_slot,
 											  &slot);
 		}
 		else
@@ -363,23 +364,23 @@ ExecInsert(ModifyTableState *mtstate,
 		 * Away we go ... If we end up not finding a partition after all,
 		 * ExecFindPartition() does not return and errors out instead.
 		 * Otherwise, the returned value is to be used as an index into arrays
-		 * mt_partitions[] and mt_parentchild_tupconv_maps[] that will get us
-		 * the ResultRelInfo and TupleConversionMap for the partition,
+		 * ptr->partitions[] and ptr->parentchild_tupconv_maps[] that will get
+		 * us the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
 		leaf_part_index = ExecFindPartition(rootResultRelInfo,
-											mtstate->mt_partition_dispatch_info,
+											ptr->partition_dispatch_info,
 											slot,
 											estate);
 		Assert(leaf_part_index >= 0 &&
-			   leaf_part_index < mtstate->mt_num_partitions);
+			   leaf_part_index < ptr->num_partitions);
 
 		/*
 		 * Save the old ResultRelInfo and switch to the one corresponding to
 		 * the selected partition.
 		 */
 		saved_resultRelInfo = resultRelInfo;
-		resultRelInfo = mtstate->mt_partitions[leaf_part_index];
+		resultRelInfo = ptr->partitions[leaf_part_index];
 
 		/* We do not yet have a way to insert into a foreign partition */
 		if (resultRelInfo->ri_FdwRoutine)
@@ -433,9 +434,9 @@ ExecInsert(ModifyTableState *mtstate,
 		 * rowtype.
 		 */
 		tuple = ConvertPartitionTupleSlot(mtstate,
-										  mtstate->mt_parentchild_tupconv_maps[leaf_part_index],
+										  ptr->parentchild_tupconv_maps[leaf_part_index],
 										  tuple,
-										  mtstate->mt_partition_tuple_slot,
+										  ptr->partition_tuple_slot,
 										  &slot);
 	}
 
@@ -1184,7 +1185,7 @@ lreplace:;
 			 * partition tuple routing setup. In that case, fail with
 			 * partition constraint violation error.
 			 */
-			if (mtstate->mt_partition_dispatch_info == NULL)
+			if (mtstate->mt_partition_tuple_routing == NULL)
 				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 			/* Do the row movement. */
@@ -1702,13 +1703,14 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 		mtstate->mt_oc_transition_capture != NULL)
 	{
 		int			numResultRelInfos;
+		PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;
 
-		numResultRelInfos = (mtstate->mt_partition_tuple_slot != NULL ?
-							 mtstate->mt_num_partitions :
+		numResultRelInfos = (ptr != NULL ?
+							 ptr->num_partitions :
 							 mtstate->mt_nplans);
 
 		ExecSetupChildParentMap(mtstate, targetRelInfo, numResultRelInfos,
-								(mtstate->mt_partition_dispatch_info != NULL));
+								(ptr != NULL));
 
 		/*
 		 * Install the conversion map for the first plan for UPDATE and DELETE
@@ -1751,7 +1753,7 @@ ExecSetupChildParentMap(ModifyTableState *mtstate,
 		 * has to be per-leaf. If that map is per-subplan, we won't be able to
 		 * access the maps leaf-partition-wise. But if the map is per-leaf, we
 		 * will be able to access the maps subplan-wise using the
-		 * mt_subplan_partition_offsets map using function
+		 * subplan_partition_offsets map using function
 		 * tupconv_map_for_subplan().  So if the callers might need to access
 		 * the map both leaf-partition-wise and subplan-wise, they should make
 		 * sure that the first time this function is called, it should be
@@ -1781,7 +1783,10 @@ ExecSetupChildParentMap(ModifyTableState *mtstate,
 		 * For tuple routing among partitions, we need TupleDescs based
 		 * on the partition routing table.
 		 */
-		ResultRelInfo **resultRelInfos = mtstate->mt_partitions;
+		ResultRelInfo **resultRelInfos;
+
+		Assert(mtstate->mt_partition_tuple_routing != NULL);
+		resultRelInfos = mtstate->mt_partition_tuple_routing->partitions;
 
 		for (i = 0; i < numResultRelInfos; ++i)
 		{
@@ -1828,11 +1833,12 @@ tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
 	if (mtstate->mt_is_tupconv_perpart)
 	{
 		int leaf_index;
+		PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;
 
-		Assert(mtstate->mt_subplan_partition_offsets != NULL);
-		leaf_index = mtstate->mt_subplan_partition_offsets[whichplan];
+		Assert(ptr && ptr->subplan_partition_offsets != NULL);
+		leaf_index = ptr->subplan_partition_offsets[whichplan];
 
-		Assert(leaf_index >= 0 && leaf_index < mtstate->mt_num_partitions);
+		Assert(leaf_index >= 0 && leaf_index < ptr->num_partitions);
 		return mtstate->mt_childparent_tupconv_maps[leaf_index];
 	}
 	else
@@ -2119,6 +2125,8 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	int			i;
 	Relation	rel;
 	bool		update_tuple_routing_needed = node->partKeyUpdated;
+	PartitionTupleRouting *ptr = NULL;
+	int			num_partitions = 0;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -2252,33 +2260,15 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
 		(operation == CMD_INSERT || update_tuple_routing_needed))
 	{
-		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo **partitions;
-		TupleConversionMap **partition_tupconv_maps;
-		int *subplan_leaf_map;
-		TupleTableSlot *partition_tuple_slot;
-		int			num_parted,
-					num_partitions;
-
 		ExecSetupPartitionTupleRouting(rel,
 									   mtstate->resultRelInfo,
 									   (operation == CMD_UPDATE ? nplans : 0),
 									   node->nominalRelation,
 									   estate,
-									   &partition_dispatch_info,
-									   &partitions,
-									   &partition_tupconv_maps,
-									   &subplan_leaf_map,
-									   &partition_tuple_slot,
-									   &num_parted, &num_partitions);
-		mtstate->mt_partition_dispatch_info = partition_dispatch_info;
-		mtstate->mt_num_dispatch = num_parted;
-		mtstate->mt_partitions = partitions;
-		mtstate->mt_num_partitions = num_partitions;
-		mtstate->mt_parentchild_tupconv_maps = partition_tupconv_maps;
-		mtstate->mt_subplan_partition_offsets = subplan_leaf_map;
-		mtstate->mt_partition_tuple_slot = partition_tuple_slot;
-		mtstate->mt_root_tuple_slot = MakeTupleTableSlot();
+									   &mtstate->mt_partition_tuple_routing);
+
+		ptr = mtstate->mt_partition_tuple_routing;
+		num_partitions = ptr->num_partitions;
 
 		/*
 		 * Below are required as reference objects for mapping partition
@@ -2340,7 +2330,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * will suffice.  This only occurs for the INSERT case or for UPDATE row
 	 * movement. DELETEs and local UPDATEs are handled above.
 	 */
-	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
+	if (node->withCheckOptionLists != NIL && num_partitions > 0)
 	{
 		List	   *first_wcoList;
 
@@ -2360,14 +2350,14 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 				mtstate->mt_nplans == 1));
 
 		first_wcoList = linitial(node->withCheckOptionLists);
-		for (i = 0; i < mtstate->mt_num_partitions; i++)
+		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *mapped_wcoList;
 			List	   *wcoExprs = NIL;
 			ListCell   *ll;
 
-			resultRelInfo = mtstate->mt_partitions[i];
+			resultRelInfo = ptr->partitions[i];
 
 			/*
 			 * If we are referring to a resultRelInfo from one of the update
@@ -2445,12 +2435,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
 		firstReturningList = linitial(node->returningLists);
-		for (i = 0; i < mtstate->mt_num_partitions; i++)
+		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *rlist;
 
-			resultRelInfo = mtstate->mt_partitions[i];
+			resultRelInfo = ptr->partitions[i];
 
 			/*
 			 * If we are referring to a resultRelInfo from one of the update
@@ -2733,41 +2723,46 @@ ExecEndModifyTable(ModifyTableState *node)
 	/*
 	 * Close all the partitioned tables, leaf partitions, and their indices
 	 *
-	 * Remember node->mt_partition_dispatch_info[0] corresponds to the root
+	 * Remember ptr->partition_dispatch_info[0] corresponds to the root
 	 * partitioned table, which we must not try to close, because it is the
 	 * main target table of the query that will be closed by ExecEndPlan().
 	 * Also, tupslot is NULL for the root partitioned table.
 	 */
-	for (i = 1; i < node->mt_num_dispatch; i++)
+	if (node->mt_partition_tuple_routing)
 	{
-		PartitionDispatch pd = node->mt_partition_dispatch_info[i];
+		PartitionTupleRouting *ptr = node->mt_partition_tuple_routing;
 
-		heap_close(pd->reldesc, NoLock);
-		ExecDropSingleTupleTableSlot(pd->tupslot);
-	}
-	for (i = 0; i < node->mt_num_partitions; i++)
-	{
-		ResultRelInfo *resultRelInfo = node->mt_partitions[i];
+		for (i = 1; i < ptr->num_dispatch; i++)
+		{
+			PartitionDispatch pd = ptr->partition_dispatch_info[i];
 
-		/*
-		 * If this result rel is one of the subplan result rels, let
-		 * ExecEndPlan() close it. For INSERTs, this does not apply because
-		 * leaf partition result rels are always newly allocated.
-		 */
-		if (operation == CMD_UPDATE &&
-			resultRelInfo >= node->resultRelInfo &&
-			resultRelInfo < node->resultRelInfo + node->mt_nplans)
-			continue;
+			heap_close(pd->reldesc, NoLock);
+			ExecDropSingleTupleTableSlot(pd->tupslot);
+		}
+		for (i = 0; i < ptr->num_partitions; i++)
+		{
+			ResultRelInfo *resultRelInfo = ptr->partitions[i];
 
-		ExecCloseIndices(resultRelInfo);
-		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
-	}
+			/*
+			 * If this result rel is one of the subplan result rels, let
+			 * ExecEndPlan() close it. For INSERTs, this does not apply because
+			 * leaf partition result rels are always newly allocated.
+			 */
+			if (operation == CMD_UPDATE &&
+				resultRelInfo >= node->resultRelInfo &&
+				resultRelInfo < node->resultRelInfo + node->mt_nplans)
+				continue;
+
+			ExecCloseIndices(resultRelInfo);
+			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
+		}
 
-	/* Release the standalone partition tuple descriptors, if any */
-	if (node->mt_root_tuple_slot)
-		ExecDropSingleTupleTableSlot(node->mt_root_tuple_slot);
-	if (node->mt_partition_tuple_slot)
-		ExecDropSingleTupleTableSlot(node->mt_partition_tuple_slot);
+		/* Release the standalone partition tuple descriptors, if any */
+		if (ptr->root_tuple_slot)
+			ExecDropSingleTupleTableSlot(ptr->root_tuple_slot);
+		if (ptr->partition_tuple_slot)
+			ExecDropSingleTupleTableSlot(ptr->partition_tuple_slot);
+	}
 
 	/*
 	 * Free the exprcontext
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 41be2cf..7e69c48 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -49,17 +49,51 @@ typedef struct PartitionDispatchData
 
 typedef struct PartitionDispatchData *PartitionDispatch;
 
+/*-----------------------
+ * PartitionTupleRouting - Encapsulates all information required to execute
+ * tuple-routing between partitions.
+ *
+ * partition_dispatch_info		Array of PartitionDispatch objects with one
+ *								entry for every partitioned table in the
+ *								partition tree.
+ * num_dispatch					number of partitioned tables in the partition
+ *								tree (= length of partition_dispatch_info[])
+ * partitions					Array of ResultRelInfo* objects with one entry
+ *								for every leaf partition in the partition tree.
+ * num_partitions				Number of leaf partitions in the partition tree
+ *								(= 'partitions' array length)
+ * parentchild_tupconv_maps		Array of TupleConversionMap objects with one
+ *								entry for every leaf partition (required to
+ *								convert input tuple based on the root table's
+ *								rowtype to a leaf partition's rowtype after
+ *								tuple routing is done)
+ * subplan_partition_offsets	int array, ordered by UPDATE subplans. Each
+ *								element holds the index of the corresponding
+ *								subplan's result rel in the 'partitions' array.
+ * partition_tuple_slot			TupleTableSlot to be used to manipulate any
+ *								given leaf partition's rowtype after that
+ *								partition is chosen for insertion by
+ *								tuple-routing.
+ *-----------------------
+ */
+typedef struct PartitionTupleRouting
+{
+	PartitionDispatch *partition_dispatch_info;
+	int			num_dispatch;
+	ResultRelInfo **partitions;
+	int			num_partitions;
+	TupleConversionMap **parentchild_tupconv_maps;
+	int		   *subplan_partition_offsets;
+	TupleTableSlot *partition_tuple_slot;
+	TupleTableSlot *root_tuple_slot;
+} PartitionTupleRouting;
+
 extern void ExecSetupPartitionTupleRouting(Relation rel,
 							   ResultRelInfo *update_rri,
 							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
-							   PartitionDispatch **pd,
-							   ResultRelInfo ***partitions,
-							   TupleConversionMap ***tup_conv_maps,
-							   int **subplan_leaf_map,
-							   TupleTableSlot **partition_tuple_slot,
-							   int *num_parted, int *num_partitions);
+							   PartitionTupleRouting **partition_tuple_routing);
 extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index d2e8060..64cf3dd 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -976,24 +976,14 @@ typedef struct ModifyTableState
 	TupleTableSlot *mt_existing;	/* slot to store existing target tuple in */
 	List	   *mt_excludedtlist;	/* the excluded pseudo relation's tlist  */
 	TupleTableSlot *mt_conflproj;	/* CONFLICT ... SET ... projection target */
-	struct PartitionDispatchData **mt_partition_dispatch_info;
-	/* Tuple-routing support info */
-	int			mt_num_dispatch;	/* Number of entries in the above array */
-	int			mt_num_partitions;	/* Number of members in the following
-									 * arrays */
-	ResultRelInfo **mt_partitions;	/* Per partition result relation pointers */
-	TupleTableSlot *mt_partition_tuple_slot;
-	TupleTableSlot *mt_root_tuple_slot;
+	struct PartitionTupleRouting *mt_partition_tuple_routing; /* Tuple-routing support info */
 	struct TransitionCaptureState *mt_transition_capture;
 	/* controls transition table population for specified operation */
 	struct TransitionCaptureState *mt_oc_transition_capture;
 	/* controls transition table population for INSERT...ON CONFLICT UPDATE */
-	TupleConversionMap **mt_parentchild_tupconv_maps;
-	/* Per partition map for tuple conversion from root to leaf */
 	TupleConversionMap **mt_childparent_tupconv_maps;
 	/* Per plan/partition map for tuple conversion from child to root */
 	bool		mt_is_tupconv_perpart;	/* Is the above map per-partition ? */
-	int		*mt_subplan_partition_offsets;
 	/* Stores position of update result rels in leaf partitions */
 } ModifyTableState;
 
#201Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#184)
Re: [HACKERS] UPDATE of partition key

On 7 November 2017 at 00:33, Robert Haas <robertmhaas@gmail.com> wrote:

+       /* The caller must have already locked all the partitioned tables. */
+       root_rel = heap_open(root_relid, NoLock);
+       *all_part_cols = NULL;
+       foreach(lc, partitioned_rels)
+       {
+               Index           rti = lfirst_int(lc);
+               Oid                     relid = getrelid(rti, rtables);
+               Relation        part_rel = heap_open(relid, NoLock);
+
+               pull_child_partition_columns(part_rel, root_rel, all_part_cols);
+               heap_close(part_rel, NoLock);

I don't like the fact that we're opening and closing the relation here
just to get information on the partitioning columns. I think it would
be better to do this someplace that already has the relation open and
store the details in the RelOptInfo. set_relation_partition_info()
looks like the right spot.

It seems that, for UPDATE, baserel RelOptInfos are created only for the
subplan relations, not for the partitioned tables. I verified that
build_simple_rel() does not get called for partitioned tables during
UPDATE.

In earlier versions of the patch, we used to collect the partition
keys while expanding the partition tree so that we could get them
while the relations are open. After some reviews, I was inclined to
think that the collection logic better be moved out into the
inheritance_planner(), because it involved pulling the attributes from
partition key expressions, and the bitmap operation, which would be
unnecessarily done for SELECTs as well.

On the other hand, if we collect the partition keys separately in
inheritance_planner(), then as you say, we need to open the relations.
Opening a relation that is already in the relcache and already locked
involves only a hash lookup. Do you think this is expensive? I am open
to either of these approaches.

If we collect the partition keys in expand_partitioned_rtentry(), we
need to pass the root relation also, so that we can convert the
partition key attributes to the root rel descriptor. And the other
thing is, maybe we can check beforehand (in expand_inherited_rtentry)
whether the rootrte's updatedCols is empty, which I think implies that
it's not an UPDATE operation. If yes, we can just skip collecting the
partition keys.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

#202Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Khandekar (#201)
Re: [HACKERS] UPDATE of partition key

On 2017/11/23 21:57, Amit Khandekar wrote:

If we collect the partition keys in expand_partitioned_rtentry(), we
need to pass the root relation also, so that we can convert the
partition key attributes to root rel descriptor. And the other thing
is, may be, we can check beforehand (in expand_inherited_rtentry)
whether the rootrte's updatedCols is empty, which I think implies that
it's not an UPDATE operation. If yes, we can just skip collecting the
partition keys.

Yeah, it seems like a good idea after all to check in
expand_inherited_rtentry() whether the root RTE's updatedCols is non-empty
and if so check if any of the updatedCols are partition keys. If we find
some, then it will suffice to just set a simple flag in the
PartitionedChildRelInfo that will be created for the root table. That
should be done *after* we have visited all the tables in the partition
tree including some that might be partitioned and hence will provide their
partition keys. The following block in expand_inherited_rtentry() looks
like a good spot:

if (rte->inh && partitioned_child_rels != NIL)
{
PartitionedChildRelInfo *pcinfo;

pcinfo = makeNode(PartitionedChildRelInfo);

Thanks,
Amit

#203Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Langote (#202)
Re: [HACKERS] UPDATE of partition key

On 24 November 2017 at 10:52, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/11/23 21:57, Amit Khandekar wrote:

If we collect the partition keys in expand_partitioned_rtentry(), we
need to pass the root relation also, so that we can convert the
partition key attributes to root rel descriptor. And the other thing
is, may be, we can check beforehand (in expand_inherited_rtentry)
whether the rootrte's updatedCols is empty, which I think implies that
it's not an UPDATE operation. If yes, we can just skip collecting the
partition keys.

Yeah, it seems like a good idea after all to check in
expand_inherited_rtentry() whether the root RTE's updatedCols is non-empty
and if so check if any of the updatedCols are partition keys. If we find
some, then it will suffice to just set a simple flag in the
PartitionedChildRelInfo that will be created for the root table. That
should be done *after* we have visited all the tables in the partition
tree including some that might be partitioned and hence will provide their
partition keys. The following block in expand_inherited_rtentry() looks
like a good spot:

if (rte->inh && partitioned_child_rels != NIL)
{
PartitionedChildRelInfo *pcinfo;

pcinfo = makeNode(PartitionedChildRelInfo);

Yes, I am thinking about something like that. Thanks.

I am also working on your suggestion of moving the
convert-to-root-descriptor logic from ExecInsert() to ExecUpdate().

So, in the upcoming patch version, I am intending to include the above
two, and if possible, Robert's idea of re-using is_partition_attr()
for pull_child_partition_columns().

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

#204Michael Paquier
michael.paquier@gmail.com
In reply to: Amit Khandekar (#203)
Re: [HACKERS] UPDATE of partition key

On Mon, Nov 27, 2017 at 5:28 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

So, in the upcoming patch version, I am intending to include the above
two, and if possible, Robert's idea of re-using is_partition_attr()
for pull_child_partition_columns().

Discussions are still going on, so moved to next CF.
--
Michael

#205Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#203)
1 attachment(s)
Re: [HACKERS] UPDATE of partition key

On 27 November 2017 at 13:58, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 24 November 2017 at 10:52, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/11/23 21:57, Amit Khandekar wrote:

If we collect the partition keys in expand_partitioned_rtentry(), we
need to pass the root relation also, so that we can convert the
partition key attributes to root rel descriptor. And the other thing
is, may be, we can check beforehand (in expand_inherited_rtentry)
whether the rootrte's updatedCols is empty, which I think implies that
it's not an UPDATE operation. If yes, we can just skip collecting the
partition keys.

Yeah, it seems like a good idea after all to check in
expand_inherited_rtentry() whether the root RTE's updatedCols is non-empty
and if so check if any of the updatedCols are partition keys. If we find
some, then it will suffice to just set a simple flag in the
PartitionedChildRelInfo that will be created for the root table. That
should be done *after* we have visited all the tables in the partition
tree including some that might be partitioned and hence will provide their
partition keys. The following block in expand_inherited_rtentry() looks
like a good spot:

if (rte->inh && partitioned_child_rels != NIL)
{
PartitionedChildRelInfo *pcinfo;

pcinfo = makeNode(PartitionedChildRelInfo);

Yes, I am thinking about something like that. Thanks.

In expand_partitioned_rtentry(), rather than collecting the partition
key attributes of all the partitioned tables, I am now checking whether
parentrte->updatedCols has any partition key attributes. If an earlier
parentrte's updatedCols was already found to contain partition keys, we
don't continue checking.

Also, rather than converting all the partition key attributes to be
compatible with the root's tuple descriptor, we can compare against each
partitioned table's own updatedCols while we have its handle handy. Each
parentrte's updatedCols has exactly the same attributes as the root's,
just with the ordering possibly changed. So it is safe to compare using
the updatedCols of the intermediate partitioned rels rather than those
of the root rel. And the advantage is: we now get rid of the conversion
mapping from each partitioned table to the root that was done in
pull_child_partition_columns() in the previous patches.

PartitionedChildRelInfo now has is_partition_key_update field. This is
updated using get_partitioned_child_rels().

I am also working on your suggestion of moving the
convert-to-root-descriptor logic from ExecInsert() to ExecUpdate().

Done.

So, in the upcoming patch version, I am intending to include the above
two, and if possible, Robert's idea of re-using is_partition_attr()
for pull_child_partition_columns().

Done. Now, is_partition_attr() is renamed to has_partition_attrs().
This function now accepts a bitmapset of attnums instead of a single
attnum. Moved this function from tablecmds.c to partition.c. This is
now re-used, and the earlier pull_child_partition_columns() is
removed.

Attached is v26, which covers all of the above points. This patch also
contains the incremental changes from encapsulate_partinfo.patch
attached in [1]. In the next version, I will extract them out again and
keep them as a separate preparatory patch.

[1]: /messages/by-id/CAJ3gD9f86H64e4OCjFFszWW7f4EeyriSaFL8SvJs2yOUbc8VEw@mail.gmail.com

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

update-partition-key_v26.patchapplication/octet-stream; name=update-partition-key_v26.patchDownload
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index e6f50ec..1517757 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3005,6 +3005,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition, one whose partition constraint the new row satisfies.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3297,9 +3302,22 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</command> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</command> causes a row to move from one
+       partition to another, there is a chance that another concurrent
+       <command>UPDATE</command> or <command>DELETE</command> misses this row.
+       Suppose session 1 is performing an <command>UPDATE</command> on a
+       partition key, and meanwhile session 2, for which this row is
+       visible, performs an <command>UPDATE</command> or
+       <command>DELETE</command> operation on this row. Session 2 can silently
+       miss the row if the row is deleted from the partition due to session
+       1's activity.  In such a case, session 2's
+       <command>UPDATE</command>/<command>DELETE</command>, being unaware of
+       the row movement, interprets that the row has just been deleted, so
+       there is nothing to be done for this row. By contrast, in the usual
+       case where the table is not partitioned, or where there is no row
+       movement, session 2 would have identified the newly updated row and
+       carried out the <command>UPDATE</command>/<command>DELETE</command> on
+       this new row version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index c0d0f71..3c665f0 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there is no such partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</command> and
+   <command>INSERT</command> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</command> or <command>DELETE</command> on the
+   same row may miss this row. For details, see
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index bf5d3f9..aaffc4d 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -154,6 +154,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to move
+    to another partition, it will be performed as a <command>DELETE</command>
+    from the original partition followed by <command>INSERT</command> into the
+    new partition. In this case, all row-level <literal>BEFORE</literal>
+    <command>UPDATE</command> triggers and all row-level
+    <literal>BEFORE</literal> <command>DELETE</command> triggers are fired on
+    the original partition. Then all row-level <literal>BEFORE</literal>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</literal>
+    triggers are concerned, <literal>AFTER</literal> <command>DELETE</command>
+    and <literal>AFTER</literal> <command>INSERT</command> triggers are
+    applied; but <literal>AFTER</literal> <command>UPDATE</command> triggers
+    are not applied because the <command>UPDATE</command> has been converted to
+    a <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index d622305..57dc08f 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1441,7 +1441,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
 
 /*
  * map_partition_varattnos - maps varattno of any Vars in expr from the
- * parent attno to partition attno.
+ * attnos of 'from_rel' to the attnos of 'to_rel'.  Either rel can be
+ * a leaf partition or a partitioned table.
  *
  * We must allow for cases where physical attnos of a partition can be
  * different from the parent's.
@@ -1454,8 +1455,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
  * are working on Lists, so it's less messy to do the casts internally.
  */
 List *
-map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row)
 {
 	bool		my_found_whole_row = false;
@@ -1464,14 +1465,14 @@ map_partition_varattnos(List *expr, int target_varno,
 	{
 		AttrNumber *part_attnos;
 
-		part_attnos = convert_tuples_by_name_map(RelationGetDescr(partrel),
-												 RelationGetDescr(parent),
+		part_attnos = convert_tuples_by_name_map(RelationGetDescr(to_rel),
+												 RelationGetDescr(from_rel),
 												 gettext_noop("could not convert row type"));
 		expr = (List *) map_variable_attnos((Node *) expr,
-											target_varno, 0,
+											fromrel_varno, 0,
 											part_attnos,
-											RelationGetDescr(parent)->natts,
-											RelationGetForm(partrel)->reltype,
+											RelationGetDescr(from_rel)->natts,
+											RelationGetForm(to_rel)->reltype,
 											&my_found_whole_row);
 	}
 
@@ -2595,6 +2596,69 @@ get_partition_for_tuple(Relation relation, Datum *values, bool *isnull)
 }
 
 /*
+ * Checks if any of the 'attnums' is a partition key attribute for rel
+ *
+ * Sets *used_in_expr if any of the 'attnums' is found to be referenced in some
+ * partition key expression.  It's possible for a column to be both used
+ * directly and as part of an expression; if that happens, *used_in_expr may
+ * end up as either true or false.  That's OK for current uses of this
+ * function, because *used_in_expr is only used to tailor the error message
+ * text.
+ */
+bool
+has_partition_attrs(Relation rel, Bitmapset *attnums, bool *used_in_expr)
+{
+	PartitionKey key;
+	int			partnatts;
+	List	   *partexprs;
+	ListCell   *partexprs_item;
+	int			i;
+
+	if (attnums == NULL || rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		return false;
+
+	key = RelationGetPartitionKey(rel);
+	partnatts = get_partition_natts(key);
+	partexprs = get_partition_exprs(key);
+
+	partexprs_item = list_head(partexprs);
+	for (i = 0; i < partnatts; i++)
+	{
+		AttrNumber	partattno = get_partition_col_attnum(key, i);
+
+		if (partattno != 0)
+		{
+			if (bms_is_member(partattno - FirstLowInvalidHeapAttributeNumber,
+							  attnums))
+			{
+				if (used_in_expr)
+					*used_in_expr = false;
+				return true;
+			}
+		}
+		else
+		{
+			/* Arbitrary expression */
+			Node	   *expr = (Node *) lfirst(partexprs_item);
+			Bitmapset  *expr_attrs = NULL;
+
+			/* Find all attributes referenced */
+			pull_varattnos(expr, 1, &expr_attrs);
+			partexprs_item = lnext(partexprs_item);
+
+			if (bms_overlap(attnums, expr_attrs))
+			{
+				if (used_in_expr)
+					*used_in_expr = true;
+				return true;
+			}
+		}
+	}
+
+	return false;
+}
+
+/*
  * qsort_partition_hbound_cmp
  *
  * We sort hash bounds by modulus, then by remainder.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index d6b235c..39c2921 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -165,12 +165,9 @@ typedef struct CopyStateData
 	bool		volatile_defexprs;	/* is any of defexprs volatile? */
 	List	   *range_table;
 
-	PartitionDispatch *partition_dispatch_info;
-	int			num_dispatch;	/* Number of entries in the above array */
-	int			num_partitions; /* Number of members in the following arrays */
-	ResultRelInfo **partitions;	/* Per partition result relation pointers */
-	TupleConversionMap **partition_tupconv_maps;
-	TupleTableSlot *partition_tuple_slot;
+	PartitionTupleRouting *partition_tuple_routing; /* all tuple-routing info
+													 * for partitions.
+													 */
 	TransitionCaptureState *transition_capture;
 	TupleConversionMap **transition_tupconv_maps;
 
@@ -2471,27 +2468,16 @@ CopyFrom(CopyState cstate)
 	 */
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo **partitions;
-		TupleConversionMap **partition_tupconv_maps;
-		TupleTableSlot *partition_tuple_slot;
-		int			num_parted,
-					num_partitions;
+		PartitionTupleRouting *ptr;
 
 		ExecSetupPartitionTupleRouting(cstate->rel,
+									   NULL,
+									   0,
 									   1,
 									   estate,
-									   &partition_dispatch_info,
-									   &partitions,
-									   &partition_tupconv_maps,
-									   &partition_tuple_slot,
-									   &num_parted, &num_partitions);
-		cstate->partition_dispatch_info = partition_dispatch_info;
-		cstate->num_dispatch = num_parted;
-		cstate->partitions = partitions;
-		cstate->num_partitions = num_partitions;
-		cstate->partition_tupconv_maps = partition_tupconv_maps;
-		cstate->partition_tuple_slot = partition_tuple_slot;
+									   &cstate->partition_tuple_routing);
+
+		ptr = cstate->partition_tuple_routing;
 
 		/*
 		 * If we are capturing transition tuples, they may need to be
@@ -2504,11 +2490,11 @@ CopyFrom(CopyState cstate)
 			int			i;
 
 			cstate->transition_tupconv_maps = (TupleConversionMap **)
-				palloc0(sizeof(TupleConversionMap *) * cstate->num_partitions);
-			for (i = 0; i < cstate->num_partitions; ++i)
+				palloc0(sizeof(TupleConversionMap *) * ptr->num_partitions);
+			for (i = 0; i < ptr->num_partitions; ++i)
 			{
 				cstate->transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(cstate->partitions[i]->ri_RelationDesc),
+					convert_tuples_by_name(RelationGetDescr(ptr->partitions[i]->ri_RelationDesc),
 										   RelationGetDescr(cstate->rel),
 										   gettext_noop("could not convert row type"));
 			}
@@ -2528,7 +2514,7 @@ CopyFrom(CopyState cstate)
 	if ((resultRelInfo->ri_TrigDesc != NULL &&
 		 (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
 		  resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
-		cstate->partition_dispatch_info != NULL ||
+		cstate->partition_tuple_routing != NULL ||
 		cstate->volatile_defexprs)
 	{
 		useHeapMultiInsert = false;
@@ -2603,10 +2589,11 @@ CopyFrom(CopyState cstate)
 		ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 
 		/* Determine the partition to heap_insert the tuple into */
-		if (cstate->partition_dispatch_info)
+		if (cstate->partition_tuple_routing)
 		{
 			int			leaf_part_index;
 			TupleConversionMap *map;
+			PartitionTupleRouting *ptr = cstate->partition_tuple_routing;
 
 			/*
 			 * Away we go ... If we end up not finding a partition after all,
@@ -2617,11 +2604,11 @@ CopyFrom(CopyState cstate)
 			 * partition, respectively.
 			 */
 			leaf_part_index = ExecFindPartition(resultRelInfo,
-												cstate->partition_dispatch_info,
+												ptr->partition_dispatch_info,
 												slot,
 												estate);
 			Assert(leaf_part_index >= 0 &&
-				   leaf_part_index < cstate->num_partitions);
+				   leaf_part_index < ptr->num_partitions);
 
 			/*
 			 * If this tuple is mapped to a partition that is not same as the
@@ -2639,7 +2626,7 @@ CopyFrom(CopyState cstate)
 			 * to the selected partition.
 			 */
 			saved_resultRelInfo = resultRelInfo;
-			resultRelInfo = cstate->partitions[leaf_part_index];
+			resultRelInfo = ptr->partitions[leaf_part_index];
 
 			/* We do not yet have a way to insert into a foreign partition */
 			if (resultRelInfo->ri_FdwRoutine)
@@ -2686,7 +2673,7 @@ CopyFrom(CopyState cstate)
 			 * We might need to convert from the parent rowtype to the
 			 * partition rowtype.
 			 */
-			map = cstate->partition_tupconv_maps[leaf_part_index];
+			map = ptr->parentchild_tupconv_maps[leaf_part_index];
 			if (map)
 			{
 				Relation	partrel = resultRelInfo->ri_RelationDesc;
@@ -2698,7 +2685,7 @@ CopyFrom(CopyState cstate)
 				 * point on.  Use a dedicated slot from this point on until
 				 * we're finished dealing with the partition.
 				 */
-				slot = cstate->partition_tuple_slot;
+				slot = ptr->partition_tuple_slot;
 				Assert(slot != NULL);
 				ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
 				ExecStoreTuple(tuple, slot, InvalidBuffer, true);
@@ -2749,7 +2736,7 @@ CopyFrom(CopyState cstate)
 
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr || check_partition_constr)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
@@ -2850,8 +2837,9 @@ CopyFrom(CopyState cstate)
 	ExecCloseIndices(resultRelInfo);
 
 	/* Close all the partitioned tables, leaf partitions, and their indices */
-	if (cstate->partition_dispatch_info)
+	if (cstate->partition_tuple_routing)
 	{
+		PartitionTupleRouting *ptr = cstate->partition_tuple_routing;
 		int			i;
 
 		/*
@@ -2860,23 +2848,23 @@ CopyFrom(CopyState cstate)
 		 * the main target table of COPY that will be closed eventually by
 		 * DoCopy().  Also, tupslot is NULL for the root partitioned table.
 		 */
-		for (i = 1; i < cstate->num_dispatch; i++)
+		for (i = 1; i < ptr->num_dispatch; i++)
 		{
-			PartitionDispatch pd = cstate->partition_dispatch_info[i];
+			PartitionDispatch pd = ptr->partition_dispatch_info[i];
 
 			heap_close(pd->reldesc, NoLock);
 			ExecDropSingleTupleTableSlot(pd->tupslot);
 		}
-		for (i = 0; i < cstate->num_partitions; i++)
+		for (i = 0; i < ptr->num_partitions; i++)
 		{
-			ResultRelInfo *resultRelInfo = cstate->partitions[i];
+			ResultRelInfo *resultRelInfo = ptr->partitions[i];
 
 			ExecCloseIndices(resultRelInfo);
 			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 		}
 
 		/* Release the standalone partition tuple descriptor */
-		ExecDropSingleTupleTableSlot(cstate->partition_tuple_slot);
+		ExecDropSingleTupleTableSlot(ptr->partition_tuple_slot);
 	}
 
 	/* Close any trigger target relations */
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index d979ce2..64c2185 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -468,7 +468,6 @@ static void RangeVarCallbackForDropRelation(const RangeVar *rel, Oid relOid,
 								Oid oldRelOid, void *arg);
 static void RangeVarCallbackForAlterRelation(const RangeVar *rv, Oid relid,
 								 Oid oldrelid, void *arg);
-static bool is_partition_attr(Relation rel, AttrNumber attnum, bool *used_in_expr);
 static PartitionSpec *transformPartitionSpec(Relation rel, PartitionSpec *partspec, char *strategy);
 static void ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs,
 					  List **partexprs, Oid *partopclass, Oid *partcollation, char strategy);
@@ -6492,68 +6491,6 @@ ATPrepDropColumn(List **wqueue, Relation rel, bool recurse, bool recursing,
 }
 
 /*
- * Checks if attnum is a partition attribute for rel
- *
- * Sets *used_in_expr if attnum is found to be referenced in some partition
- * key expression.  It's possible for a column to be both used directly and
- * as part of an expression; if that happens, *used_in_expr may end up as
- * either true or false.  That's OK for current uses of this function, because
- * *used_in_expr is only used to tailor the error message text.
- */
-static bool
-is_partition_attr(Relation rel, AttrNumber attnum, bool *used_in_expr)
-{
-	PartitionKey key;
-	int			partnatts;
-	List	   *partexprs;
-	ListCell   *partexprs_item;
-	int			i;
-
-	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
-		return false;
-
-	key = RelationGetPartitionKey(rel);
-	partnatts = get_partition_natts(key);
-	partexprs = get_partition_exprs(key);
-
-	partexprs_item = list_head(partexprs);
-	for (i = 0; i < partnatts; i++)
-	{
-		AttrNumber	partattno = get_partition_col_attnum(key, i);
-
-		if (partattno != 0)
-		{
-			if (attnum == partattno)
-			{
-				if (used_in_expr)
-					*used_in_expr = false;
-				return true;
-			}
-		}
-		else
-		{
-			/* Arbitrary expression */
-			Node	   *expr = (Node *) lfirst(partexprs_item);
-			Bitmapset  *expr_attrs = NULL;
-
-			/* Find all attributes referenced */
-			pull_varattnos(expr, 1, &expr_attrs);
-			partexprs_item = lnext(partexprs_item);
-
-			if (bms_is_member(attnum - FirstLowInvalidHeapAttributeNumber,
-							  expr_attrs))
-			{
-				if (used_in_expr)
-					*used_in_expr = true;
-				return true;
-			}
-		}
-	}
-
-	return false;
-}
-
-/*
  * Return value is the address of the dropped column.
  */
 static ObjectAddress
@@ -6613,7 +6550,9 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
 						colName)));
 
 	/* Don't drop columns used in the partition key */
-	if (is_partition_attr(rel, attnum, &is_expr))
+	if (has_partition_attrs(rel,
+							bms_make_singleton(attnum - FirstLowInvalidHeapAttributeNumber),
+							&is_expr))
 	{
 		if (!is_expr)
 			ereport(ERROR,
@@ -8837,7 +8776,9 @@ ATPrepAlterColumnType(List **wqueue,
 						colName)));
 
 	/* Don't alter columns used in the partition key */
-	if (is_partition_attr(rel, attnum, &is_expr))
+	if (has_partition_attrs(rel,
+							bms_make_singleton(attnum - FirstLowInvalidHeapAttributeNumber),
+							&is_expr))
 	{
 		if (!is_expr)
 			ereport(ERROR,
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 92ae382..73ec872 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2854,8 +2854,13 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 	{
 		HeapTuple	trigtuple;
 
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
+		/*
+		 * Note: if the UPDATE is converted into a DELETE+INSERT as part of an
+		 * update-partition-key operation, then this function is also called
+		 * separately for the DELETE and the INSERT, to capture transition
+		 * table rows.  In that case, either the old tuple or the new tuple
+		 * can be NULL.
+		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
 			trigtuple = GetTupleForTrigger(estate,
 										   NULL,
 										   relinfo,
@@ -5414,7 +5419,12 @@ AfterTriggerPendingOnRel(Oid relid)
  *	triggers actually need to be queued.  It is also called after each row,
  *	even if there are no triggers for that event, if there are any AFTER
  *	STATEMENT triggers for the statement which use transition tables, so that
- *	the transition tuplestores can be built.
+ *	the transition tuplestores can be built.  Furthermore, if the transition
+ *	capture is happening for UPDATEd rows being moved to another partition due
+ *	partition-key change, then this function is called once when the row is
+ *	to a partition-key change, then this function is called once when the row is
+ *	partition (to capture NEW row).  This is done separately because DELETE and
+ *	INSERT happen on different tables.
  *
  *	Transition tuplestores are built now, rather than when events are pulled
  *	off of the queue because AFTER ROW triggers are allowed to select from the
@@ -5463,12 +5473,25 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		bool		update_new_table = transition_capture->tcs_update_new_table;
 		bool		insert_new_table = transition_capture->tcs_insert_new_table;;
 
-		if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_old_table))
+		/*
+		 * For INSERT events newtup should be non-NULL, for DELETE events
+		 * oldtup should be non-NULL, whereas for UPDATE events normally both
+		 * oldtup and newtup are non-NULL.  But for an UPDATE event fired for
+		 * capturing transition tuples during UPDATE partition-key row
+		 * movement, oldtup is NULL when the event is for the row being inserted,
+		 * whereas newtup is NULL when the event is for the row being deleted.
+		 */
+		Assert(!(event == TRIGGER_EVENT_DELETE && delete_old_table &&
+				 oldtup == NULL));
+		Assert(!(event == TRIGGER_EVENT_INSERT && insert_new_table &&
+				 newtup == NULL));
+
+		if (oldtup != NULL &&
+			((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_old_table)))
 		{
 			Tuplestorestate *old_tuplestore;
 
-			Assert(oldtup != NULL);
 			old_tuplestore = transition_capture->tcs_private->old_tuplestore;
 
 			if (map != NULL)
@@ -5481,12 +5504,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			else
 				tuplestore_puttuple(old_tuplestore, oldtup);
 		}
-		if ((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_new_table))
+		if (newtup != NULL &&
+			((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
+			(event == TRIGGER_EVENT_UPDATE && update_new_table)))
 		{
 			Tuplestorestate *new_tuplestore;
 
-			Assert(newtup != NULL);
 			new_tuplestore = transition_capture->tcs_private->new_tuplestore;
 
 			if (original_insert_tuple != NULL)
@@ -5502,11 +5525,17 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				tuplestore_puttuple(new_tuplestore, newtup);
 		}
 
-		/* If transition tables are the only reason we're here, return. */
+		/*
+		 * If transition tables are the only reason we're here, return. As
+		 * mentioned above, we can also be here during update tuple routing in
+		 * the presence of transition tables, in which case this function is
+		 * called separately for oldtup and newtup; either can be NULL, not both.
+		 */
 		if (trigdesc == NULL ||
 			(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
 			(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+			(event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
 			return;
 	}
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index dbaa47f..5ec92d5 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1849,16 +1849,12 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  *
  * Exported in executor.h for outside use.
+ * Returns true if it meets the partition constraint, else returns false.
  */
-void
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
-	Relation	rel = resultRelInfo->ri_RelationDesc;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	Bitmapset  *modifiedCols;
-	Bitmapset  *insertedCols;
-	Bitmapset  *updatedCols;
 	ExprContext *econtext;
 
 	/*
@@ -1886,52 +1882,69 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
 	 */
-	if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+	return ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext);
+}
+
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
+	/*
+	 * Need to first convert the tuple to the root partitioned table's row
+	 * type. For details, check similar comments in ExecConstraints().
+	 */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
+									 gettext_noop("could not convert row type"));
+		if (map != NULL)
 		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
-										 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecSetSlotDescriptor(slot, tupdesc);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+			tuple = do_convert_tuple(tuple, map);
+			ExecSetSlotDescriptor(slot, tupdesc);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("new row for relation \"%s\" violates partition constraint",
-						RelationGetRelationName(orig_rel)),
-				 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+			 errmsg("new row for relation \"%s\" violates partition constraint",
+					RelationGetRelationName(orig_rel)),
+			 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if
+ * requested, checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1939,7 +1952,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2055,8 +2069,9 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 }
 
 
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 2fc411a..180798f 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -41,54 +41,91 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
  * ExecSetupPartitionTupleRouting - set up information needed during
  * tuple routing for partitioned tables
  *
+ * 'update_rri' contains the UPDATE per-subplan result rels. For the output
+ *		param 'partitions', we don't allocate new ResultRelInfo objects for
+ *		leaf partitions that are already present in 'update_rri'.
+ *
+ * 'num_update_rri' is the number of elements in the 'update_rri' array, or
+ *		zero for INSERT.
+ *
  * Output arguments:
- * 'pd' receives an array of PartitionDispatch objects with one entry for
- *		every partitioned table in the partition tree
- * 'partitions' receives an array of ResultRelInfo* objects with one entry for
- *		every leaf partition in the partition tree
- * 'tup_conv_maps' receives an array of TupleConversionMap objects with one
- *		entry for every leaf partition (required to convert input tuple based
- *		on the root table's rowtype to a leaf partition's rowtype after tuple
- *		routing is done)
- * 'partition_tuple_slot' receives a standalone TupleTableSlot to be used
- *		to manipulate any given leaf partition's rowtype after that partition
- *		is chosen by tuple-routing.
- * 'num_parted' receives the number of partitioned tables in the partition
- *		tree (= the number of entries in the 'pd' output array)
- * 'num_partitions' receives the number of leaf partitions in the partition
- *		tree (= the number of entries in the 'partitions' and 'tup_conv_maps'
- *		output arrays
+ *
+ * 'partition_tuple_routing' encapsulates all the partition related information
+ *		required to do tuple routing.
  *
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  */
 void
 ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
-							   PartitionDispatch **pd,
-							   ResultRelInfo ***partitions,
-							   TupleConversionMap ***tup_conv_maps,
-							   TupleTableSlot **partition_tuple_slot,
-							   int *num_parted, int *num_partitions)
+							   PartitionTupleRouting **partition_tuple_routing)
 {
 	TupleDesc	tupDesc = RelationGetDescr(rel);
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
+	ResultRelInfo *leaf_part_arr = NULL;
+	int			update_rri_index = 0;
+	bool		is_update = (num_update_rri > 0);
+	PartitionTupleRouting *ptr;
 
 	/*
 	 * Get the information about the partition tree after locking all the
 	 * partitions.
 	 */
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-	*pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
-	*num_partitions = list_length(leaf_parts);
-	*partitions = (ResultRelInfo **) palloc(*num_partitions *
+	ptr = *partition_tuple_routing =
+		(PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
+	ptr->partition_dispatch_info =
+		RelationGetPartitionDispatchInfo(rel, &ptr->num_dispatch, &leaf_parts);
+	ptr->num_partitions = list_length(leaf_parts);
+	ptr->partitions = (ResultRelInfo **) palloc(ptr->num_partitions *
 											sizeof(ResultRelInfo *));
-	*tup_conv_maps = (TupleConversionMap **) palloc0(*num_partitions *
-													 sizeof(TupleConversionMap *));
+	ptr->parentchild_tupconv_maps =
+		(TupleConversionMap **) palloc0(ptr->num_partitions *
+										sizeof(TupleConversionMap *));
+
+	if (is_update)
+	{
+		/*
+		 * For updates, if a leaf partition is already present in the
+		 * per-subplan result rels, we reuse that instead of initializing a
+		 * new result rel.  The per-subplan result rels and the leaf partition
+		 * result rels are in the same canonical order, so a single forward
+		 * scan suffices: update_rri_index tracks the next per-subplan result
+		 * rel to look for, and is advanced each time we find it while
+		 * scanning the leaf partition OIDs.
+		 */
+		update_rri_index = 0;
+
+		/*
+		 * Prepare for generating the mapping from subplan result rels to leaf
+		 * partition position.
+		 */
+		ptr->subplan_partition_offsets = palloc(num_update_rri * sizeof(int));
+
+		/*
+		 * For UPDATEs, we require an additional tuple slot for storing
+		 * transient tuples that are converted to the root table descriptor.
+		 */
+		ptr->root_tuple_slot = MakeTupleTableSlot();
+	}
+	else
+	{
+		/*
+		 * For inserts, we need to create all new result rels, so avoid
+		 * repeated pallocs by allocating memory for all the result rels in
+		 * bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(ptr->num_partitions *
+												  sizeof(ResultRelInfo));
+	}
 
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
@@ -96,39 +133,82 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	 * (such as ModifyTableState) and released when the node finishes
 	 * processing.
 	 */
-	*partition_tuple_slot = MakeTupleTableSlot();
+	ptr->partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = (ResultRelInfo *) palloc0(*num_partitions *
-											  sizeof(ResultRelInfo));
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (is_update)
+		{
+			/* Is this leaf partition present in the update resultrel? */
+			if (update_rri_index < num_update_rri &&
+				RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
+			{
+				leaf_part_rri = &update_rri[update_rri_index];
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required when we convert the partition's tuple to be
+				 * compatible with the root partitioned table's tuple
+				 * descriptor.  When generating the per-subplan UPDATE result
+				 * rels, this was not set.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+
+				/*
+				 * Save the position of this update rel in the leaf partitions
+				 * array
+				 */
+				ptr->subplan_partition_offsets[update_rri_index] = i;
+
+				update_rri_index++;
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = leaf_part_arr + i;
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in *partitions are eventually
-		 * closed by the caller.
+		 * If we didn't open the partition rel, it means we haven't initialized
+		 * the result rel either.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above including the leaf
+			 * partitions. Note that each of the newly opened relations in
+			 * *partitions are eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri,
+							  partrel,
+							  resultRTindex,
+							  rel,
+							  estate->es_instrument);
+		}
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
 		 * Save a tuple conversion map to convert a tuple routed to this
 		 * partition from the parent's type to the partition's.
 		 */
-		(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
+		ptr->parentchild_tupconv_maps[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
 													 gettext_noop("could not convert row type"));
 
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  resultRTindex,
-						  rel,
-						  estate->es_instrument);
-
 		/*
-		 * Verify result relation is a valid target for INSERT.
+		 * Verify the result relation is a valid target for an INSERT.  Even
+		 * for an UPDATE, tuple routing ends in an insert into the chosen
+		 * partition, so the relation must be checked as an INSERT target.
 		 */
 		CheckValidResultRel(leaf_part_rri, CMD_INSERT);
 
@@ -144,9 +224,15 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		(*partitions)[i] = leaf_part_rri++;
+		ptr->partitions[i] = leaf_part_rri;
 		i++;
 	}
+
+	/*
+	 * For UPDATE, we should have found all the per-subplan resultrels in the
+	 * leaf partitions.
+	 */
+	Assert(!is_update || update_rri_index == num_update_rri);
 }
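As an aside for reviewers, the canonical-order matching between the per-subplan result rels and the full leaf-partition list boils down to a single forward scan. Here is a minimal standalone sketch of that logic (plain C, not part of the patch; names are hypothetical and OIDs are replaced by plain integers):

```c
#include <assert.h>

/*
 * Standalone sketch (not PostgreSQL code) of the matching loop in
 * ExecSetupPartitionTupleRouting(): the per-subplan rel OIDs are a
 * subsequence of the leaf-partition OID list, both in canonical order,
 * so one forward scan matches them and records each subplan's offset
 * into the leaf array (the patch's subplan_partition_offsets).
 */
static int
match_subplans_to_leaves(const unsigned *leaf_oids, int num_leaves,
                         const unsigned *update_oids, int num_update,
                         int *offsets)
{
    int         next = 0;       /* next per-subplan rel to look for */
    int         i;

    for (i = 0; i < num_leaves; i++)
    {
        if (next < num_update && update_oids[next] == leaf_oids[i])
            offsets[next++] = i;
    }
    return next;                /* equals num_update iff all matched */
}
```

With leaf OIDs {10, 20, 30, 40, 50} and subplan OIDs {20, 40}, this fills offsets with {1, 3}, which is exactly the shape of lookup the patch later relies on in tupconv_map_for_subplan().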
 
 /*
@@ -177,8 +263,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 	/* start with the root partitioned table */
 	parent = pd[0];
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index fb538c0..e11f7cb 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -401,7 +401,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -466,7 +466,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 201c607..919b32d 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -46,6 +46,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -63,7 +64,16 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 EState *estate,
 					 bool canSetTag,
 					 TupleTableSlot **returning);
-
+static void ExecSetupChildParentMap(ModifyTableState *mtstate,
+						ResultRelInfo *rootRelInfo,
+						int numResultRelInfos, bool perleaf);
+static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
+													int whichplan);
+static HeapTuple ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+										   TupleConversionMap *map,
+										   HeapTuple tuple,
+										   TupleTableSlot *new_slot,
+										   TupleTableSlot **p_my_slot);
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
  * target relation's rowtype
@@ -241,6 +251,38 @@ ExecCheckTIDVisible(EState *estate,
 	ReleaseBuffer(buffer);
 }
 
+/*
+ * ConvertPartitionTupleSlot -- convenience function for tuple conversion.
+ * The tuple, if converted using 'map', is stored in 'new_slot', and
+ * '*p_my_slot' is updated to point to 'new_slot'.  'new_slot' should
+ * typically be one of the dedicated partition tuple slots.
+ *
+ * Returns the converted tuple, unless 'map' is NULL, in which case the
+ * original tuple is returned unmodified and '*p_my_slot' is left unchanged.
+ */
+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+						  TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot,
+						  TupleTableSlot **p_my_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor, as per converted tuple.
+	 */
+	*p_my_slot = new_slot;
+	Assert(new_slot != NULL);
+	ExecSetSlotDescriptor(new_slot, map->outdesc);
+	ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
+
+	return tuple;
+}
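For anyone less familiar with the tuple conversion machinery this helper wraps, a toy sketch of the underlying idea may help (plain C, not part of the patch; this is not the real convert_tuples_by_name(), and all names here are made up):

```c
#include <assert.h>
#include <string.h>

/*
 * Toy sketch of what a tuple conversion map conceptually holds: for
 * each output column, the index of the same-named input column, or -1
 * if it is absent.  The real TupleConversionMap built by
 * convert_tuples_by_name() additionally carries the descriptors and a
 * reusable values/isnull workspace.
 */
static void
build_attr_map(const char *const *indesc, int innatts,
               const char *const *outdesc, int outnatts,
               int *attrmap)
{
    int         o,
                i;

    for (o = 0; o < outnatts; o++)
    {
        attrmap[o] = -1;
        for (i = 0; i < innatts; i++)
        {
            if (strcmp(outdesc[o], indesc[i]) == 0)
            {
                attrmap[o] = i;
                break;
            }
        }
    }
}
```

This by-name mapping is why a partition whose columns are merely in a different physical order than the parent's still round-trips correctly through ConvertPartitionTupleSlot().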
+
 /* ----------------------------------------------------------------
  *		ExecInsert
  *
@@ -266,6 +308,9 @@ ExecInsert(ModifyTableState *mtstate,
 	Oid			newId;
 	List	   *recheckIndexes = NIL;
 	TupleTableSlot *result = NULL;
+	TransitionCaptureState *transition_capture;
+
+	transition_capture = mtstate->mt_transition_capture;
 
 	/*
 	 * get the heap tuple out of the tuple table slot, making sure we have a
@@ -279,32 +324,32 @@ ExecInsert(ModifyTableState *mtstate,
 	resultRelInfo = estate->es_result_relation_info;
 
 	/* Determine the partition to heap_insert the tuple into */
-	if (mtstate->mt_partition_dispatch_info)
+	if (mtstate->mt_partition_tuple_routing)
 	{
 		int			leaf_part_index;
-		TupleConversionMap *map;
+		PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
 		 * ExecFindPartition() does not return and errors out instead.
 		 * Otherwise, the returned value is to be used as an index into arrays
-		 * mt_partitions[] and mt_partition_tupconv_maps[] that will get us
-		 * the ResultRelInfo and TupleConversionMap for the partition,
+		 * ptr->partitions[] and ptr->parentchild_tupconv_maps[] that will get
+		 * us the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
 		leaf_part_index = ExecFindPartition(resultRelInfo,
-											mtstate->mt_partition_dispatch_info,
+											ptr->partition_dispatch_info,
 											slot,
 											estate);
 		Assert(leaf_part_index >= 0 &&
-			   leaf_part_index < mtstate->mt_num_partitions);
+			   leaf_part_index < ptr->num_partitions);
 
 		/*
 		 * Save the old ResultRelInfo and switch to the one corresponding to
 		 * the selected partition.
 		 */
 		saved_resultRelInfo = resultRelInfo;
-		resultRelInfo = mtstate->mt_partitions[leaf_part_index];
+		resultRelInfo = ptr->partitions[leaf_part_index];
 
 		/* We do not yet have a way to insert into a foreign partition */
 		if (resultRelInfo->ri_FdwRoutine)
@@ -331,8 +376,10 @@ ExecInsert(ModifyTableState *mtstate,
 				 * back to tuplestore format.
 				 */
 				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+
+				Assert(mtstate->mt_is_tupconv_perpart == true);
 				mtstate->mt_transition_capture->tcs_map =
-					mtstate->mt_transition_tupconv_maps[leaf_part_index];
+					mtstate->mt_childparent_tupconv_maps[leaf_part_index];
 			}
 			else
 			{
@@ -345,30 +392,21 @@ ExecInsert(ModifyTableState *mtstate,
 			}
 		}
 		if (mtstate->mt_oc_transition_capture != NULL)
+		{
+			Assert(mtstate->mt_is_tupconv_perpart == true);
 			mtstate->mt_oc_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[leaf_part_index];
+				mtstate->mt_childparent_tupconv_maps[leaf_part_index];
+		}
 
 		/*
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = mtstate->mt_partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = mtstate->mt_partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(mtstate,
+										  ptr->parentchild_tupconv_maps[leaf_part_index],
+										  tuple,
+										  ptr->partition_tuple_slot,
+										  &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -486,7 +524,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 		/* Check the constraints of the tuple */
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -622,9 +660,32 @@ ExecInsert(ModifyTableState *mtstate,
 		setLastTid(&(tuple->t_self));
 	}
 
+	/*
+	 * If this INSERT is part of a partition-key UPDATE and we are capturing
+	 * transition tables, put this row into the transition NEW TABLE.
+	 * (Similarly, the deleted row is added to the OLD TABLE.)  This has to
+	 * be done separately for the DELETE and the INSERT because they happen
+	 * on different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_new_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo, NULL,
+							 NULL,
+							 tuple,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * Now that we have captured the NEW TABLE row, make sure any AR
+		 * INSERT trigger fired below does not capture it again.
+		 */
+		transition_capture = NULL;
+	}
+
 	/* AFTER ROW INSERT Triggers */
 	ExecARInsertTriggers(estate, resultRelInfo, tuple, recheckIndexes,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
 	list_free(recheckIndexes);
 
@@ -678,6 +739,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool *tuple_deleted,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -685,6 +748,12 @@ ExecDelete(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
+	TransitionCaptureState *transition_capture;
+
+	transition_capture = mtstate->mt_transition_capture;
+
+	if (tuple_deleted)
+		*tuple_deleted = false;
 
 	/*
 	 * get information on the (current) result relation
@@ -849,12 +918,39 @@ ldelete:;
 	if (canSetTag)
 		(estate->es_processed)++;
 
+	/* The delete has actually happened, so inform that to the caller */
+	if (tuple_deleted)
+		*tuple_deleted = true;
+
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition OLD TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_old_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo,
+							 tupleid,
+							 oldtuple,
+							 NULL,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * Now that we have captured the OLD TABLE row, make sure any AR
+		 * DELETE trigger fired below does not capture it again.
+		 */
+		transition_capture = NULL;
+	}
+
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -947,6 +1043,7 @@ ExecUpdate(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	TupleConversionMap *saved_tcs_map = NULL;
 
 	/*
 	 * abort the operation if not running transactions
@@ -1043,12 +1140,117 @@ lreplace:;
 								 resultRelInfo, slot, estate);
 
 		/*
+		 * If a partition check fails, try to move the row into the right
+		 * partition.
+		 */
+		if (resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate))
+		{
+			bool		tuple_deleted;
+			TupleTableSlot *ret_slot;
+			PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;
+			int			map_index;
+			TupleConversionMap *tupconv_map;
+
+			/*
+			 * When an UPDATE is run directly on a leaf partition, partition
+			 * tuple routing is not set up.  In that case, fail with a
+			 * partition constraint violation error.
+			 */
+			if (ptr == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for the DELETE; the rows to return
+			 * come from the subsequent INSERT.
+			 */
+			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
+					   &tuple_deleted, false, false);
+
+			/*
+			 * If for some reason the DELETE didn't happen (e.g. a trigger
+			 * prevented it, or the row was already deleted by this command or
+			 * concurrently by another transaction), then we must skip the
+			 * INSERT as well; otherwise we would effectively insert a brand
+			 * new row.
+			 *
+			 * For a normal UPDATE, the case where the tuple has been the
+			 * subject of a concurrent UPDATE or DELETE would be handled by
+			 * the EvalPlanQual machinery, but for an UPDATE that we've
+			 * translated into a DELETE from this partition and an INSERT into
+			 * some other partition, that's not available, because CTID chains
+			 * can't span relation boundaries.  We mimic the semantics to a
+			 * limited extent by skipping the INSERT if the DELETE fails to
+			 * find a tuple. This ensures that two concurrent attempts to
+			 * UPDATE the same tuple at the same time can't turn one tuple
+			 * into two, and that an UPDATE of a just-deleted tuple can't
+			 * resurrect it.
+			 */
+			if (!tuple_deleted)
+				return NULL;
+
+			/*
+			 * UPDATEs set the transition capture map only when a new subplan
+			 * is chosen.  But for INSERTs, it is set for each row.  So after
+			 * the INSERT, we need to revert to the map created for the
+			 * UPDATE; otherwise the next UPDATE will incorrectly use the one
+			 * created for the INSERT.  So first save the one created for
+			 * UPDATE.
+			 */
+			if (mtstate->mt_transition_capture)
+				saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
+
+			/*
+			 * resultRelInfo is one of the per-subplan resultRelInfos.  So we
+			 * should convert the tuple into root's tuple descriptor, since
+			 * ExecInsert() starts the search from root.  The tuple conversion
+			 * map list is in the order of mtstate->resultRelInfo[], so to
+			 * retrieve the one for this resultRel, we need to know the
+			 * position of the resultRel in mtstate->resultRelInfo[].
+			 */
+			map_index = resultRelInfo - mtstate->resultRelInfo;
+			Assert(map_index >= 0 && map_index < mtstate->mt_nplans);
+			tupconv_map = tupconv_map_for_subplan(mtstate, map_index);
+			tuple = ConvertPartitionTupleSlot(mtstate,
+											  tupconv_map,
+											  tuple,
+											  ptr->root_tuple_slot,
+											  &slot);
+
+			/*
+			 * For ExecInsert(), make it look like we are inserting into the
+			 * root.
+			 */
+			Assert(mtstate->rootResultRelInfo != NULL);
+			estate->es_result_relation_info = mtstate->rootResultRelInfo;
+
+			ret_slot = ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+
+			/*
+			 * Revert back the active result relation and the active transition
+			 * capture map that we changed above.
+			 */
+			estate->es_result_relation_info = resultRelInfo;
+			if (mtstate->mt_transition_capture)
+			{
+				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+				mtstate->mt_transition_capture->tcs_map = saved_tcs_map;
+			}
+			return ret_slot;
+		}
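To make the delete-then-insert semantics above easier to reason about, here is a toy model of the control flow (plain C, not part of the patch; partitions are reduced to integer arrays with a made-up "< 100 / >= 100" constraint):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of update row movement: an update whose new value violates
 * the old partition's constraint becomes a delete plus a routed insert,
 * and the insert is skipped when the delete finds nothing (mimicking a
 * concurrent delete), so one row can never become two.
 */
#define PART_CAP 8

typedef struct
{
    int         vals[PART_CAP];
    int         n;
} Part;

static bool
part_delete(Part *p, int val)
{
    int         i;

    for (i = 0; i < p->n; i++)
    {
        if (p->vals[i] == val)
        {
            p->vals[i] = p->vals[--p->n];
            return true;
        }
    }
    return false;               /* row already gone */
}

static void
part_insert(Part *lo, Part *hi, int val)
{
    Part       *target = (val < 100) ? lo : hi; /* "tuple routing" */

    target->vals[target->n++] = val;
}

/* UPDATE oldval -> newval on a row expected to live in 'lo' */
static bool
update_row(Part *lo, Part *hi, int oldval, int newval)
{
    int         i;

    if (newval < 100)           /* still satisfies lo's constraint */
    {
        for (i = 0; i < lo->n; i++)
        {
            if (lo->vals[i] == oldval)
            {
                lo->vals[i] = newval;
                return true;
            }
        }
        return false;
    }

    /* row movement: delete first, insert only if the delete happened */
    if (!part_delete(lo, oldval))
        return false;
    part_insert(lo, hi, newval);
    return true;
}
```

Running two identical "move 50 to 150" updates back to back shows the guarantee: the first moves the row across partitions, the second finds nothing to delete and therefore inserts nothing.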
+
+		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We've already checked the partition constraint above; however, we
+		 * must still ensure the tuple passes all other constraints, so we will
+		 * call ExecConstraints() and have it validate all remaining checks.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1476,7 +1678,6 @@ static void
 ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 {
 	ResultRelInfo *targetRelInfo = getASTriggerResultRelInfo(mtstate);
-	int			i;
 
 	/* Check for transition tables on the directly targeted relation. */
 	mtstate->mt_transition_capture =
@@ -1500,60 +1701,148 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 		mtstate->mt_oc_transition_capture != NULL)
 	{
 		int			numResultRelInfos;
+		PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;
 
-		numResultRelInfos = (mtstate->mt_partition_tuple_slot != NULL ?
-							 mtstate->mt_num_partitions :
+		numResultRelInfos = (ptr != NULL ?
+							 ptr->num_partitions :
 							 mtstate->mt_nplans);
 
+		ExecSetupChildParentMap(mtstate, targetRelInfo, numResultRelInfos,
+								(ptr != NULL));
+
 		/*
-		 * Build array of conversion maps from each child's TupleDesc to the
-		 * one used in the tuplestore.  The map pointers may be NULL when no
-		 * conversion is necessary, which is hopefully a common case for
-		 * partitions.
+		 * Install the conversion map for the first plan for UPDATE and DELETE
+		 * operations.  It will be advanced each time we switch to the next
+		 * plan.  (INSERT operations set it every time, so we need not update
+		 * mtstate->mt_oc_transition_capture here.)
 		 */
-		mtstate->mt_transition_tupconv_maps = (TupleConversionMap **)
-			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+		if (mtstate->mt_transition_capture && mtstate->operation != CMD_INSERT)
+			mtstate->mt_transition_capture->tcs_map =
+				tupconv_map_for_subplan(mtstate, 0);
+	}
+}
 
-		/* Choose the right set of partitions */
-		if (mtstate->mt_partition_dispatch_info != NULL)
-		{
-			/*
-			 * For tuple routing among partitions, we need TupleDescs based
-			 * on the partition routing table.
-			 */
-			ResultRelInfo **resultRelInfos = mtstate->mt_partitions;
+/*
+ * Initialize the child-to-root tuple conversion map array.
+ *
+ * This map array is required for two purposes:
+ * 1. For update-tuple-routing. We need to convert the tuple from the subplan
+ * result rel to the root partitioned table descriptor.
+ * 2. For capturing transition tables that are partitions. For UPDATEs, we need
+ * to convert the tuple from subplan result rel to target table descriptor,
+ * and for INSERTs, we need to convert the inserted tuple from leaf partition
+ * to the target table descriptor.
+ *
+ * The caller can request either a per-subplan map or per-leaf-partition map.
+ */
+static void
+ExecSetupChildParentMap(ModifyTableState *mtstate,
+						ResultRelInfo *rootRelInfo,
+						int numResultRelInfos, bool perleaf)
+{
+	TupleDesc	outdesc;
+	int			i;
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
-		}
-		else
-		{
-			/* Otherwise we need the ResultRelInfo for each subplan. */
-			ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+	/* First check if there is already one */
+	if (mtstate->mt_childparent_tupconv_maps)
+	{
+		/*
+		 * If a per-leaf map is requested, any previously created map must
+		 * already be per-leaf: a per-subplan map cannot be accessed
+		 * leaf-partition-wise, whereas a per-leaf map can still be accessed
+		 * subplan-wise through subplan_partition_offsets via
+		 * tupconv_map_for_subplan().  Hence, callers that may need both
+		 * kinds of access must make sure that the first call to this
+		 * function passes perleaf=true, so that the map created is per-leaf
+		 * rather than per-subplan.
+		 */
+		Assert(!(perleaf && !mtstate->mt_is_tupconv_perpart));
+		return;
+	}
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
+	/* Get tuple descriptor of the root partitioned table. */
+	outdesc = RelationGetDescr(rootRelInfo->ri_RelationDesc);
+
+	/*
+	 * Build array of conversion maps from each child's TupleDesc to the
+	 * one used in the tuplestore.  The map pointers may be NULL when no
+	 * conversion is necessary, which is hopefully a common case for
+	 * partitions.
+	 */
+	mtstate->mt_childparent_tupconv_maps = (TupleConversionMap **)
+		palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+
+	/* Choose the right set of partitions */
+	if (perleaf)
+	{
+		/*
+		 * For tuple routing among partitions, we need TupleDescs based
+		 * on the partition routing table.
+		 */
+		ResultRelInfo **resultRelInfos;
+
+		Assert(mtstate->mt_partition_tuple_routing != NULL);
+		resultRelInfos = mtstate->mt_partition_tuple_routing->partitions;
+
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_childparent_tupconv_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
+									   outdesc,
+									   gettext_noop("could not convert row type"));
 		}
 
 		/*
-		 * Install the conversion map for the first plan for UPDATE and DELETE
-		 * operations.  It will be advanced each time we switch to the next
-		 * plan.  (INSERT operations set it every time, so we need not update
-		 * mtstate->mt_oc_transition_capture here.)
+		 * Remember that the tuple conversion maps are per-leaf, not
+		 * per-subplan.
 		 */
-		if (mtstate->mt_transition_capture)
-			mtstate->mt_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[0];
+		mtstate->mt_is_tupconv_perpart = true;
+	}
+	else
+	{
+		/* Otherwise we need the ResultRelInfo for each subplan. */
+		ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_childparent_tupconv_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
+									   outdesc,
+									   gettext_noop("could not convert row type"));
+		}
+	}
+}
+
+/*
+ * For a given subplan index, get the tuple conversion map.
+ */
+static TupleConversionMap *
+tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
+{
+	Assert(mtstate->mt_childparent_tupconv_maps != NULL);
+
+	/*
+	 * If the tuple conversion map array is per-partition, we must first
+	 * translate the subplan index into an index into the partition array.
+	 */
+	if (mtstate->mt_is_tupconv_perpart)
+	{
+		int leaf_index;
+		PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;
+
+		Assert(ptr && ptr->subplan_partition_offsets != NULL);
+		leaf_index = ptr->subplan_partition_offsets[whichplan];
+
+		Assert(leaf_index >= 0 && leaf_index < ptr->num_partitions);
+		return mtstate->mt_childparent_tupconv_maps[leaf_index];
+	}
+	else
+	{
+		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+		return mtstate->mt_childparent_tupconv_maps[whichplan];
 	}
 }
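The two indexing modes of tupconv_map_for_subplan() can be summarized in one line; a standalone sketch (plain C, not part of the patch; names hypothetical) of just that translation step:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the two lookup modes in tupconv_map_for_subplan(): a
 * per-subplan map array is indexed directly by the subplan number,
 * while a per-leaf array needs the subplan number translated through
 * subplan_partition_offsets first.
 */
static int
tupconv_array_index(bool per_leaf, const int *subplan_offsets,
                    int whichplan)
{
    return per_leaf ? subplan_offsets[whichplan] : whichplan;
}
```

This is why the Assert in ExecSetupChildParentMap() insists the first caller decide the array's shape: once built per-subplan, the offsets needed for the per-leaf path simply do not exist.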
 
@@ -1660,15 +1949,13 @@ ExecModifyTable(PlanState *pstate)
 				/* Prepare to convert transition tuples from this child. */
 				if (node->mt_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
 					node->mt_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						tupconv_map_for_subplan(node, node->mt_whichplan);
 				}
 				if (node->mt_oc_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
 					node->mt_oc_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						tupconv_map_for_subplan(node, node->mt_whichplan);
 				}
 				continue;
 			}
@@ -1785,7 +2072,8 @@ ExecModifyTable(PlanState *pstate)
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1830,9 +2118,14 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ResultRelInfo *resultRelInfo;
 	TupleDesc	tupDesc;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	bool		update_tuple_routing_needed = node->partKeyUpdated;
+	PartitionTupleRouting *ptr = NULL;
+	int			num_partitions = 0;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -1905,6 +2198,16 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(resultRelInfo, mtstate->mt_onconflict != ONCONFLICT_NONE);
 
+		/*
+		 * If this is an UPDATE and a BEFORE UPDATE trigger is present, the
+		 * trigger itself might modify the partition-key values, so arrange
+		 * for tuple routing even if the plan itself doesn't update the
+		 * partition key.
+		 */
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+			operation == CMD_UPDATE)
+			update_tuple_routing_needed = true;
+
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
 		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
@@ -1942,31 +2245,36 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-	{
-		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo **partitions;
-		TupleConversionMap **partition_tupconv_maps;
-		TupleTableSlot *partition_tuple_slot;
-		int			num_parted,
-					num_partitions;
+	/*
+	 * If it's not a partitioned table after all, UPDATE tuple routing should
+	 * not be attempted.
+	 */
+	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		update_tuple_routing_needed = false;
 
+	/*
+	 * Build state for tuple routing if it's an INSERT or if it's an UPDATE of
+	 * partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || update_tuple_routing_needed))
+	{
 		ExecSetupPartitionTupleRouting(rel,
+									   mtstate->resultRelInfo,
+									   (operation == CMD_UPDATE ? nplans : 0),
 									   node->nominalRelation,
 									   estate,
-									   &partition_dispatch_info,
-									   &partitions,
-									   &partition_tupconv_maps,
-									   &partition_tuple_slot,
-									   &num_parted, &num_partitions);
-		mtstate->mt_partition_dispatch_info = partition_dispatch_info;
-		mtstate->mt_num_dispatch = num_parted;
-		mtstate->mt_partitions = partitions;
-		mtstate->mt_num_partitions = num_partitions;
-		mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
-		mtstate->mt_partition_tuple_slot = partition_tuple_slot;
+									   &mtstate->mt_partition_tuple_routing);
+
+		ptr = mtstate->mt_partition_tuple_routing;
+		num_partitions = ptr->num_partitions;
+
+		/*
+		 * The following are needed as reference objects for mapping partition
+		 * attnos in expressions such as WithCheckOptions and RETURNING.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	}
 
 	/*
@@ -1977,6 +2285,18 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		ExecSetupTransitionCaptureState(mtstate, estate);
 
 	/*
+	 * Construct a mapping from each of the per-subplan partition attnos to
+	 * the root attno.  This is required during UPDATE row movement, when the
+	 * tuple descriptor of a source partition does not match that of the root
+	 * partitioned table.  In that case we need to convert tuples to the root
+	 * tuple descriptor, because the search for the destination partition
+	 * starts from the root.  Skip this setup if it's not a partition key
+	 * update.
+	 */
+	if (update_tuple_routing_needed)
+		ExecSetupChildParentMap(mtstate, getASTriggerResultRelInfo(mtstate),
+								mtstate->mt_nplans, false);
+
+	/*
 	 * Initialize any WITH CHECK OPTION constraints if needed.
 	 */
 	resultRelInfo = mtstate->resultRelInfo;
@@ -2006,45 +2326,57 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE row
+	 * movement. DELETEs and local UPDATEs are handled above.
 	 */
-	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
+	if (node->withCheckOptionLists != NIL && num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *first_wcoList;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition.  For UPDATE, by contrast, there are as many WCO lists
+		 * as there are plans.  In either case, use the WCO expression of the
+		 * first resultRelInfo as a reference to compute attnos for the WCO
+		 * expression of each partition.  We make a copy of the WCO qual for
+		 * each partition.  Note that, if there are SubPlans in there, they
+		 * all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
-			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
-		for (i = 0; i < mtstate->mt_num_partitions; i++)
+		Assert(update_tuple_routing_needed ||
+			   (operation == CMD_INSERT &&
+				list_length(node->withCheckOptionLists) == 1 &&
+				mtstate->mt_nplans == 1));
+
+		first_wcoList = linitial(node->withCheckOptionLists);
+		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *mapped_wcoList;
 			List	   *wcoExprs = NIL;
 			ListCell   *ll;
 
-			resultRelInfo = mtstate->mt_partitions[i];
+			resultRelInfo = ptr->partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the update
+			 * result rels, that result rel would already have WithCheckOptions
+			 * initialized.
+			 */
+			if (resultRelInfo->ri_WithCheckOptions)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel, NULL);
+			mapped_wcoList = map_partition_varattnos(first_wcoList,
+													firstVarno,
+													partrel, firstResultRel,
+													NULL);
 			foreach(ll, mapped_wcoList)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
@@ -2061,7 +2393,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -2098,22 +2430,35 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		returningList = linitial(node->returningLists);
-		for (i = 0; i < mtstate->mt_num_partitions; i++)
+		firstReturningList = linitial(node->returningLists);
+		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *rlist;
 
-			resultRelInfo = mtstate->mt_partitions[i];
+			resultRelInfo = ptr->partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the update
+			 * result rels, that result rel would already have a returningList
+			 * built.
+			 */
+			if (resultRelInfo->ri_projectReturning)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel, NULL);
+			/*
+			 * Use the returning expression of the first resultRelInfo as a
+			 * reference to calculate attno's for the returning expression of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList,
+											firstVarno,
+											partrel, firstResultRel, NULL);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
@@ -2358,6 +2703,7 @@ void
 ExecEndModifyTable(ModifyTableState *node)
 {
 	int			i;
+	CmdType		operation = node->operation;
 
 	/*
 	 * Allow any FDWs to shut down
@@ -2376,29 +2722,46 @@ ExecEndModifyTable(ModifyTableState *node)
 	/*
 	 * Close all the partitioned tables, leaf partitions, and their indices
 	 *
-	 * Remember node->mt_partition_dispatch_info[0] corresponds to the root
+	 * Remember ptr->partition_dispatch_info[0] corresponds to the root
 	 * partitioned table, which we must not try to close, because it is the
 	 * main target table of the query that will be closed by ExecEndPlan().
 	 * Also, tupslot is NULL for the root partitioned table.
 	 */
-	for (i = 1; i < node->mt_num_dispatch; i++)
+	if (node->mt_partition_tuple_routing)
 	{
-		PartitionDispatch pd = node->mt_partition_dispatch_info[i];
+		PartitionTupleRouting *ptr = node->mt_partition_tuple_routing;
 
-		heap_close(pd->reldesc, NoLock);
-		ExecDropSingleTupleTableSlot(pd->tupslot);
-	}
-	for (i = 0; i < node->mt_num_partitions; i++)
-	{
-		ResultRelInfo *resultRelInfo = node->mt_partitions[i];
+		for (i = 1; i < ptr->num_dispatch; i++)
+		{
+			PartitionDispatch pd = ptr->partition_dispatch_info[i];
 
-		ExecCloseIndices(resultRelInfo);
-		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
-	}
+			heap_close(pd->reldesc, NoLock);
+			ExecDropSingleTupleTableSlot(pd->tupslot);
+		}
+		for (i = 0; i < ptr->num_partitions; i++)
+		{
+			ResultRelInfo *resultRelInfo = ptr->partitions[i];
+
+			/*
+			 * If this result rel is one of the subplan result rels, let
+			 * ExecEndPlan() close it. For INSERTs, this does not apply because
+			 * leaf partition result rels are always newly allocated.
+			 */
+			if (operation == CMD_UPDATE &&
+				resultRelInfo >= node->resultRelInfo &&
+				resultRelInfo < node->resultRelInfo + node->mt_nplans)
+				continue;
 
-	/* Release the standalone partition tuple descriptor, if any */
-	if (node->mt_partition_tuple_slot)
-		ExecDropSingleTupleTableSlot(node->mt_partition_tuple_slot);
+			ExecCloseIndices(resultRelInfo);
+			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
+		}
+
+		/* Release the standalone partition tuple descriptors, if any */
+		if (ptr->root_tuple_slot)
+			ExecDropSingleTupleTableSlot(ptr->root_tuple_slot);
+		if (ptr->partition_tuple_slot)
+			ExecDropSingleTupleTableSlot(ptr->partition_tuple_slot);
+	}
 
 	/*
 	 * Free the exprcontext
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index d9ff8a7..0f2f970 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -204,6 +204,7 @@ _copyModifyTable(const ModifyTable *from)
 	COPY_SCALAR_FIELD(canSetTag);
 	COPY_SCALAR_FIELD(nominalRelation);
 	COPY_NODE_FIELD(partitioned_rels);
+	COPY_SCALAR_FIELD(partKeyUpdated);
 	COPY_NODE_FIELD(resultRelations);
 	COPY_SCALAR_FIELD(resultRelIndex);
 	COPY_SCALAR_FIELD(rootResultRelIndex);
@@ -2261,6 +2262,7 @@ _copyPartitionedChildRelInfo(const PartitionedChildRelInfo *from)
 
 	COPY_SCALAR_FIELD(parent_relid);
 	COPY_NODE_FIELD(child_rels);
+	COPY_SCALAR_FIELD(is_partition_key_update);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 2866fd7..6e2e3dd 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -908,6 +908,7 @@ _equalPartitionedChildRelInfo(const PartitionedChildRelInfo *a, const Partitione
 {
 	COMPARE_SCALAR_FIELD(parent_relid);
 	COMPARE_NODE_FIELD(child_rels);
+	COMPARE_SCALAR_FIELD(is_partition_key_update);
 
 	return true;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index c97ee24..a5e71a2 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -372,6 +372,7 @@ _outModifyTable(StringInfo str, const ModifyTable *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(partKeyUpdated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_INT_FIELD(resultRelIndex);
 	WRITE_INT_FIELD(rootResultRelIndex);
@@ -2103,6 +2104,7 @@ _outModifyTablePath(StringInfo str, const ModifyTablePath *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(partKeyUpdated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_NODE_FIELD(subroots);
@@ -2525,6 +2527,7 @@ _outPartitionedChildRelInfo(StringInfo str, const PartitionedChildRelInfo *node)
 
 	WRITE_UINT_FIELD(parent_relid);
 	WRITE_NODE_FIELD(child_rels);
+	WRITE_BOOL_FIELD(is_partition_key_update);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 7eb67fc0..9542b94 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1568,6 +1568,7 @@ _readModifyTable(void)
 	READ_BOOL_FIELD(canSetTag);
 	READ_UINT_FIELD(nominalRelation);
 	READ_NODE_FIELD(partitioned_rels);
+	READ_BOOL_FIELD(partKeyUpdated);
 	READ_NODE_FIELD(resultRelations);
 	READ_INT_FIELD(resultRelIndex);
 	READ_INT_FIELD(rootResultRelIndex);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 44f6b03..be34463 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1359,7 +1359,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 			case RTE_RELATION:
 				if (rte->relkind == RELKIND_PARTITIONED_TABLE)
 					partitioned_rels =
-						get_partitioned_child_rels(root, rel->relid);
+						get_partitioned_child_rels(root, rel->relid, NULL);
 				break;
 			case RTE_SUBQUERY:
 				build_partitioned_rels = true;
@@ -1397,7 +1397,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		{
 			List	   *cprels;
 
-			cprels = get_partitioned_child_rels(root, childrel->relid);
+			cprels = get_partitioned_child_rels(root, childrel->relid, NULL);
 			partitioned_rels = list_concat(partitioned_rels,
 										   list_copy(cprels));
 		}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index d445477..549821e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -278,6 +278,7 @@ static ProjectSet *make_project_set(List *tlist, Plan *subplan);
 static ModifyTable *make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool partKeyUpdated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
@@ -2371,6 +2372,7 @@ create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 							best_path->canSetTag,
 							best_path->nominalRelation,
 							best_path->partitioned_rels,
+							best_path->partKeyUpdated,
 							best_path->resultRelations,
 							subplans,
 							best_path->withCheckOptionLists,
@@ -6428,6 +6430,7 @@ static ModifyTable *
 make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool partKeyUpdated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam)
@@ -6454,6 +6457,7 @@ make_modifytable(PlannerInfo *root,
 	node->canSetTag = canSetTag;
 	node->nominalRelation = nominalRelation;
 	node->partitioned_rels = partitioned_rels;
+	node->partKeyUpdated = partKeyUpdated;
 	node->resultRelations = resultRelations;
 	node->resultRelIndex = -1;	/* will be set correctly in setrefs.c */
 	node->rootResultRelIndex = -1;	/* will be set correctly in setrefs.c */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index ef2eaea..ce26bbe 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1101,6 +1101,7 @@ inheritance_planner(PlannerInfo *root)
 	Query	   *parent_parse;
 	Bitmapset  *parent_relids = bms_make_singleton(top_parentRTindex);
 	PlannerInfo **parent_roots = NULL;
+	bool		partColsUpdated = false;
 
 	Assert(parse->commandType != CMD_INSERT);
 
@@ -1172,7 +1173,8 @@ inheritance_planner(PlannerInfo *root)
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		nominalRelation = top_parentRTindex;
-		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex);
+		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex,
+													  &partColsUpdated);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
 	}
@@ -1512,6 +1514,7 @@ inheritance_planner(PlannerInfo *root)
 									 parse->canSetTag,
 									 nominalRelation,
 									 partitioned_rels,
+									 partColsUpdated,
 									 resultRelations,
 									 subpaths,
 									 subroots,
@@ -2123,6 +2126,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 										parse->canSetTag,
 										parse->resultRelation,
 										NIL,
+										false,
 										list_make1_int(parse->resultRelation),
 										list_make1(path),
 										list_make1(root),
@@ -6152,17 +6156,22 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 /*
  * get_partitioned_child_rels
  *		Returns a list of the RT indexes of the partitioned child relations
- *		with rti as the root parent RT index.
+ *		with rti as the root parent RT index.  Also sets
+ *		*is_partition_key_update to true if any of the root rte's updated
+ *		columns is part of a partition key.
  *
  * Note: This function might get called even for range table entries that
  * are not partitioned tables; in such a case, it will simply return NIL.
  */
 List *
-get_partitioned_child_rels(PlannerInfo *root, Index rti)
+get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   bool *is_partition_key_update)
 {
 	List	   *result = NIL;
 	ListCell   *l;
 
+	if (is_partition_key_update)
+		*is_partition_key_update = false;
+
 	foreach(l, root->pcinfo_list)
 	{
 		PartitionedChildRelInfo *pc = lfirst_node(PartitionedChildRelInfo, l);
@@ -6170,6 +6179,8 @@ get_partitioned_child_rels(PlannerInfo *root, Index rti)
 		if (pc->parent_relid == rti)
 		{
 			result = pc->child_rels;
+			if (is_partition_key_update)
+				*is_partition_key_update = pc->is_partition_key_update;
 			break;
 		}
 	}
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index f620243..7babb35 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -105,7 +105,8 @@ static void expand_partitioned_rtentry(PlannerInfo *root,
 						   RangeTblEntry *parentrte,
 						   Index parentRTindex, Relation parentrel,
 						   PlanRowMark *top_parentrc, LOCKMODE lockmode,
-						   List **appinfos, List **partitioned_child_rels);
+						   List **appinfos, List **partitioned_child_rels,
+						   bool *is_partition_key_update);
 static void expand_single_inheritance_child(PlannerInfo *root,
 								RangeTblEntry *parentrte,
 								Index parentRTindex, Relation parentrel,
@@ -1466,16 +1467,19 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	if (RelationGetPartitionDesc(oldrelation) != NULL)
 	{
 		List	   *partitioned_child_rels = NIL;
+		bool		is_partition_key_update = false;
 
 		Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
 
 		/*
 		 * If this table has partitions, recursively expand them in the order
-		 * in which they appear in the PartitionDesc.
+		 * in which they appear in the PartitionDesc.  While at it, also
+		 * extract the partition key columns of all the partitioned tables.
 		 */
 		expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
 								   lockmode, &root->append_rel_list,
-								   &partitioned_child_rels);
+								   &partitioned_child_rels,
+								   &is_partition_key_update);
 
 		/*
 		 * We keep a list of objects in root, each of which maps a root
@@ -1492,6 +1496,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 			pcinfo = makeNode(PartitionedChildRelInfo);
 			pcinfo->parent_relid = rti;
 			pcinfo->child_rels = partitioned_child_rels;
+			pcinfo->is_partition_key_update = is_partition_key_update;
 			root->pcinfo_list = lappend(root->pcinfo_list, pcinfo);
 		}
 	}
@@ -1568,7 +1573,8 @@ static void
 expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 						   Index parentRTindex, Relation parentrel,
 						   PlanRowMark *top_parentrc, LOCKMODE lockmode,
-						   List **appinfos, List **partitioned_child_rels)
+						   List **appinfos, List **partitioned_child_rels,
+						   bool *is_partition_key_update)
 {
 	int			i;
 	RangeTblEntry *childrte;
@@ -1583,6 +1589,17 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 
 	Assert(parentrte->inh);
 
+	/*
+	 * Note down whether any partition key cols are being updated. Though it's
+	 * the root partitioned table's updatedCols we are interested in, we
+	 * instead use parentrte to get the updatedCols. This is convenient because
+	 * parentrte already has the root partrel's updatedCols translated to match
+	 * the attribute ordering of parentrel.
+	 */
+	if (!*is_partition_key_update)
+		*is_partition_key_update =
+			has_partition_attrs(parentrel, parentrte->updatedCols, NULL);
+
 	/* First expand the partitioned table itself. */
 	expand_single_inheritance_child(root, parentrte, parentRTindex, parentrel,
 									top_parentrc, parentrel,
@@ -1622,7 +1639,8 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 		if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 			expand_partitioned_rtentry(root, childrte, childRTindex,
 									   childrel, top_parentrc, lockmode,
-									   appinfos, partitioned_child_rels);
+									   appinfos, partitioned_child_rels,
+									   is_partition_key_update);
 
 		/* Close child relation, but keep locks */
 		heap_close(childrel, NoLock);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 68dee0f..0ce5339 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3207,6 +3207,8 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partitioned_rels' is an integer list of RT indexes of non-leaf tables in
  *		the partition tree, if this is an UPDATE/DELETE to a partitioned table.
  *		Otherwise NIL.
+ * 'partKeyUpdated' is true if any partitioning columns are being updated,
+ *		either in the named relation or in a descendant partitioned table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
  * 'subpaths' is a list of Path(s) producing source data (one per rel)
  * 'subroots' is a list of PlannerInfo structs (one per rel)
@@ -3220,6 +3222,7 @@ ModifyTablePath *
 create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool partKeyUpdated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
@@ -3287,6 +3290,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->canSetTag = canSetTag;
 	pathnode->nominalRelation = nominalRelation;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
+	pathnode->partKeyUpdated = partKeyUpdated;
 	pathnode->resultRelations = resultRelations;
 	pathnode->subpaths = subpaths;
 	pathnode->subroots = subroots;
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 295e9d2..c6fee08 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -54,12 +54,16 @@ extern void check_new_partition_bound(char *relname, Relation parent,
 extern Oid	get_partition_parent(Oid relid);
 extern List *get_qual_from_partbound(Relation rel, Relation parent,
 						PartitionBoundSpec *spec);
-extern List *map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+extern List *map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
-
+extern void pull_child_partition_columns(Relation rel,
+							 Relation parent,
+							 Bitmapset **partcols);
+extern bool has_partition_attrs(Relation rel, Bitmapset *attnums,
+							bool *used_in_expr);
 extern Oid	get_default_oid_from_partdesc(PartitionDesc partdesc);
 extern Oid	get_default_partition_oid(Oid parentId);
 extern void update_default_partition_oid(Oid parentId, Oid defaultPartId);
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 64e5aab..7e69c48 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -49,14 +49,51 @@ typedef struct PartitionDispatchData
 
 typedef struct PartitionDispatchData *PartitionDispatch;
 
+/*-----------------------
+ * PartitionTupleRouting - Encapsulates all information required to execute
+ * tuple-routing between partitions.
+ *
+ * partition_dispatch_info		Array of PartitionDispatch objects with one
+ *								entry for every partitioned table in the
+ *								partition tree.
+ * num_dispatch					number of partitioned tables in the partition
+ *								tree (= length of partition_dispatch_info[])
+ * partitions					Array of ResultRelInfo* objects with one entry
+ *								for every leaf partition in the partition tree.
+ * num_partitions				Number of leaf partitions in the partition tree
+ *								(= 'partitions' array length)
+ * parentchild_tupconv_maps		Array of TupleConversionMap objects with one
+ *								entry for every leaf partition (required to
+ *								convert input tuple based on the root table's
+ *								rowtype to a leaf partition's rowtype after
+ *								tuple routing is done)
+ * subplan_partition_offsets	int array, ordered by UPDATE subplan.  Each
+ *								element contains the index of the
+ *								corresponding partition in the 'partitions'
+ *								array.
+ * partition_tuple_slot			TupleTableSlot to be used to manipulate any
+ *								given leaf partition's rowtype after that
+ *								partition is chosen for insertion by
+ *								tuple-routing.
+ *-----------------------
+ */
+typedef struct PartitionTupleRouting
+{
+	PartitionDispatch *partition_dispatch_info;
+	int			num_dispatch;
+	ResultRelInfo **partitions;
+	int			num_partitions;
+	TupleConversionMap **parentchild_tupconv_maps;
+	int		   *subplan_partition_offsets;
+	TupleTableSlot *partition_tuple_slot;
+	TupleTableSlot *root_tuple_slot;
+} PartitionTupleRouting;
+
 extern void ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
-							   PartitionDispatch **pd,
-							   ResultRelInfo ***partitions,
-							   TupleConversionMap ***tup_conv_maps,
-							   TupleTableSlot **partition_tuple_slot,
-							   int *num_parted, int *num_partitions);
+							   PartitionTupleRouting **partition_tuple_routing);
 extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index b5578f5..5a385e2 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,9 +187,12 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
-extern void ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
 				   TupleTableSlot *slot, EState *estate);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e05bc04..64cf3dd 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -976,21 +976,15 @@ typedef struct ModifyTableState
 	TupleTableSlot *mt_existing;	/* slot to store existing target tuple in */
 	List	   *mt_excludedtlist;	/* the excluded pseudo relation's tlist  */
 	TupleTableSlot *mt_conflproj;	/* CONFLICT ... SET ... projection target */
-	struct PartitionDispatchData **mt_partition_dispatch_info;
-	/* Tuple-routing support info */
-	int			mt_num_dispatch;	/* Number of entries in the above array */
-	int			mt_num_partitions;	/* Number of members in the following
-									 * arrays */
-	ResultRelInfo **mt_partitions;	/* Per partition result relation pointers */
-	TupleConversionMap **mt_partition_tupconv_maps;
-	/* Per partition tuple conversion map */
-	TupleTableSlot *mt_partition_tuple_slot;
+	struct PartitionTupleRouting *mt_partition_tuple_routing; /* Tuple-routing support info */
 	struct TransitionCaptureState *mt_transition_capture;
 	/* controls transition table population for specified operation */
 	struct TransitionCaptureState *mt_oc_transition_capture;
 	/* controls transition table population for INSERT...ON CONFLICT UPDATE */
-	TupleConversionMap **mt_transition_tupconv_maps;
-	/* Per plan/partition tuple conversion */
+	TupleConversionMap **mt_childparent_tupconv_maps;
+	/* Per plan/partition map for tuple conversion from child to root */
+	bool		mt_is_tupconv_perpart;	/* is the above map per-partition? */
 } ModifyTableState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 9b38d44..b36dafc 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -219,6 +219,7 @@ typedef struct ModifyTable
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		partKeyUpdated;	/* some part key in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	int			resultRelIndex; /* index of first resultRel in plan's list */
 	int			rootResultRelIndex; /* index of the partitioned table root */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 9e68e65..43d0164 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1667,6 +1667,7 @@ typedef struct ModifyTablePath
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		partKeyUpdated;	/* some part key in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	List	   *subpaths;		/* Path(s) producing source data */
 	List	   *subroots;		/* per-target-table PlannerInfos */
@@ -2117,6 +2118,9 @@ typedef struct PartitionedChildRelInfo
 
 	Index		parent_relid;
 	List	   *child_rels;
+	bool		is_partition_key_update;	/* is the partition key of any of
+											 * the partitioned tables
+											 * updated? */
 } PartitionedChildRelInfo;
 
 /*
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e9ed16a..39ce47d 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -238,6 +238,7 @@ extern ModifyTablePath *create_modifytable_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool partKeyUpdated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index 2801bfd..9f0533c 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -57,7 +57,8 @@ extern Expr *preprocess_phv_expression(PlannerInfo *root, Expr *expr);
 
 extern bool plan_cluster_use_sort(Oid tableOid, Oid indexOid);
 
-extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti);
+extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   bool *is_partition_key_update);
 extern List *get_partitioned_child_rels_for_join(PlannerInfo *root,
 									Relids join_relids);
 
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index b69ceaa..dd6242b 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,36 +198,371 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+---------------------------
+-- UPDATE with row movement
+---------------------------
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If the partition key is updated, the row should be moved to the appropriate
+-- partition. Updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+create table part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+alter table range_parted attach partition part_b_20_b_30 for values from ('b', 20) to ('b', 30);
+create table part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+alter table range_parted attach partition part_b_10_b_20 for values from ('b', 10) to ('b', 20);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+-- This tests a partition-key UPDATE on a partitioned table that does not have any child partitions.
+update part_b_10_b_20 set b = b - 6;
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) partition by range (d);
+alter table part_c_100_200 drop column e, drop column c, drop column a;
+alter table part_c_100_200 add column c numeric, add column e varchar, add column a text;
+alter table part_c_100_200 drop column b;
+alter table part_c_100_200 add column b bigint;
+create table part_d_1_15 partition of part_c_100_200 for values from (1) to (15);
+create table part_d_15_20 partition of part_c_100_200 for values from (15) to (20);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+(6 rows)
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+             QUERY PLAN              
+-------------------------------------
+ Update on range_parted
+   Update on part_a_1_a_10
+   Update on part_a_10_a_20
+   Update on part_b_1_b_10
+   Update on part_c_1_100
+   Update on part_d_1_15
+   Update on part_d_15_20
+   Update on part_b_20_b_30
+   ->  Seq Scan on part_a_1_a_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_a_10_a_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_1_b_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_1_100
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_1_15
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_15_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_20_b_30
+         Filter: (c > '97'::numeric)
+(22 rows)
+
+-- fail (row movement happens only within the partition subtree):
+update part_c_100_200 set c = c - 20, d = c where c = 105;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (105, 85, null, b, 15).
+-- fail (no partition key update, so no attempt to move tuple, but "a = 'a'" violates partition constraint enforced by root partition)
+update part_b_10_b_20 set a = 'a';
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (null, 1, 96, 12, a).
+-- success; partition key update, no constraint violation
+update range_parted set d = d - 10 where d > 10;
+-- success; no partition key update, no constraint violation
+update range_parted set e = d;
+-- No row found:
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d | e 
+----------------+---+----+-----+---+---
+ part_a_10_a_20 | a | 10 | 200 | 1 | 1
+ part_a_1_a_10  | a |  1 |   1 | 1 | 1
+ part_d_1_15    | b | 12 | 116 | 1 | 1
+ part_d_1_15    | b | 13 | 117 | 2 | 2
+ part_d_1_15    | b | 15 | 125 | 6 | 6
+ part_d_1_15    | b | 17 | 125 | 9 | 9
+(6 rows)
+
+-- fail (row movement happens only within the partition subtree):
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_d_1_15" violates partition constraint
+DETAIL:  Failing row contains (2, 117, 2, b, 7).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_a_1_a_10 | a |  4 | 200 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+(6 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120, 1, 1).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120, 1, 1).
+-- succeeds, row movement, check option passes
+update upview set a = 'b', b = 15 where b = 4;
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+ part_d_1_15   | b | 15 | 199 | 1 | 1
+(6 rows)
+
+-- cleanup
+drop view upview;
+-- RETURNING with whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+ range_parted  | a | b  | c  | d  | e 
+---------------+---+----+----+----+---
+ (b,15,95,16,) | b | 15 | 95 | 16 | 
+ (b,17,95,19,) | b | 17 | 95 | 19 | 
+(2 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  95 | 16 | 
+ part_c_1_100   | b | 17 |  95 | 19 | 
+(6 rows)
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,110,1,), (b,13,98,2,), (b,15,106,16,), (b,17,106,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 13 |  98 |  2 | 
+ part_d_15_20   | b | 15 | 106 | 16 | 
+ part_d_15_20   | b | 17 | 106 | 19 | 
+ part_d_1_15    | b | 12 | 110 |  1 | 
+(6 rows)
+
+:init_range_parted;
+-- Enabling OLD TABLE capture for both DELETE and UPDATE statement triggers
+-- should not cause DELETEd rows to be captured twice. The same applies to
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,146,1,), (b,13,147,2,), (b,15,155,16,), (b,17,155,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_15_20   | b | 15 | 155 | 16 | 
+ part_d_15_20   | b | 17 | 155 | 19 | 
+ part_d_1_15    | b | 12 | 146 |  1 | 
+ part_d_1_15    | b | 13 | 147 |  2 | 
+(6 rows)
+
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+-- Install BR triggers on the child partitions, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d1_15 before update or insert on part_d_1_15
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d15_20 before update or insert on part_d_15_20
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 15 |  98 |  2 | 
+ part_d_15_20   | b | 17 | 106 | 16 | 
+ part_d_15_20   | b | 19 | 106 | 19 | 
+ part_d_1_15    | b | 15 | 110 |  1 | 
+(6 rows)
+
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_15_20   | b | 17 | 155 | 16 | 
+ part_d_15_20   | b | 19 | 155 | 19 | 
+ part_d_1_15    | b | 15 | 146 |  1 | 
+ part_d_1_15    | b | 16 | 147 |  2 | 
+(6 rows)
+
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_d1_15 ON part_d_1_15;
+drop trigger trig_d15_20 ON part_d_15_20;
+drop function func_parted_mod_b();
+-- statement triggers with update row movement
+---------------------------------------------------
+:init_range_parted;
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_1_15
+create trigger d1_delete_trig
+  after delete on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_update_trig
+  after update on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_insert_trig
+  after insert on part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+create trigger d15_delete_trig
+  after delete on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_update_trig
+  after update on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_insert_trig
+  after insert on part_d_15_20 for each statement execute procedure trigfunc();
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+NOTICE:  trigger = parent_update_trig fired on table range_parted during UPDATE
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 150 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  55 | 16 | 
+ part_c_1_100   | b | 17 |  55 | 19 | 
+(6 rows)
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger d1_delete_trig ON part_d_1_15;
+drop trigger d1_update_trig ON part_d_1_15;
+drop trigger d1_insert_trig ON part_d_1_15;
+drop trigger d15_delete_trig ON part_d_15_20;
+drop trigger d15_update_trig ON part_d_15_20;
+drop trigger d15_insert_trig ON part_d_15_20;
+drop table mintab;
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
-                                  Table "public.part_def"
- Column |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description 
---------+---------+-----------+----------+---------+----------+--------------+-------------
- a      | text    |           |          |         | extended |              | 
- b      | integer |           |          |         | plain    |              | 
+                                       Table "public.part_def"
+ Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description 
+--------+-------------------+-----------+----------+---------+----------+--------------+-------------
+ a      | text              |           |          |         | extended |              | 
+ b      | bigint            |           |          |         | plain    |              | 
+ c      | numeric           |           |          |         | main     |              | 
+ d      | integer           |           |          |         | plain    |              | 
+ e      | character varying |           |          |         | extended |              | 
 Partition of: range_parted DEFAULT
-Partition constraint: (NOT ((a IS NOT NULL) AND (b IS NOT NULL) AND (((a = 'a'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'a'::text) AND (b >= 10) AND (b < 20)) OR ((a = 'b'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'b'::text) AND (b >= 10) AND (b < 20)))))
+Partition constraint: (NOT ((a IS NOT NULL) AND (b IS NOT NULL) AND (((a = 'a'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'a'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'b'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '20'::bigint) AND (b < '30'::bigint)))))
 
 insert into range_parted values ('c', 9);
 -- ok
@@ -235,7 +570,55 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 ERROR:  new row for relation "part_def" violates partition constraint
-DETAIL:  Failing row contains (a, 9).
+DETAIL:  Failing row contains (a, 9, null, null, null).
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from a non-default to the default partition.
+-- Fail; the default partition is not under part_a_10_a_20.
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+ERROR:  new row for relation "part_a_10_a_20" violates partition constraint
+DETAIL:  Failing row contains (ad, 10, 200, 1, null).
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+:show_data;
+ partname | a  | b  |  c  | d  | e 
+----------+----+----+-----+----+---
+ part_def | ad |  1 |   1 |  1 | 
+ part_def | ad | 10 | 200 |  1 | 
+ part_def | bd | 12 |  96 |  1 | 
+ part_def | bd | 13 |  97 |  2 | 
+ part_def | bd | 15 | 105 | 16 | 
+ part_def | bd | 17 | 105 | 19 | 
+ part_def | d  |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from the default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
 create table list_parted (
 	a text,
 	b int
@@ -250,6 +633,111 @@ ERROR:  new row for relation "list_default" violates partition constraint
 DETAIL:  Failing row contains (a, 10).
 -- ok
 update list_default set a = 'x' where a = 'd';
+drop table list_parted;
+--------------
+-- UPDATE involving the partition key or non-partition columns,
+-- with different column orderings,
+-- and triggers.
+--------------
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+-- Test partition constraint violation when an intermediate ancestor is used
+-- and the constraint is inherited from the upper root.
+update sub_parted set a = 2 where c = 10;
+ERROR:  new row for relation "sub_part2" violates partition constraint
+DETAIL:  Failing row contains (2, 10, 2).
+-- UPDATE that does not modify the partition key of the partitions chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Triggers can cause UPDATE row movement if they modify the partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This changes the partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1.
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+drop trigger parted_mod_b ON sub_part1 ;
+-- If a BR DELETE trigger prevents the DELETE from happening, we should also
+-- skip the INSERT when that DELETE is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+drop trigger trig_skip_delete ON sub_part1 ;
+-- UPDATE the partition key with a FROM clause. If the join produces multiple
+-- output rows for the same row to be modified, we should tuple-route the row
+-- only once; no duplicate rows should be inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 |  1 | 70
+ list_part1 | 2 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop table non_parted;
+drop function func_parted_mod_b();
 -- create custom operator class and hash function, for the same reason
 -- explained in alter_table.sql
 create or replace function dummy_hashint4(a int4, seed int8) returns int8 as
@@ -271,9 +759,8 @@ insert into hpart4 values (3, 4);
 update hpart1 set a = 3, b=4 where a = 1;
 ERROR:  new row for relation "hpart1" violates partition constraint
 DETAIL:  Failing row contains (3, 4).
+-- ok: row movement
 update hash_parted set b = b - 1 where b = 1;
-ERROR:  new row for relation "hpart1" violates partition constraint
-DETAIL:  Failing row contains (1, 0).
 -- ok
 update hash_parted set b = b + 8 where b = 1;
 -- cleanup
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 0c70d64..10c10c7 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,25 +107,233 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+
+---------------------------
+-- UPDATE with row movement
+---------------------------
+
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If the partition key is updated, the row should be moved to the appropriate
+-- partition. Updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+create table part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+alter table range_parted attach partition part_b_20_b_30 for values from ('b', 20) to ('b', 30);
+create table part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+alter table range_parted attach partition part_b_10_b_20 for values from ('b', 10) to ('b', 20);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+
+-- This tests a partition-key UPDATE on a partitioned table that does not have any child partitions.
+update part_b_10_b_20 set b = b - 6;
+
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) partition by range (d);
+alter table part_c_100_200 drop column e, drop column c, drop column a;
+alter table part_c_100_200 add column c numeric, add column e varchar, add column a text;
+alter table part_c_100_200 drop column b;
+alter table part_c_100_200 add column b bigint;
+create table part_d_1_15 partition of part_c_100_200 for values from (1) to (15);
+create table part_d_15_20 partition of part_c_100_200 for values from (15) to (20);
+
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+
+create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+
+-- fail (row movement happens only within the partition subtree):
+update part_c_100_200 set c = c - 20, d = c where c = 105;
+-- fail (no partition key update, so no attempt to move tuple, but "a = 'a'" violates partition constraint enforced by root partition)
+update part_b_10_b_20 set a = 'a';
+-- success; partition key update, no constraint violation
+update range_parted set d = d - 10 where d > 10;
+-- success; no partition key update, no constraint violation
+update range_parted set e = d;
+-- No row found:
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+:show_data;
+
+-- fail (row movement happens only within the partition subtree):
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+:show_data;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement, check option passes
+update upview set a = 'b', b = 15 where b = 4;
+
+:show_data;
+
+-- cleanup
+drop view upview;
+
+-- RETURNING with whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+:show_data;
+
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+
+-- Enabling OLD TABLE capture for both DELETE and UPDATE statement triggers
+-- should not cause DELETEd rows to be captured twice. The same applies to
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+
+-- Install BR triggers on the child partitions, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d1_15 before update or insert on part_d_1_15
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d15_20 before update or insert on part_d_15_20
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_d1_15 ON part_d_1_15;
+drop trigger trig_d15_20 ON part_d_15_20;
+drop function func_parted_mod_b();
+
+
+-- statement triggers with update row movement
+---------------------------------------------------
+
+:init_range_parted;
+
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_d_1_15
+create trigger d1_delete_trig
+  after delete on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_update_trig
+  after update on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_insert_trig
+  after insert on part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+create trigger d15_delete_trig
+  after delete on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_update_trig
+  after update on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_insert_trig
+  after insert on part_d_15_20 for each statement execute procedure trigfunc();
+
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or
+-- insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+:show_data;
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger d1_delete_trig ON part_d_1_15;
+drop trigger d1_update_trig ON part_d_1_15;
+drop trigger d1_insert_trig ON part_d_1_15;
+drop trigger d15_delete_trig ON part_d_15_20;
+drop trigger d15_update_trig ON part_d_15_20;
+drop trigger d15_insert_trig ON part_d_15_20;
+
+drop table mintab;
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
 
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
 insert into range_parted values ('c', 9);
@@ -134,6 +342,21 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 
+:show_data;
+
+-- Update row movement from non-default to default partition.
+-- Fail, the default partition is not under part_a_10_a_20.
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+:show_data;
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+:show_data;
+
 create table list_parted (
 	a text,
 	b int
@@ -148,6 +371,84 @@ update list_default set a = 'a' where a = 'd';
 -- ok
 update list_default set a = 'x' where a = 'd';
 
+drop table list_parted;
+
+--------------
+-- UPDATE of partition key or non-partition columns, with different column
+-- ordering and triggers.
+--------------
+
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+
+-- Test partition constraint violation when an intermediate ancestor is used
+-- and the constraint is inherited from the root.
+update sub_parted set a = 2 where c = 10;
+
+-- UPDATE that does not modify the partition key of the partitions chosen for update.
+select tableoid::regclass::text, * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text, * from list_parted where a = 2 order by 1;
+
+
+-----------
+-- Triggers can cause UPDATE row movement if they modify the partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+select tableoid::regclass::text, * from list_parted order by 1, 2, 3, 4;
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1.
+update list_parted set c = 70 where b = 1;
+select tableoid::regclass::text, * from list_parted order by 1, 2, 3, 4;
+
+drop trigger parted_mod_b ON sub_part1;
+
+-- If a BR DELETE trigger prevents the DELETE from happening, we should also
+-- skip the INSERT when that DELETE is part of an UPDATE converted to
+-- DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text, * from list_parted order by 1, 2, 3, 4;
+
+drop trigger trig_skip_delete ON sub_part1;
+
+-- UPDATE the partition key with a FROM clause. If the join produces multiple
+-- output rows for the same row to be modified, we should tuple-route the row
+-- only once; no additional rows should be inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text, * from list_parted order by 1, 2, 3, 4;
+drop table non_parted;
+
+drop function func_parted_mod_b();
+
 -- create custom operator class and hash function, for the same reason
 -- explained in alter_table.sql
 create or replace function dummy_hashint4(a int4, seed int8) returns int8 as
@@ -169,6 +470,7 @@ insert into hpart4 values (3, 4);
 
 -- fail
 update hpart1 set a = 3, b=4 where a = 1;
+-- ok, with row movement
 update hash_parted set b = b - 1 where b = 1;
 -- ok
 update hash_parted set b = b + 8 where b = 1;
#206Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#205)
2 attachment(s)
Re: [HACKERS] UPDATE of partition key

On 29 November 2017 at 17:25, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Also, this
patch contains the incremental changes that were attached in the patch
encapsulate_partinfo.patch attached in [1]. In the next version, I
will extract them out again and keep them as a separate preparatory
patch.

As mentioned above, attached is
encapsulate_partinfo_preparatory.patch. This addresses David Rowley's
request to move all the partition-related information into a new
structure, PartitionTupleRouting, so that ExecSetupPartitionTupleRouting()
can be passed a pointer to this structure instead of the many parameters
that we currently pass [1].

The main update-partition-key patch is to be applied over the above
preparatory patch. Attached is its v27 version. This version addresses
Thomas Munro's comments :

On 14 November 2017 at 01:32, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Fri, Nov 10, 2017 at 4:42 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Attached is v23 patch that has just the above changes (and also
rebased on hash-partitioning changes, like update.sql). I am still
doing some sanity testing on this, although regression passes.

The test coverage[1] is 96.62%. Nice work. Here are the bits that
aren't covered:

In partition.c's pull_child_partition_columns(), the following loop is
never run:

+       foreach(lc, partexprs)
+       {
+               Node       *expr = (Node *) lfirst(lc);
+
+               pull_varattnos(expr, 1, &child_keycols);
+       }

In update.sql, part_c_100_200 is now partitioned by range(abs(d)), so
the new function has_partition_atttrs() (which, in recent patch versions,
has replaced pull_child_partition_columns) now goes through the above
code segment. This was indeed an important part left uncovered. Thanks.

In nodeModifyTable.c, the following conditional branches are never run:

if (mtstate->mt_oc_transition_capture != NULL)
+               {
+                       Assert(mtstate->mt_is_tupconv_perpart == true);
mtstate->mt_oc_transition_capture->tcs_map =
-
mtstate->mt_transition_tupconv_maps[leaf_part_index];
+
mtstate->mt_childparent_tupconv_maps[leaf_part_index];
+               }

I think this code segment is never hit, even without the patch: ON
CONFLICT is not supported for partitions, and this code segment runs
only for partitions.

if (node->mt_oc_transition_capture != NULL)
{
-
Assert(node->mt_transition_tupconv_maps != NULL);

node->mt_oc_transition_capture->tcs_map =
-
node->mt_transition_tupconv_maps[node->mt_whichplan];
+
tupconv_map_for_subplan(node, node->mt_whichplan);
}

Here also, I verified that none of the regression tests hits this
segment. The likely reason: this segment runs when an UPDATE moves on
to the next subplan, and mtstate->mt_oc_transition_capture is never
allocated for UPDATEs.

[1]: /messages/by-id/CAJ3gD9f86H64e4OCjFFszWW7f4EeyriSaFL8SvJs2yOUbc8VEw@mail.gmail.com

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

encapsulate_partinfo_preparatory.patchapplication/octet-stream; name=encapsulate_partinfo_preparatory.patchDownload
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 13eb9e3..61ead28 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -165,12 +165,9 @@ typedef struct CopyStateData
 	bool		volatile_defexprs;	/* is any of defexprs volatile? */
 	List	   *range_table;
 
-	PartitionDispatch *partition_dispatch_info;
-	int			num_dispatch;	/* Number of entries in the above array */
-	int			num_partitions; /* Number of members in the following arrays */
-	ResultRelInfo **partitions; /* Per partition result relation pointers */
-	TupleConversionMap **partition_tupconv_maps;
-	TupleTableSlot *partition_tuple_slot;
+	PartitionTupleRouting *partition_tuple_routing;
+	/* Tuple-routing support info */
+
 	TransitionCaptureState *transition_capture;
 	TupleConversionMap **transition_tupconv_maps;
 
@@ -2471,27 +2468,14 @@ CopyFrom(CopyState cstate)
 	 */
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo **partitions;
-		TupleConversionMap **partition_tupconv_maps;
-		TupleTableSlot *partition_tuple_slot;
-		int			num_parted,
-					num_partitions;
+		PartitionTupleRouting *ptr;
 
 		ExecSetupPartitionTupleRouting(cstate->rel,
 									   1,
 									   estate,
-									   &partition_dispatch_info,
-									   &partitions,
-									   &partition_tupconv_maps,
-									   &partition_tuple_slot,
-									   &num_parted, &num_partitions);
-		cstate->partition_dispatch_info = partition_dispatch_info;
-		cstate->num_dispatch = num_parted;
-		cstate->partitions = partitions;
-		cstate->num_partitions = num_partitions;
-		cstate->partition_tupconv_maps = partition_tupconv_maps;
-		cstate->partition_tuple_slot = partition_tuple_slot;
+									   &cstate->partition_tuple_routing);
+
+		ptr = cstate->partition_tuple_routing;
 
 		/*
 		 * If we are capturing transition tuples, they may need to be
@@ -2504,11 +2488,11 @@ CopyFrom(CopyState cstate)
 			int			i;
 
 			cstate->transition_tupconv_maps = (TupleConversionMap **)
-				palloc0(sizeof(TupleConversionMap *) * cstate->num_partitions);
-			for (i = 0; i < cstate->num_partitions; ++i)
+				palloc0(sizeof(TupleConversionMap *) * ptr->num_partitions);
+			for (i = 0; i < ptr->num_partitions; ++i)
 			{
 				cstate->transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(cstate->partitions[i]->ri_RelationDesc),
+					convert_tuples_by_name(RelationGetDescr(ptr->partitions[i]->ri_RelationDesc),
 										   RelationGetDescr(cstate->rel),
 										   gettext_noop("could not convert row type"));
 			}
@@ -2528,7 +2512,7 @@ CopyFrom(CopyState cstate)
 	if ((resultRelInfo->ri_TrigDesc != NULL &&
 		 (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
 		  resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
-		cstate->partition_dispatch_info != NULL ||
+		cstate->partition_tuple_routing != NULL ||
 		cstate->volatile_defexprs)
 	{
 		useHeapMultiInsert = false;
@@ -2603,10 +2587,11 @@ CopyFrom(CopyState cstate)
 		ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 
 		/* Determine the partition to heap_insert the tuple into */
-		if (cstate->partition_dispatch_info)
+		if (cstate->partition_tuple_routing)
 		{
 			int			leaf_part_index;
 			TupleConversionMap *map;
+			PartitionTupleRouting *ptr = cstate->partition_tuple_routing;
 
 			/*
 			 * Away we go ... If we end up not finding a partition after all,
@@ -2617,11 +2602,11 @@ CopyFrom(CopyState cstate)
 			 * partition, respectively.
 			 */
 			leaf_part_index = ExecFindPartition(resultRelInfo,
-												cstate->partition_dispatch_info,
+												ptr->partition_dispatch_info,
 												slot,
 												estate);
 			Assert(leaf_part_index >= 0 &&
-				   leaf_part_index < cstate->num_partitions);
+				   leaf_part_index < ptr->num_partitions);
 
 			/*
 			 * If this tuple is mapped to a partition that is not same as the
@@ -2639,7 +2624,7 @@ CopyFrom(CopyState cstate)
 			 * to the selected partition.
 			 */
 			saved_resultRelInfo = resultRelInfo;
-			resultRelInfo = cstate->partitions[leaf_part_index];
+			resultRelInfo = ptr->partitions[leaf_part_index];
 
 			/* We do not yet have a way to insert into a foreign partition */
 			if (resultRelInfo->ri_FdwRoutine)
@@ -2686,7 +2671,7 @@ CopyFrom(CopyState cstate)
 			 * We might need to convert from the parent rowtype to the
 			 * partition rowtype.
 			 */
-			map = cstate->partition_tupconv_maps[leaf_part_index];
+			map = ptr->partition_tupconv_maps[leaf_part_index];
 			if (map)
 			{
 				Relation	partrel = resultRelInfo->ri_RelationDesc;
@@ -2698,7 +2683,7 @@ CopyFrom(CopyState cstate)
 				 * point on.  Use a dedicated slot from this point on until
 				 * we're finished dealing with the partition.
 				 */
-				slot = cstate->partition_tuple_slot;
+				slot = ptr->partition_tuple_slot;
 				Assert(slot != NULL);
 				ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
 				ExecStoreTuple(tuple, slot, InvalidBuffer, true);
@@ -2850,8 +2835,9 @@ CopyFrom(CopyState cstate)
 	ExecCloseIndices(resultRelInfo);
 
 	/* Close all the partitioned tables, leaf partitions, and their indices */
-	if (cstate->partition_dispatch_info)
+	if (cstate->partition_tuple_routing)
 	{
+		PartitionTupleRouting *ptr = cstate->partition_tuple_routing;
 		int			i;
 
 		/*
@@ -2860,23 +2846,23 @@ CopyFrom(CopyState cstate)
 		 * the main target table of COPY that will be closed eventually by
 		 * DoCopy().  Also, tupslot is NULL for the root partitioned table.
 		 */
-		for (i = 1; i < cstate->num_dispatch; i++)
+		for (i = 1; i < ptr->num_dispatch; i++)
 		{
-			PartitionDispatch pd = cstate->partition_dispatch_info[i];
+			PartitionDispatch pd = ptr->partition_dispatch_info[i];
 
 			heap_close(pd->reldesc, NoLock);
 			ExecDropSingleTupleTableSlot(pd->tupslot);
 		}
-		for (i = 0; i < cstate->num_partitions; i++)
+		for (i = 0; i < ptr->num_partitions; i++)
 		{
-			ResultRelInfo *resultRelInfo = cstate->partitions[i];
+			ResultRelInfo *resultRelInfo = ptr->partitions[i];
 
 			ExecCloseIndices(resultRelInfo);
 			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 		}
 
 		/* Release the standalone partition tuple descriptor */
-		ExecDropSingleTupleTableSlot(cstate->partition_tuple_slot);
+		ExecDropSingleTupleTableSlot(ptr->partition_tuple_slot);
 	}
 
 	/* Close any trigger target relations */
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 59a0ca4..4b9f451 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -42,22 +42,9 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
  * tuple routing for partitioned tables
  *
  * Output arguments:
- * 'pd' receives an array of PartitionDispatch objects with one entry for
- *		every partitioned table in the partition tree
- * 'partitions' receives an array of ResultRelInfo* objects with one entry for
- *		every leaf partition in the partition tree
- * 'tup_conv_maps' receives an array of TupleConversionMap objects with one
- *		entry for every leaf partition (required to convert input tuple based
- *		on the root table's rowtype to a leaf partition's rowtype after tuple
- *		routing is done)
- * 'partition_tuple_slot' receives a standalone TupleTableSlot to be used
- *		to manipulate any given leaf partition's rowtype after that partition
- *		is chosen by tuple-routing.
- * 'num_parted' receives the number of partitioned tables in the partition
- *		tree (= the number of entries in the 'pd' output array)
- * 'num_partitions' receives the number of leaf partitions in the partition
- *		tree (= the number of entries in the 'partitions' and 'tup_conv_maps'
- *		output arrays
+ *
+ * 'partition_tuple_routing' encapsulates all the partition related information
+ *		required to do tuple routing.
  *
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
@@ -66,29 +53,30 @@ void
 ExecSetupPartitionTupleRouting(Relation rel,
 							   Index resultRTindex,
 							   EState *estate,
-							   PartitionDispatch **pd,
-							   ResultRelInfo ***partitions,
-							   TupleConversionMap ***tup_conv_maps,
-							   TupleTableSlot **partition_tuple_slot,
-							   int *num_parted, int *num_partitions)
+							   PartitionTupleRouting **partition_tuple_routing)
 {
 	TupleDesc	tupDesc = RelationGetDescr(rel);
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
 	ResultRelInfo *leaf_part_rri;
+	PartitionTupleRouting *ptr;
 
 	/*
 	 * Get the information about the partition tree after locking all the
 	 * partitions.
 	 */
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-	*pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
-	*num_partitions = list_length(leaf_parts);
-	*partitions = (ResultRelInfo **) palloc(*num_partitions *
+	ptr = *partition_tuple_routing =
+		(PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
+	ptr->partition_dispatch_info =
+		RelationGetPartitionDispatchInfo(rel, &ptr->num_dispatch, &leaf_parts);
+	ptr->num_partitions = list_length(leaf_parts);
+	ptr->partitions = (ResultRelInfo **) palloc(ptr->num_partitions *
 											sizeof(ResultRelInfo *));
-	*tup_conv_maps = (TupleConversionMap **) palloc0(*num_partitions *
-													 sizeof(TupleConversionMap *));
+	ptr->partition_tupconv_maps =
+		(TupleConversionMap **) palloc0(ptr->num_partitions *
+										sizeof(TupleConversionMap *));
 
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
@@ -96,9 +84,9 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	 * (such as ModifyTableState) and released when the node finishes
 	 * processing.
 	 */
-	*partition_tuple_slot = MakeTupleTableSlot();
+	ptr->partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = (ResultRelInfo *) palloc0(*num_partitions *
+	leaf_part_rri = (ResultRelInfo *) palloc0(ptr->num_partitions *
 											  sizeof(ResultRelInfo));
 	i = 0;
 	foreach(cell, leaf_parts)
@@ -118,7 +106,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		 * Save a tuple conversion map to convert a tuple routed to this
 		 * partition from the parent's type to the partition's.
 		 */
-		(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
+		ptr->partition_tupconv_maps[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
 													 gettext_noop("could not convert row type"));
 
 		InitResultRelInfo(leaf_part_rri,
@@ -144,7 +132,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		(*partitions)[i] = leaf_part_rri++;
+		ptr->partitions[i] = leaf_part_rri++;
 		i++;
 	}
 }
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 1e3ece9..16789fa 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -279,32 +279,33 @@ ExecInsert(ModifyTableState *mtstate,
 	resultRelInfo = estate->es_result_relation_info;
 
 	/* Determine the partition to heap_insert the tuple into */
-	if (mtstate->mt_partition_dispatch_info)
+	if (mtstate->mt_partition_tuple_routing)
 	{
 		int			leaf_part_index;
+		PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;
 		TupleConversionMap *map;
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
 		 * ExecFindPartition() does not return and errors out instead.
 		 * Otherwise, the returned value is to be used as an index into arrays
-		 * mt_partitions[] and mt_partition_tupconv_maps[] that will get us
+		 * ptr->partitions[] and ptr->partition_tupconv_maps[] that will get us
 		 * the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
 		leaf_part_index = ExecFindPartition(resultRelInfo,
-											mtstate->mt_partition_dispatch_info,
+											ptr->partition_dispatch_info,
 											slot,
 											estate);
 		Assert(leaf_part_index >= 0 &&
-			   leaf_part_index < mtstate->mt_num_partitions);
+			   leaf_part_index < ptr->num_partitions);
 
 		/*
 		 * Save the old ResultRelInfo and switch to the one corresponding to
 		 * the selected partition.
 		 */
 		saved_resultRelInfo = resultRelInfo;
-		resultRelInfo = mtstate->mt_partitions[leaf_part_index];
+		resultRelInfo = ptr->partitions[leaf_part_index];
 
 		/* We do not yet have a way to insert into a foreign partition */
 		if (resultRelInfo->ri_FdwRoutine)
@@ -352,7 +353,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = mtstate->mt_partition_tupconv_maps[leaf_part_index];
+		map = ptr->partition_tupconv_maps[leaf_part_index];
 		if (map)
 		{
 			Relation	partrel = resultRelInfo->ri_RelationDesc;
@@ -364,7 +365,7 @@ ExecInsert(ModifyTableState *mtstate,
 			 * on, until we're finished dealing with the partition. Use the
 			 * dedicated slot for that.
 			 */
-			slot = mtstate->mt_partition_tuple_slot;
+			slot = ptr->partition_tuple_slot;
 			Assert(slot != NULL);
 			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
 			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
@@ -1500,9 +1501,10 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 		mtstate->mt_oc_transition_capture != NULL)
 	{
 		int			numResultRelInfos;
+		PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;
 
-		numResultRelInfos = (mtstate->mt_partition_tuple_slot != NULL ?
-							 mtstate->mt_num_partitions :
+		numResultRelInfos = (ptr != NULL ?
+							 ptr->num_partitions :
 							 mtstate->mt_nplans);
 
 		/*
@@ -1515,13 +1517,13 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
 
 		/* Choose the right set of partitions */
-		if (mtstate->mt_partition_dispatch_info != NULL)
+		if (ptr != NULL)
 		{
 			/*
 			 * For tuple routing among partitions, we need TupleDescs based on
 			 * the partition routing table.
 			 */
-			ResultRelInfo **resultRelInfos = mtstate->mt_partitions;
+			ResultRelInfo **resultRelInfos = ptr->partitions;
 
 			for (i = 0; i < numResultRelInfos; ++i)
 			{
@@ -1833,6 +1835,8 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	PartitionTupleRouting *ptr = NULL;
+	int			num_partitions = 0;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -1946,27 +1950,13 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	if (operation == CMD_INSERT &&
 		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo **partitions;
-		TupleConversionMap **partition_tupconv_maps;
-		TupleTableSlot *partition_tuple_slot;
-		int			num_parted,
-					num_partitions;
-
 		ExecSetupPartitionTupleRouting(rel,
 									   node->nominalRelation,
 									   estate,
-									   &partition_dispatch_info,
-									   &partitions,
-									   &partition_tupconv_maps,
-									   &partition_tuple_slot,
-									   &num_parted, &num_partitions);
-		mtstate->mt_partition_dispatch_info = partition_dispatch_info;
-		mtstate->mt_num_dispatch = num_parted;
-		mtstate->mt_partitions = partitions;
-		mtstate->mt_num_partitions = num_partitions;
-		mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
-		mtstate->mt_partition_tuple_slot = partition_tuple_slot;
+									   &mtstate->mt_partition_tuple_routing);
+
+		ptr = mtstate->mt_partition_tuple_routing;
+		num_partitions = ptr->num_partitions;
 	}
 
 	/*
@@ -2009,7 +1999,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
 	 * cases are handled above.
 	 */
-	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
+	if (node->withCheckOptionLists != NIL && num_partitions > 0)
 	{
 		List	   *wcoList;
 		PlanState  *plan;
@@ -2026,14 +2016,14 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			   mtstate->mt_nplans == 1);
 		wcoList = linitial(node->withCheckOptionLists);
 		plan = mtstate->mt_plans[0];
-		for (i = 0; i < mtstate->mt_num_partitions; i++)
+		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *mapped_wcoList;
 			List	   *wcoExprs = NIL;
 			ListCell   *ll;
 
-			resultRelInfo = mtstate->mt_partitions[i];
+			resultRelInfo = ptr->partitions[i];
 			partrel = resultRelInfo->ri_RelationDesc;
 
 			/* varno = node->nominalRelation */
@@ -2102,12 +2092,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * are handled above.
 		 */
 		returningList = linitial(node->returningLists);
-		for (i = 0; i < mtstate->mt_num_partitions; i++)
+		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *rlist;
 
-			resultRelInfo = mtstate->mt_partitions[i];
+			resultRelInfo = ptr->partitions[i];
 			partrel = resultRelInfo->ri_RelationDesc;
 
 			/* varno = node->nominalRelation */
@@ -2376,29 +2366,33 @@ ExecEndModifyTable(ModifyTableState *node)
 	/*
 	 * Close all the partitioned tables, leaf partitions, and their indices
 	 *
-	 * Remember node->mt_partition_dispatch_info[0] corresponds to the root
+	 * Remember ptr->partition_dispatch_info[0] corresponds to the root
 	 * partitioned table, which we must not try to close, because it is the
 	 * main target table of the query that will be closed by ExecEndPlan().
 	 * Also, tupslot is NULL for the root partitioned table.
 	 */
-	for (i = 1; i < node->mt_num_dispatch; i++)
+	if (node->mt_partition_tuple_routing)
 	{
-		PartitionDispatch pd = node->mt_partition_dispatch_info[i];
+		PartitionTupleRouting *ptr = node->mt_partition_tuple_routing;
 
-		heap_close(pd->reldesc, NoLock);
-		ExecDropSingleTupleTableSlot(pd->tupslot);
-	}
-	for (i = 0; i < node->mt_num_partitions; i++)
-	{
-		ResultRelInfo *resultRelInfo = node->mt_partitions[i];
+		for (i = 1; i < ptr->num_dispatch; i++)
+		{
+			PartitionDispatch pd = ptr->partition_dispatch_info[i];
 
-		ExecCloseIndices(resultRelInfo);
-		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
-	}
+			heap_close(pd->reldesc, NoLock);
+			ExecDropSingleTupleTableSlot(pd->tupslot);
+		}
+		for (i = 0; i < ptr->num_partitions; i++)
+		{
+			ResultRelInfo *resultRelInfo = ptr->partitions[i];
+			ExecCloseIndices(resultRelInfo);
+			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
+		}
 
-	/* Release the standalone partition tuple descriptor, if any */
-	if (node->mt_partition_tuple_slot)
-		ExecDropSingleTupleTableSlot(node->mt_partition_tuple_slot);
+		/* Release the standalone partition tuple descriptor, if any */
+		if (ptr->partition_tuple_slot)
+			ExecDropSingleTupleTableSlot(ptr->partition_tuple_slot);
+	}
 
 	/*
 	 * Free the exprcontext
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 43ca990..8a7cedf 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -49,14 +49,44 @@ typedef struct PartitionDispatchData
 
 typedef struct PartitionDispatchData *PartitionDispatch;
 
+/*-----------------------
+ * PartitionTupleRouting - Encapsulates all information required to execute
+ * tuple-routing between partitions.
+ *
+ * partition_dispatch_info		Array of PartitionDispatch objects with one
+ *								entry for every partitioned table in the
+ *								partition tree.
+ * num_dispatch					number of partitioned tables in the partition
+ *								tree (= length of partition_dispatch_info[])
+ * partitions					Array of ResultRelInfo* objects with one entry
+ *								for every leaf partition in the partition tree.
+ * num_partitions				Number of leaf partitions in the partition tree
+ *								(= 'partitions' array length)
+ * partition_tupconv_maps		Array of TupleConversionMap objects with one
+ *								entry for every leaf partition (required to
+ *								convert input tuple based on the root table's
+ *								rowtype to a leaf partition's rowtype after
+ *								tuple routing is done)
+ * partition_tuple_slot			TupleTableSlot to be used to manipulate any
+ *								given leaf partition's rowtype after that
+ *								partition is chosen for insertion by
+ *								tuple-routing.
+ *-----------------------
+ */
+typedef struct PartitionTupleRouting
+{
+	PartitionDispatch *partition_dispatch_info;
+	int			num_dispatch;
+	ResultRelInfo **partitions;
+	int			num_partitions;
+	TupleConversionMap **partition_tupconv_maps;
+	TupleTableSlot *partition_tuple_slot;
+} PartitionTupleRouting;
+
 extern void ExecSetupPartitionTupleRouting(Relation rel,
 							   Index resultRTindex,
 							   EState *estate,
-							   PartitionDispatch **pd,
-							   ResultRelInfo ***partitions,
-							   TupleConversionMap ***tup_conv_maps,
-							   TupleTableSlot **partition_tuple_slot,
-							   int *num_parted, int *num_partitions);
+							   PartitionTupleRouting **partition_tuple_routing);
 extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e05bc04..6b481b4 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -976,15 +976,8 @@ typedef struct ModifyTableState
 	TupleTableSlot *mt_existing;	/* slot to store existing target tuple in */
 	List	   *mt_excludedtlist;	/* the excluded pseudo relation's tlist  */
 	TupleTableSlot *mt_conflproj;	/* CONFLICT ... SET ... projection target */
-	struct PartitionDispatchData **mt_partition_dispatch_info;
+	struct PartitionTupleRouting *mt_partition_tuple_routing;
 	/* Tuple-routing support info */
-	int			mt_num_dispatch;	/* Number of entries in the above array */
-	int			mt_num_partitions;	/* Number of members in the following
-									 * arrays */
-	ResultRelInfo **mt_partitions;	/* Per partition result relation pointers */
-	TupleConversionMap **mt_partition_tupconv_maps;
-	/* Per partition tuple conversion map */
-	TupleTableSlot *mt_partition_tuple_slot;
 	struct TransitionCaptureState *mt_transition_capture;
 	/* controls transition table population for specified operation */
 	struct TransitionCaptureState *mt_oc_transition_capture;
[Attachment: update-partition-key_v27.patch (application/octet-stream)]
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index e6f50ec..1517757 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3005,6 +3005,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition whose partition constraint the updated row satisfies.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3297,9 +3302,22 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</command> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</command> causes a row to move from one
+       partition to another, there is a chance that another concurrent
+       <command>UPDATE</command> or <command>DELETE</command> misses this row.
+       Suppose session 1 is performing an <command>UPDATE</command> on a
+       partition key, and meanwhile a concurrent session 2, for which this
+       row is visible, performs an <command>UPDATE</command> or
+       <command>DELETE</command> operation on this row. Session 2 can silently
+       miss the row if the row is deleted from the partition due to session
+       1's activity.  In such a case, session 2's
+       <command>UPDATE</command>/<command>DELETE</command>, being unaware of
+       the row movement, interprets that the row has just been deleted, so
+       there is nothing to be done for this row.  By contrast, in the usual
+       case where the table is not partitioned, or where there is no row
+       movement, session 2 would have identified the newly updated row and
+       carried out the <command>UPDATE</command>/<command>DELETE</command> on
+       this new row version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index c0d0f71..3c665f0 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition.  In
+   that case, if there is some other partition in the partition tree whose
+   partition constraint this row satisfies, the row is moved to that
+   partition.  If there is no such partition, an error will occur.  The error
+   will also occur when updating a partition directly.  Behind the scenes, the
+   row movement is actually a <command>DELETE</command> and
+   <command>INSERT</command> operation.  However, there is a possibility that
+   a concurrent <command>UPDATE</command> or <command>DELETE</command> on the
+   same row may miss this row.  For details see
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index bf5d3f9..aaffc4d 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -154,6 +154,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to move
+    to another partition, it will be performed as a <command>DELETE</command>
+    from the original partition followed by <command>INSERT</command> into the
+    new partition. In this case, all row-level <literal>BEFORE</literal>
+    <command>UPDATE</command> triggers and all row-level
+    <literal>BEFORE</literal> <command>DELETE</command> triggers are fired on
+    the original partition. Then all row-level <literal>BEFORE</literal>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</literal>
+    triggers are concerned, <literal>AFTER</literal> <command>DELETE</command>
+    and <literal>AFTER</literal> <command>INSERT</command> triggers are
+    applied; but <literal>AFTER</literal> <command>UPDATE</command> triggers
+    are not applied because the <command>UPDATE</command> has been converted to
+    a <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 2bf8117..4f34b03 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1441,7 +1441,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
 
 /*
  * map_partition_varattnos - maps varattno of any Vars in expr from the
- * parent attno to partition attno.
+ * attnos of 'from_rel' to the attnos of 'to_rel'.  Each of the rels can be
+ * either a leaf partition or a partitioned table.
  *
  * We must allow for cases where physical attnos of a partition can be
  * different from the parent's.
@@ -1454,8 +1455,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
  * are working on Lists, so it's less messy to do the casts internally.
  */
 List *
-map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row)
 {
 	bool		my_found_whole_row = false;
@@ -1464,14 +1465,14 @@ map_partition_varattnos(List *expr, int target_varno,
 	{
 		AttrNumber *part_attnos;
 
-		part_attnos = convert_tuples_by_name_map(RelationGetDescr(partrel),
-												 RelationGetDescr(parent),
+		part_attnos = convert_tuples_by_name_map(RelationGetDescr(to_rel),
+												 RelationGetDescr(from_rel),
 												 gettext_noop("could not convert row type"));
 		expr = (List *) map_variable_attnos((Node *) expr,
-											target_varno, 0,
+											fromrel_varno, 0,
 											part_attnos,
-											RelationGetDescr(parent)->natts,
-											RelationGetForm(partrel)->reltype,
+											RelationGetDescr(from_rel)->natts,
+											RelationGetForm(to_rel)->reltype,
 											&my_found_whole_row);
 	}
 
@@ -2594,6 +2595,69 @@ get_partition_for_tuple(Relation relation, Datum *values, bool *isnull)
 }
 
 /*
+ * Checks if any of the 'attnums' is a partition key attribute for rel
+ *
+ * Sets *used_in_expr if any of the 'attnums' is found to be referenced in some
+ * partition key expression.  It's possible for a column to be both used
+ * directly and as part of an expression; if that happens, *used_in_expr may
+ * end up as either true or false.  That's OK for current uses of this
+ * function, because *used_in_expr is only used to tailor the error message
+ * text.
+ */
+bool
+has_partition_attrs(Relation rel, Bitmapset *attnums, bool *used_in_expr)
+{
+	PartitionKey key;
+	int			partnatts;
+	List	   *partexprs;
+	ListCell   *partexprs_item;
+	int			i;
+
+	if (attnums == NULL || rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		return false;
+
+	key = RelationGetPartitionKey(rel);
+	partnatts = get_partition_natts(key);
+	partexprs = get_partition_exprs(key);
+
+	partexprs_item = list_head(partexprs);
+	for (i = 0; i < partnatts; i++)
+	{
+		AttrNumber	partattno = get_partition_col_attnum(key, i);
+
+		if (partattno != 0)
+		{
+			if (bms_is_member(partattno - FirstLowInvalidHeapAttributeNumber,
+							  attnums))
+			{
+				if (used_in_expr)
+					*used_in_expr = false;
+				return true;
+			}
+		}
+		else
+		{
+			/* Arbitrary expression */
+			Node	   *expr = (Node *) lfirst(partexprs_item);
+			Bitmapset  *expr_attrs = NULL;
+
+			/* Find all attributes referenced */
+			pull_varattnos(expr, 1, &expr_attrs);
+			partexprs_item = lnext(partexprs_item);
+
+			if (bms_overlap(attnums, expr_attrs))
+			{
+				if (used_in_expr)
+					*used_in_expr = true;
+				return true;
+			}
+		}
+	}
+
+	return false;
+}
+
+/*
  * qsort_partition_hbound_cmp
  *
  * We sort hash bounds by modulus, then by remainder.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 61ead28..322e326 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2471,6 +2471,8 @@ CopyFrom(CopyState cstate)
 		PartitionTupleRouting *ptr;
 
 		ExecSetupPartitionTupleRouting(cstate->rel,
+									   NULL,
+									   0,
 									   1,
 									   estate,
 									   &cstate->partition_tuple_routing);
@@ -2671,7 +2673,7 @@ CopyFrom(CopyState cstate)
 			 * We might need to convert from the parent rowtype to the
 			 * partition rowtype.
 			 */
-			map = ptr->partition_tupconv_maps[leaf_part_index];
+			map = ptr->parentchild_tupconv_maps[leaf_part_index];
 			if (map)
 			{
 				Relation	partrel = resultRelInfo->ri_RelationDesc;
@@ -2734,7 +2736,7 @@ CopyFrom(CopyState cstate)
 
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr || check_partition_constr)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index d979ce2..64c2185 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -468,7 +468,6 @@ static void RangeVarCallbackForDropRelation(const RangeVar *rel, Oid relOid,
 								Oid oldRelOid, void *arg);
 static void RangeVarCallbackForAlterRelation(const RangeVar *rv, Oid relid,
 								 Oid oldrelid, void *arg);
-static bool is_partition_attr(Relation rel, AttrNumber attnum, bool *used_in_expr);
 static PartitionSpec *transformPartitionSpec(Relation rel, PartitionSpec *partspec, char *strategy);
 static void ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs,
 					  List **partexprs, Oid *partopclass, Oid *partcollation, char strategy);
@@ -6492,68 +6491,6 @@ ATPrepDropColumn(List **wqueue, Relation rel, bool recurse, bool recursing,
 }
 
 /*
- * Checks if attnum is a partition attribute for rel
- *
- * Sets *used_in_expr if attnum is found to be referenced in some partition
- * key expression.  It's possible for a column to be both used directly and
- * as part of an expression; if that happens, *used_in_expr may end up as
- * either true or false.  That's OK for current uses of this function, because
- * *used_in_expr is only used to tailor the error message text.
- */
-static bool
-is_partition_attr(Relation rel, AttrNumber attnum, bool *used_in_expr)
-{
-	PartitionKey key;
-	int			partnatts;
-	List	   *partexprs;
-	ListCell   *partexprs_item;
-	int			i;
-
-	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
-		return false;
-
-	key = RelationGetPartitionKey(rel);
-	partnatts = get_partition_natts(key);
-	partexprs = get_partition_exprs(key);
-
-	partexprs_item = list_head(partexprs);
-	for (i = 0; i < partnatts; i++)
-	{
-		AttrNumber	partattno = get_partition_col_attnum(key, i);
-
-		if (partattno != 0)
-		{
-			if (attnum == partattno)
-			{
-				if (used_in_expr)
-					*used_in_expr = false;
-				return true;
-			}
-		}
-		else
-		{
-			/* Arbitrary expression */
-			Node	   *expr = (Node *) lfirst(partexprs_item);
-			Bitmapset  *expr_attrs = NULL;
-
-			/* Find all attributes referenced */
-			pull_varattnos(expr, 1, &expr_attrs);
-			partexprs_item = lnext(partexprs_item);
-
-			if (bms_is_member(attnum - FirstLowInvalidHeapAttributeNumber,
-							  expr_attrs))
-			{
-				if (used_in_expr)
-					*used_in_expr = true;
-				return true;
-			}
-		}
-	}
-
-	return false;
-}
-
-/*
  * Return value is the address of the dropped column.
  */
 static ObjectAddress
@@ -6613,7 +6550,9 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
 						colName)));
 
 	/* Don't drop columns used in the partition key */
-	if (is_partition_attr(rel, attnum, &is_expr))
+	if (has_partition_attrs(rel,
+							bms_make_singleton(attnum - FirstLowInvalidHeapAttributeNumber),
+							&is_expr))
 	{
 		if (!is_expr)
 			ereport(ERROR,
@@ -8837,7 +8776,9 @@ ATPrepAlterColumnType(List **wqueue,
 						colName)));
 
 	/* Don't alter columns used in the partition key */
-	if (is_partition_attr(rel, attnum, &is_expr))
+	if (has_partition_attrs(rel,
+							bms_make_singleton(attnum - FirstLowInvalidHeapAttributeNumber),
+							&is_expr))
 	{
 		if (!is_expr)
 			ereport(ERROR,
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 92ae382..73ec872 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2854,8 +2854,13 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 	{
 		HeapTuple	trigtuple;
 
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
+		/*
+		 * Note: if the UPDATE is converted into a DELETE+INSERT as part of
+		 * update-partition-key operation, then this function is also called
+		 * separately for DELETE and INSERT to capture transition table rows.
+		 * In such a case, either the old tuple or the new tuple can be NULL.
+		 */
+		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
 			trigtuple = GetTupleForTrigger(estate,
 										   NULL,
 										   relinfo,
@@ -5414,7 +5419,12 @@ AfterTriggerPendingOnRel(Oid relid)
  *	triggers actually need to be queued.  It is also called after each row,
  *	even if there are no triggers for that event, if there are any AFTER
  *	STATEMENT triggers for the statement which use transition tables, so that
- *	the transition tuplestores can be built.
+ *	the transition tuplestores can be built.  Furthermore, if the transition
+ *	capture is happening for UPDATEd rows being moved to another partition due
+ *	partition-key change, then this function is called once when the row is
+ *	deleted (to capture OLD row), and once when the row is inserted to another
+ *	partition (to capture NEW row).  This is done separately because DELETE and
+ *	INSERT happen on different tables.
  *
  *	Transition tuplestores are built now, rather than when events are pulled
  *	off of the queue because AFTER ROW triggers are allowed to select from the
@@ -5463,12 +5473,25 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		bool		update_new_table = transition_capture->tcs_update_new_table;
+		bool		insert_new_table = transition_capture->tcs_insert_new_table;
 
-		if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_old_table))
+		/*
+		 * For INSERT events newtup should be non-NULL, for DELETE events
+		 * oldtup should be non-NULL, whereas for UPDATE events normally both
+		 * oldtup and newtup are non-NULL.  But for UPDATE event fired for
+		 * capturing transition tuples during UPDATE partition-key row
+		 * movement, oldtup is NULL when the event is for row being inserted,
+		 * whereas newtup is NULL when the event is for row being deleted.
+		 */
+		Assert(!(event == TRIGGER_EVENT_DELETE && delete_old_table &&
+				 oldtup == NULL));
+		Assert(!(event == TRIGGER_EVENT_INSERT && insert_new_table &&
+				 newtup == NULL));
+
+		if (oldtup != NULL &&
+			((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_old_table)))
 		{
 			Tuplestorestate *old_tuplestore;
 
-			Assert(oldtup != NULL);
 			old_tuplestore = transition_capture->tcs_private->old_tuplestore;
 
 			if (map != NULL)
@@ -5481,12 +5504,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			else
 				tuplestore_puttuple(old_tuplestore, oldtup);
 		}
-		if ((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_new_table))
+		if (newtup != NULL &&
+			((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_new_table)))
 		{
 			Tuplestorestate *new_tuplestore;
 
-			Assert(newtup != NULL);
 			new_tuplestore = transition_capture->tcs_private->new_tuplestore;
 
 			if (original_insert_tuple != NULL)
@@ -5502,11 +5525,17 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				tuplestore_puttuple(new_tuplestore, newtup);
 		}
 
-		/* If transition tables are the only reason we're here, return. */
+		/*
+		 * If transition tables are the only reason we're here, return. As
+		 * mentioned above, we can also be here during update tuple routing in
+		 * the presence of transition tables; this function is then called
+		 * separately for oldtup and newtup, so either can be NULL, not both.
+		 */
 		if (trigdesc == NULL ||
 			(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
 			(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+			(event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
 			return;
 	}
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index dbaa47f..5ec92d5 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1849,16 +1849,12 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  *
  * Exported in executor.h for outside use.
+ * Returns true if it meets the partition constraint, else returns false.
  */
-void
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
-	Relation	rel = resultRelInfo->ri_RelationDesc;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	Bitmapset  *modifiedCols;
-	Bitmapset  *insertedCols;
-	Bitmapset  *updatedCols;
 	ExprContext *econtext;
 
 	/*
@@ -1886,52 +1882,69 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
 	 */
-	if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+	return ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext);
+}
+
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
+	/*
+	 * Need to first convert the tuple to the root partitioned table's row
+	 * type. For details, check similar comments in ExecConstraints().
+	 */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
+									 gettext_noop("could not convert row type"));
+		if (map != NULL)
 		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
-										 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecSetSlotDescriptor(slot, tupdesc);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+			tuple = do_convert_tuple(tuple, map);
+			ExecSetSlotDescriptor(slot, tupdesc);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("new row for relation \"%s\" violates partition constraint",
-						RelationGetRelationName(orig_rel)),
-				 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+			 errmsg("new row for relation \"%s\" violates partition constraint",
+					RelationGetRelationName(orig_rel)),
+			 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if
+ * requested, checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1939,7 +1952,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2055,8 +2069,9 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 }
 
 
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 4b9f451..f0ed6ea 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -41,6 +41,13 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
  * ExecSetupPartitionTupleRouting - set up information needed during
  * tuple routing for partitioned tables
  *
+ * 'update_rri' contains the UPDATE per-subplan result rels.  For the output
+ *		param 'partitions', we don't allocate new ResultRelInfo objects for
+ *		leaf partitions that are already present in 'update_rri'.
+ *
+ * 'num_update_rri' is the number of elements in the 'update_rri' array, or
+ *		zero for INSERT.
+ *
  * Output arguments:
  *
  * 'partition_tuple_routing' encapsulates all the partition related information
@@ -51,6 +58,8 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
  */
 void
 ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionTupleRouting **partition_tuple_routing)
@@ -59,7 +68,9 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
+	ResultRelInfo *leaf_part_arr = NULL;
+	int			update_rri_index = 0;
+	bool		is_update = (num_update_rri > 0);
 	PartitionTupleRouting *ptr;
 
 	/*
@@ -74,10 +85,48 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	ptr->num_partitions = list_length(leaf_parts);
 	ptr->partitions = (ResultRelInfo **) palloc(ptr->num_partitions *
 											sizeof(ResultRelInfo *));
-	ptr->partition_tupconv_maps =
+	ptr->parentchild_tupconv_maps =
 		(TupleConversionMap **) palloc0(ptr->num_partitions *
 										sizeof(TupleConversionMap *));
 
+	if (is_update)
+	{
+		/*
+		 * For updates, if the leaf partition is already present in the
+		 * per-subplan result rels, we reuse it rather than initializing a
+		 * new result rel.  The per-subplan result rels and the result rels
+		 * of the leaf partitions are in the same canonical order, so while
+		 * walking through the leaf partition oids we only need to track the
+		 * next per-subplan result rel to look for.  Hence, initialize
+		 * update_rri_index to point at the first per-subplan result rel,
+		 * and advance it each time we find a match while scanning the leaf
+		 * partition oids.
+		 */
+		update_rri_index = 0;
+
+		/*
+		 * Prepare for generating the mapping from subplan result rels to leaf
+		 * partition position.
+		 */
+		ptr->subplan_partition_offsets = palloc(num_update_rri * sizeof(int));
+
+		/*
+		 * For UPDATEs, we require an additional tuple slot for storing
+		 * transient tuples that are converted to the root table descriptor.
+		 */
+		ptr->root_tuple_slot = MakeTupleTableSlot();
+	}
+	else
+	{
+		/*
+		 * For inserts, we need to create all new result rels, so avoid
+		 * repeated pallocs by allocating memory for all the result rels in
+		 * bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(ptr->num_partitions *
+												  sizeof(ResultRelInfo));
+	}
+
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
 	 * given partition's rowtype.  It is attached to the caller-specified node
@@ -86,37 +135,80 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	 */
 	ptr->partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = (ResultRelInfo *) palloc0(ptr->num_partitions *
-											  sizeof(ResultRelInfo));
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (is_update)
+		{
+			/* Is this leaf partition present in the update resultrel? */
+			if (update_rri_index < num_update_rri &&
+				RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
+			{
+				leaf_part_rri = &update_rri[update_rri_index];
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required when we convert the partition's tuple to be
+				 * compatible with the root partitioned table's tuple
+				 * descriptor.  When generating the per-subplan UPDATE result
+				 * rels, this was not set.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+
+				/*
+				 * Save the position of this update rel in the leaf partitions
+				 * array
+				 */
+				ptr->subplan_partition_offsets[update_rri_index] = i;
+
+				update_rri_index++;
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = leaf_part_arr + i;
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in *partitions are eventually
-		 * closed by the caller.
+		 * If we didn't open the partition rel, it means we haven't initialized
+		 * the result rel either.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above including the leaf
+			 * partitions. Note that each of the newly opened relations in
+			 * *partitions are eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri,
+							  partrel,
+							  resultRTindex,
+							  rel,
+							  estate->es_instrument);
+		}
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
 		 * Save a tuple conversion map to convert a tuple routed to this
 		 * partition from the parent's type to the partition's.
 		 */
-		ptr->partition_tupconv_maps[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
+		ptr->parentchild_tupconv_maps[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
 													 gettext_noop("could not convert row type"));
 
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  resultRTindex,
-						  rel,
-						  estate->es_instrument);
-
 		/*
-		 * Verify result relation is a valid target for INSERT.
+		 * Verify that the result relation is a valid target for an insert
+		 * operation.  Even for updates we are here for tuple routing, so we
+		 * still need to check validity for an insert.
 		 */
 		CheckValidResultRel(leaf_part_rri, CMD_INSERT);
 
@@ -132,9 +224,15 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		ptr->partitions[i] = leaf_part_rri++;
+		ptr->partitions[i] = leaf_part_rri;
 		i++;
 	}
+
+	/*
+	 * For UPDATE, we should have found all the per-subplan resultrels in the
+	 * leaf partitions.
+	 */
+	Assert(!is_update || update_rri_index == num_update_rri);
 }
 
 /*
@@ -165,8 +263,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 	/* start with the root partitioned table */
 	parent = pd[0];
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index fb538c0..e11f7cb 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -401,7 +401,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -466,7 +466,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 16789fa..31fda19 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -46,6 +46,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -63,7 +64,16 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 EState *estate,
 					 bool canSetTag,
 					 TupleTableSlot **returning);
-
+static void ExecSetupChildParentMap(ModifyTableState *mtstate,
+						ResultRelInfo *rootRelInfo,
+						int numResultRelInfos, bool perleaf);
+static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
+													int whichplan);
+static HeapTuple ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+										   TupleConversionMap *map,
+										   HeapTuple tuple,
+										   TupleTableSlot *new_slot,
+										   TupleTableSlot **p_my_slot);
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
  * target relation's rowtype
@@ -241,6 +251,38 @@ ExecCheckTIDVisible(EState *estate,
 	ReleaseBuffer(buffer);
 }
 
+/*
+ * ConvertPartitionTupleSlot -- convenience function for tuple conversion using
+ * 'map'.  The converted tuple is stored in 'new_slot', and '*p_my_slot' is
+ * updated to point to 'new_slot'; 'new_slot' should typically be one of the
+ * dedicated partition tuple slots.  If 'map' is NULL, '*p_my_slot' is left
+ * unchanged.
+ *
+ * Returns the converted tuple, or the original tuple if 'map' is NULL.
+ */
+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+						  TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot,
+						  TupleTableSlot **p_my_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor, as per converted tuple.
+	 */
+	*p_my_slot = new_slot;
+	Assert(new_slot != NULL);
+	ExecSetSlotDescriptor(new_slot, map->outdesc);
+	ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
 /* ----------------------------------------------------------------
  *		ExecInsert
  *
@@ -266,6 +308,9 @@ ExecInsert(ModifyTableState *mtstate,
 	Oid			newId;
 	List	   *recheckIndexes = NIL;
 	TupleTableSlot *result = NULL;
+	TransitionCaptureState *transition_capture;
+
+	transition_capture = mtstate->mt_transition_capture;
 
 	/*
 	 * get the heap tuple out of the tuple table slot, making sure we have a
@@ -283,14 +328,13 @@ ExecInsert(ModifyTableState *mtstate,
 	{
 		int			leaf_part_index;
 		PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;
-		TupleConversionMap *map;
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
 		 * ExecFindPartition() does not return and errors out instead.
 		 * Otherwise, the returned value is to be used as an index into arrays
-		 * ptr->partitions[] and ptr->partition_tupconv_maps[] that will get us
-		 * the ResultRelInfo and TupleConversionMap for the partition,
+		 * ptr->partitions[] and ptr->parentchild_tupconv_maps[] that will get
+		 * us the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
 		leaf_part_index = ExecFindPartition(resultRelInfo,
@@ -332,8 +376,10 @@ ExecInsert(ModifyTableState *mtstate,
 				 * back to tuplestore format.
 				 */
 				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+
+				Assert(mtstate->mt_is_tupconv_perpart == true);
 				mtstate->mt_transition_capture->tcs_map =
-					mtstate->mt_transition_tupconv_maps[leaf_part_index];
+					mtstate->mt_childparent_tupconv_maps[leaf_part_index];
 			}
 			else
 			{
@@ -346,30 +392,21 @@ ExecInsert(ModifyTableState *mtstate,
 			}
 		}
 		if (mtstate->mt_oc_transition_capture != NULL)
+		{
+			Assert(mtstate->mt_is_tupconv_perpart == true);
 			mtstate->mt_oc_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[leaf_part_index];
+				mtstate->mt_childparent_tupconv_maps[leaf_part_index];
+		}
 
 		/*
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = ptr->partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = ptr->partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(mtstate,
+										  ptr->parentchild_tupconv_maps[leaf_part_index],
+										  tuple,
+										  ptr->partition_tuple_slot,
+										  &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -487,7 +524,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 		/* Check the constraints of the tuple */
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -623,9 +660,32 @@ ExecInsert(ModifyTableState *mtstate,
 		setLastTid(&(tuple->t_self));
 	}
 
+	/*
+	 * If this INSERT is part of a partition-key-UPDATE and we are capturing
+	 * transition tables, put this row into the transition NEW TABLE.
+	 * (The deleted row is similarly added to the OLD TABLE.)  We must do
+	 * this separately for DELETE and INSERT because they happen on different
+	 * tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_new_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo, NULL,
+							 NULL,
+							 tuple,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * Now that we have already captured the NEW TABLE row, the AR INSERT
+		 * trigger below must not capture it again, so clear transition_capture.
+		 */
+		transition_capture = NULL;
+	}
+
 	/* AFTER ROW INSERT Triggers */
 	ExecARInsertTriggers(estate, resultRelInfo, tuple, recheckIndexes,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
 	list_free(recheckIndexes);
 
@@ -679,6 +739,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool *tuple_deleted,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -686,6 +748,12 @@ ExecDelete(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
+	TransitionCaptureState *transition_capture;
+
+	transition_capture = mtstate->mt_transition_capture;
+
+	if (tuple_deleted)
+		*tuple_deleted = false;
 
 	/*
 	 * get information on the (current) result relation
@@ -850,12 +918,39 @@ ldelete:;
 	if (canSetTag)
 		(estate->es_processed)++;
 
+	/* The delete has actually happened, so inform the caller */
+	if (tuple_deleted)
+		*tuple_deleted = true;
+
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition OLD TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_old_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo,
+							 tupleid,
+							 oldtuple,
+							 NULL,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * Now that we have already captured the OLD TABLE row, the AR DELETE
+		 * trigger below must not capture it again, so clear transition_capture.
+		 */
+		transition_capture = NULL;
+	}
+
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -948,6 +1043,7 @@ ExecUpdate(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	TupleConversionMap *saved_tcs_map = NULL;
 
 	/*
 	 * abort the operation if not running transactions
@@ -1044,12 +1140,117 @@ lreplace:;
 								 resultRelInfo, slot, estate);
 
 		/*
+		 * If a partition check fails, try to move the row into the right
+		 * partition.
+		 */
+		if (resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate))
+		{
+			bool		tuple_deleted;
+			TupleTableSlot *ret_slot;
+			PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;
+			int			map_index;
+			TupleConversionMap *tupconv_map;
+
+			/*
+			 * When an UPDATE is run directly on a leaf partition, we will not
+			 * have set up partition tuple routing.  In that case, fail with a
+			 * partition constraint violation error.
+			 */
+			if (ptr == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for DELETE. We want to return rows
+			 * from INSERT.
+			 */
+			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
+					   &tuple_deleted, false, false);
+
+			/*
+			 * If the DELETE did not actually happen (e.g. a trigger prevented
+			 * it, or the row was already deleted by this command itself or by
+			 * a concurrent transaction), we must skip the INSERT as well;
+			 * otherwise we would effectively insert one extra row.
+			 *
+			 * For a normal UPDATE, the case where the tuple has been the
+			 * subject of a concurrent UPDATE or DELETE would be handled by
+			 * the EvalPlanQual machinery, but for an UPDATE that we've
+			 * translated into a DELETE from this partition and an INSERT into
+			 * some other partition, that's not available, because CTID chains
+			 * can't span relation boundaries.  We mimic the semantics to a
+			 * limited extent by skipping the INSERT if the DELETE fails to
+			 * find a tuple. This ensures that two concurrent attempts to
+			 * UPDATE the same tuple at the same time can't turn one tuple
+			 * into two, and that an UPDATE of a just-deleted tuple can't
+			 * resurrect it.
+			 */
+			if (!tuple_deleted)
+				return NULL;
+
+			/*
+			 * UPDATEs set the transition capture map only when a new subplan
+			 * is chosen.  But for INSERTs, it is set for each row.  So after
+			 * the INSERT, we need to revert to the map created for the UPDATE;
+			 * otherwise the next UPDATE will incorrectly use the one created
+			 * for the INSERT.  So first save the one created for the UPDATE.
+			 */
+			if (mtstate->mt_transition_capture)
+				saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
+
+			/*
+			 * resultRelInfo is one of the per-subplan resultRelInfos.  So we
+			 * should convert the tuple into root's tuple descriptor, since
+			 * ExecInsert() starts the search from root.  The tuple conversion
+			 * map list is in the order of mtstate->resultRelInfo[], so to
+			 * retrieve the one for this resultRel, we need to know the
+			 * position of the resultRel in mtstate->resultRelInfo[].
+			 */
+			map_index = resultRelInfo - mtstate->resultRelInfo;
+			Assert(map_index >= 0 && map_index < mtstate->mt_nplans);
+			tupconv_map = tupconv_map_for_subplan(mtstate, map_index);
+			tuple = ConvertPartitionTupleSlot(mtstate,
+											  tupconv_map,
+											  tuple,
+											  ptr->root_tuple_slot,
+											  &slot);
+
+
+			/*
+			 * For ExecInsert(), make it look like we are inserting into the
+			 * root.
+			 */
+			Assert(mtstate->rootResultRelInfo != NULL);
+			estate->es_result_relation_info = mtstate->rootResultRelInfo;
+
+			ret_slot = ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+
+			/*
+			 * Revert back the active result relation and the active transition
+			 * capture map that we changed above.
+			 */
+			estate->es_result_relation_info = resultRelInfo;
+			if (mtstate->mt_transition_capture)
+			{
+				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+				mtstate->mt_transition_capture->tcs_map = saved_tcs_map;
+			}
+			return ret_slot;
+		}
+
+		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We've already checked the partition constraint above; however, we
+		 * must still ensure the tuple passes all other constraints, so we will
+		 * call ExecConstraints() and have it validate all remaining checks.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1477,7 +1678,6 @@ static void
 ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 {
 	ResultRelInfo *targetRelInfo = getASTriggerResultRelInfo(mtstate);
-	int			i;
 
 	/* Check for transition tables on the directly targeted relation. */
 	mtstate->mt_transition_capture =
@@ -1507,55 +1707,142 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 							 ptr->num_partitions :
 							 mtstate->mt_nplans);
 
+		ExecSetupChildParentMap(mtstate, targetRelInfo, numResultRelInfos,
+								(ptr != NULL));
+
 		/*
-		 * Build array of conversion maps from each child's TupleDesc to the
-		 * one used in the tuplestore.  The map pointers may be NULL when no
-		 * conversion is necessary, which is hopefully a common case for
-		 * partitions.
+		 * Install the conversion map for the first plan for UPDATE and DELETE
+		 * operations.  It will be advanced each time we switch to the next
+		 * plan.  (INSERT operations set it every time, so we need not update
+		 * mtstate->mt_oc_transition_capture here.)
 		 */
-		mtstate->mt_transition_tupconv_maps = (TupleConversionMap **)
-			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+		if (mtstate->mt_transition_capture && mtstate->operation != CMD_INSERT)
+			mtstate->mt_transition_capture->tcs_map =
+				tupconv_map_for_subplan(mtstate, 0);
+	}
+}
 
-		/* Choose the right set of partitions */
-		if (ptr != NULL)
-		{
-			/*
-			 * For tuple routing among partitions, we need TupleDescs based on
-			 * the partition routing table.
-			 */
-			ResultRelInfo **resultRelInfos = ptr->partitions;
+/*
+ * Initialize the child-to-root tuple conversion map array.
+ *
+ * This map array is required for two purposes:
+ * 1. For update tuple routing: we need to convert the tuple from the subplan
+ * result rel to the root partitioned table descriptor.
+ * 2. For capturing transition tables that are partitions: for UPDATEs, we need
+ * to convert the tuple from the subplan result rel to the target table
+ * descriptor, and for INSERTs, we need to convert the inserted tuple from the
+ * leaf partition to the target table descriptor.
+ *
+ * The caller can request either a per-subplan map or per-leaf-partition map.
+ */
+static void
+ExecSetupChildParentMap(ModifyTableState *mtstate,
+						ResultRelInfo *rootRelInfo,
+						int numResultRelInfos, bool perleaf)
+{
+	TupleDesc	outdesc;
+	int			i;
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
-		}
-		else
-		{
-			/* Otherwise we need the ResultRelInfo for each subplan. */
-			ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+	/* First check if there is already one */
+	if (mtstate->mt_childparent_tupconv_maps)
+	{
+		/*
+		 * If a per-leaf map is requested, any map that has already been
+		 * created must be per-leaf: if it were per-subplan, we would not be
+		 * able to access the maps leaf-partition-wise.  A per-leaf map, on
+		 * the other hand, can also be accessed subplan-wise through the
+		 * subplan_partition_offsets array, using the function
+		 * tupconv_map_for_subplan().  So callers that might need to access
+		 * the map both leaf-partition-wise and subplan-wise must make sure
+		 * that the first time this function is called, it is called with
+		 * perleaf=true, so that the map created is per-leaf rather than
+		 * per-subplan.
+		 */
+		Assert(!(perleaf && !mtstate->mt_is_tupconv_perpart));
+		return;
+	}
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
+	/* Get tuple descriptor of the root partitioned table. */
+	outdesc = RelationGetDescr(rootRelInfo->ri_RelationDesc);
+
+	/*
+	 * Build array of conversion maps from each child's TupleDesc to the
+	 * one used in the tuplestore.  The map pointers may be NULL when no
+	 * conversion is necessary, which is hopefully a common case for
+	 * partitions.
+	 */
+	mtstate->mt_childparent_tupconv_maps = (TupleConversionMap **)
+		palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+
+	/* Choose the right set of partitions */
+	if (perleaf)
+	{
+		/*
+		 * For tuple routing among partitions, we need TupleDescs based on the
+		 * partition routing table.
+		 */
+		ResultRelInfo **resultRelInfos;
+
+		Assert(mtstate->mt_partition_tuple_routing != NULL);
+		resultRelInfos = mtstate->mt_partition_tuple_routing->partitions;
+
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_childparent_tupconv_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
+									   outdesc,
+									   gettext_noop("could not convert row type"));
 		}
 
 		/*
-		 * Install the conversion map for the first plan for UPDATE and DELETE
-		 * operations.  It will be advanced each time we switch to the next
-		 * plan.  (INSERT operations set it every time, so we need not update
-		 * mtstate->mt_oc_transition_capture here.)
+		 * Record the fact that the tuple conversion map is per-leaf, not
+		 * per-subplan.
 		 */
-		if (mtstate->mt_transition_capture)
-			mtstate->mt_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[0];
+		mtstate->mt_is_tupconv_perpart = true;
+	}
+	else
+	{
+		/* Otherwise we need the ResultRelInfo for each subplan. */
+		ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_childparent_tupconv_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
+									   outdesc,
+									   gettext_noop("could not convert row type"));
+		}
+	}
+
+}
+
+/*
+ * For a given subplan index, get the tuple conversion map.
+ */
+static TupleConversionMap *
+tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
+{
+	Assert(mtstate->mt_childparent_tupconv_maps != NULL);
+
+	/*
+	 * If the tuple conversion map array is per-partition, we need to first get
+	 * the index into the partition array.
+	 */
+	if (mtstate->mt_is_tupconv_perpart)
+	{
+		int leaf_index;
+		PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;
+
+		Assert(ptr && ptr->subplan_partition_offsets != NULL);
+		leaf_index = ptr->subplan_partition_offsets[whichplan];
+
+		Assert(leaf_index >= 0 && leaf_index < ptr->num_partitions);
+		return mtstate->mt_childparent_tupconv_maps[leaf_index];
+	}
+	else
+	{
+		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+		return mtstate->mt_childparent_tupconv_maps[whichplan];
 	}
 }
 
@@ -1662,15 +1949,13 @@ ExecModifyTable(PlanState *pstate)
 				/* Prepare to convert transition tuples from this child. */
 				if (node->mt_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
 					node->mt_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						tupconv_map_for_subplan(node, node->mt_whichplan);
 				}
 				if (node->mt_oc_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
 					node->mt_oc_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						tupconv_map_for_subplan(node, node->mt_whichplan);
 				}
 				continue;
 			}
@@ -1787,7 +2072,8 @@ ExecModifyTable(PlanState *pstate)
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1832,9 +2118,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ResultRelInfo *resultRelInfo;
 	TupleDesc	tupDesc;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	bool		update_tuple_routing_needed = node->partKeyUpdated;
 	PartitionTupleRouting *ptr = NULL;
 	int			num_partitions = 0;
 
@@ -1909,6 +2198,16 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(resultRelInfo, mtstate->mt_onconflict != ONCONFLICT_NONE);
 
+		/*
+		 * If this is an UPDATE and a BEFORE UPDATE trigger is present, the
+		 * trigger itself might modify the partition-key values. So arrange for
+		 * tuple routing.
+		 */
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+			operation == CMD_UPDATE)
+			update_tuple_routing_needed = true;
+
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
 		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
@@ -1946,17 +2245,36 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/*
+	 * If it's not a partitioned table after all, UPDATE tuple routing should
+	 * not be attempted.
+	 */
+	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		update_tuple_routing_needed = false;
+
+	/*
+	 * Build state for tuple routing if it's an INSERT or if it's an UPDATE of
+	 * partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || update_tuple_routing_needed))
 	{
 		ExecSetupPartitionTupleRouting(rel,
+									   mtstate->resultRelInfo,
+									   (operation == CMD_UPDATE ? nplans : 0),
 									   node->nominalRelation,
 									   estate,
 									   &mtstate->mt_partition_tuple_routing);
 
 		ptr = mtstate->mt_partition_tuple_routing;
 		num_partitions = ptr->num_partitions;
+
+		/*
+		 * These are needed as reference objects for mapping partition
+		 * attnos in expressions such as WithCheckOptions and RETURNING.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	}
 
 	/*
@@ -1967,6 +2285,18 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		ExecSetupTransitionCaptureState(mtstate, estate);
 
 	/*
+	 * Construct mapping from each of the per-subplan partition attnos to the
+	 * root attno.  This is required when during update row movement the tuple
+	 * descriptor of a source partition does not match the root partitioned
+	 * table descriptor.  In such a case we need to convert tuples to the root
+	 * tuple descriptor, because the search for destination partition starts
+	 * from the root.  Skip this setup if it's not a partition key update.
+	 */
+	if (update_tuple_routing_needed)
+		ExecSetupChildParentMap(mtstate, getASTriggerResultRelInfo(mtstate),
+								mtstate->mt_nplans, false);
+
+	/*
 	 * Initialize any WITH CHECK OPTION constraints if needed.
 	 */
 	resultRelInfo = mtstate->resultRelInfo;
@@ -1996,26 +2326,29 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE row
+	 * movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *first_wcoList;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition.  For UPDATE, by contrast, there are as many WCO lists as
+		 * there are plans.  In either case, use the WCO expression of the
+		 * first resultRelInfo as a reference to compute the attnos for the
+		 * WCO expression of each partition.  We make a copy of the WCO qual
+		 * for each partition.  Note that, if there are SubPlans in there,
+		 * they all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
-			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
+		Assert(update_tuple_routing_needed ||
+			   (operation == CMD_INSERT &&
+				list_length(node->withCheckOptionLists) == 1 &&
+				mtstate->mt_nplans == 1));
+
+		first_wcoList = linitial(node->withCheckOptionLists);
 		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
@@ -2024,17 +2357,26 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			ListCell   *ll;
 
 			resultRelInfo = ptr->partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the update
+			 * result rels, that result rel would already have WithCheckOptions
+			 * initialized.
+			 */
+			if (resultRelInfo->ri_WithCheckOptions)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel, NULL);
+			mapped_wcoList = map_partition_varattnos(first_wcoList,
+													firstVarno,
+													partrel, firstResultRel,
+													NULL);
 			foreach(ll, mapped_wcoList)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
@@ -2051,7 +2393,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -2088,22 +2430,35 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *rlist;
 
 			resultRelInfo = ptr->partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the update
+			 * result rels, that result rel would already have a returningList
+			 * built.
+			 */
+			if (resultRelInfo->ri_projectReturning)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel, NULL);
+			/*
+			 * Use the returning expression of the first resultRelInfo as a
+			 * reference to calculate attno's for the returning expression of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList,
+											firstVarno,
+											partrel, firstResultRel, NULL);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
@@ -2348,6 +2703,7 @@ void
 ExecEndModifyTable(ModifyTableState *node)
 {
 	int			i;
+	CmdType		operation = node->operation;
 
 	/*
 	 * Allow any FDWs to shut down
@@ -2385,11 +2741,24 @@ ExecEndModifyTable(ModifyTableState *node)
 		for (i = 0; i < ptr->num_partitions; i++)
 		{
 			ResultRelInfo *resultRelInfo = ptr->partitions[i];
+
+			/*
+			 * If this result rel is one of the subplan result rels, let
+			 * ExecEndPlan() close it. For INSERTs, this does not apply because
+			 * leaf partition result rels are always newly allocated.
+			 */
+			if (operation == CMD_UPDATE &&
+				resultRelInfo >= node->resultRelInfo &&
+				resultRelInfo < node->resultRelInfo + node->mt_nplans)
+				continue;
+
 			ExecCloseIndices(resultRelInfo);
 			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 		}
 
-		/* Release the standalone partition tuple descriptor, if any */
+		/* Release the standalone partition tuple descriptors, if any */
+		if (ptr->root_tuple_slot)
+			ExecDropSingleTupleTableSlot(ptr->root_tuple_slot);
 		if (ptr->partition_tuple_slot)
 			ExecDropSingleTupleTableSlot(ptr->partition_tuple_slot);
 	}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index d9ff8a7..0f2f970 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -204,6 +204,7 @@ _copyModifyTable(const ModifyTable *from)
 	COPY_SCALAR_FIELD(canSetTag);
 	COPY_SCALAR_FIELD(nominalRelation);
 	COPY_NODE_FIELD(partitioned_rels);
+	COPY_SCALAR_FIELD(partKeyUpdated);
 	COPY_NODE_FIELD(resultRelations);
 	COPY_SCALAR_FIELD(resultRelIndex);
 	COPY_SCALAR_FIELD(rootResultRelIndex);
@@ -2261,6 +2262,7 @@ _copyPartitionedChildRelInfo(const PartitionedChildRelInfo *from)
 
 	COPY_SCALAR_FIELD(parent_relid);
 	COPY_NODE_FIELD(child_rels);
+	COPY_SCALAR_FIELD(is_partition_key_update);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 2866fd7..6e2e3dd 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -908,6 +908,7 @@ _equalPartitionedChildRelInfo(const PartitionedChildRelInfo *a, const Partitione
 {
 	COMPARE_SCALAR_FIELD(parent_relid);
 	COMPARE_NODE_FIELD(child_rels);
+	COMPARE_SCALAR_FIELD(is_partition_key_update);
 
 	return true;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index c97ee24..a5e71a2 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -372,6 +372,7 @@ _outModifyTable(StringInfo str, const ModifyTable *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(partKeyUpdated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_INT_FIELD(resultRelIndex);
 	WRITE_INT_FIELD(rootResultRelIndex);
@@ -2103,6 +2104,7 @@ _outModifyTablePath(StringInfo str, const ModifyTablePath *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(partKeyUpdated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_NODE_FIELD(subroots);
@@ -2525,6 +2527,7 @@ _outPartitionedChildRelInfo(StringInfo str, const PartitionedChildRelInfo *node)
 
 	WRITE_UINT_FIELD(parent_relid);
 	WRITE_NODE_FIELD(child_rels);
+	WRITE_BOOL_FIELD(is_partition_key_update);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 7eb67fc0..9542b94 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1568,6 +1568,7 @@ _readModifyTable(void)
 	READ_BOOL_FIELD(canSetTag);
 	READ_UINT_FIELD(nominalRelation);
 	READ_NODE_FIELD(partitioned_rels);
+	READ_BOOL_FIELD(partKeyUpdated);
 	READ_NODE_FIELD(resultRelations);
 	READ_INT_FIELD(resultRelIndex);
 	READ_INT_FIELD(rootResultRelIndex);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 44f6b03..be34463 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1359,7 +1359,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 			case RTE_RELATION:
 				if (rte->relkind == RELKIND_PARTITIONED_TABLE)
 					partitioned_rels =
-						get_partitioned_child_rels(root, rel->relid);
+						get_partitioned_child_rels(root, rel->relid, NULL);
 				break;
 			case RTE_SUBQUERY:
 				build_partitioned_rels = true;
@@ -1397,7 +1397,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		{
 			List	   *cprels;
 
-			cprels = get_partitioned_child_rels(root, childrel->relid);
+			cprels = get_partitioned_child_rels(root, childrel->relid, NULL);
 			partitioned_rels = list_concat(partitioned_rels,
 										   list_copy(cprels));
 		}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index d445477..549821e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -278,6 +278,7 @@ static ProjectSet *make_project_set(List *tlist, Plan *subplan);
 static ModifyTable *make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool partKeyUpdated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
@@ -2371,6 +2372,7 @@ create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 							best_path->canSetTag,
 							best_path->nominalRelation,
 							best_path->partitioned_rels,
+							best_path->partKeyUpdated,
 							best_path->resultRelations,
 							subplans,
 							best_path->withCheckOptionLists,
@@ -6428,6 +6430,7 @@ static ModifyTable *
 make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool partKeyUpdated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam)
@@ -6454,6 +6457,7 @@ make_modifytable(PlannerInfo *root,
 	node->canSetTag = canSetTag;
 	node->nominalRelation = nominalRelation;
 	node->partitioned_rels = partitioned_rels;
+	node->partKeyUpdated = partKeyUpdated;
 	node->resultRelations = resultRelations;
 	node->resultRelIndex = -1;	/* will be set correctly in setrefs.c */
 	node->rootResultRelIndex = -1;	/* will be set correctly in setrefs.c */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index ef2eaea..ce26bbe 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1101,6 +1101,7 @@ inheritance_planner(PlannerInfo *root)
 	Query	   *parent_parse;
 	Bitmapset  *parent_relids = bms_make_singleton(top_parentRTindex);
 	PlannerInfo **parent_roots = NULL;
+	bool		partColsUpdated = false;
 
 	Assert(parse->commandType != CMD_INSERT);
 
@@ -1172,7 +1173,8 @@ inheritance_planner(PlannerInfo *root)
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		nominalRelation = top_parentRTindex;
-		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex);
+		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex,
+													  &partColsUpdated);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
 	}
@@ -1512,6 +1514,7 @@ inheritance_planner(PlannerInfo *root)
 									 parse->canSetTag,
 									 nominalRelation,
 									 partitioned_rels,
+									 partColsUpdated,
 									 resultRelations,
 									 subpaths,
 									 subroots,
@@ -2123,6 +2126,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 										parse->canSetTag,
 										parse->resultRelation,
 										NIL,
+										false,
 										list_make1_int(parse->resultRelation),
 										list_make1(path),
 										list_make1(root),
@@ -6152,17 +6156,22 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 /*
  * get_partitioned_child_rels
  *		Returns a list of the RT indexes of the partitioned child relations
- *		with rti as the root parent RT index.
+ *		with rti as the root parent RT index. Also sets is_partition_key_update
+ *		to true if any of the root rte's updated columns is a partition key.
  *
  * Note: This function might get called even for range table entries that
  * are not partitioned tables; in such a case, it will simply return NIL.
  */
 List *
-get_partitioned_child_rels(PlannerInfo *root, Index rti)
+get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   bool *is_partition_key_update)
 {
 	List	   *result = NIL;
 	ListCell   *l;
 
+	if (is_partition_key_update)
+		*is_partition_key_update = false;
+
 	foreach(l, root->pcinfo_list)
 	{
 		PartitionedChildRelInfo *pc = lfirst_node(PartitionedChildRelInfo, l);
@@ -6170,6 +6179,8 @@ get_partitioned_child_rels(PlannerInfo *root, Index rti)
 		if (pc->parent_relid == rti)
 		{
 			result = pc->child_rels;
+			if (is_partition_key_update)
+				*is_partition_key_update = pc->is_partition_key_update;
 			break;
 		}
 	}
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index f620243..7babb35 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -105,7 +105,8 @@ static void expand_partitioned_rtentry(PlannerInfo *root,
 						   RangeTblEntry *parentrte,
 						   Index parentRTindex, Relation parentrel,
 						   PlanRowMark *top_parentrc, LOCKMODE lockmode,
-						   List **appinfos, List **partitioned_child_rels);
+						   List **appinfos, List **partitioned_child_rels,
+						   bool *is_partition_key_update);
 static void expand_single_inheritance_child(PlannerInfo *root,
 								RangeTblEntry *parentrte,
 								Index parentRTindex, Relation parentrel,
@@ -1466,16 +1467,19 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	if (RelationGetPartitionDesc(oldrelation) != NULL)
 	{
 		List	   *partitioned_child_rels = NIL;
+		bool		is_partition_key_update = false;
 
 		Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
 
 		/*
 		 * If this table has partitions, recursively expand them in the order
-		 * in which they appear in the PartitionDesc.
+		 * in which they appear in the PartitionDesc.  While at it, also note
+		 * whether any partition key columns are being updated.
 		 */
 		expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
 								   lockmode, &root->append_rel_list,
-								   &partitioned_child_rels);
+								   &partitioned_child_rels,
+								   &is_partition_key_update);
 
 		/*
 		 * We keep a list of objects in root, each of which maps a root
@@ -1492,6 +1496,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 			pcinfo = makeNode(PartitionedChildRelInfo);
 			pcinfo->parent_relid = rti;
 			pcinfo->child_rels = partitioned_child_rels;
+			pcinfo->is_partition_key_update = is_partition_key_update;
 			root->pcinfo_list = lappend(root->pcinfo_list, pcinfo);
 		}
 	}
@@ -1568,7 +1573,8 @@ static void
 expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 						   Index parentRTindex, Relation parentrel,
 						   PlanRowMark *top_parentrc, LOCKMODE lockmode,
-						   List **appinfos, List **partitioned_child_rels)
+						   List **appinfos, List **partitioned_child_rels,
+						   bool *is_partition_key_update)
 {
 	int			i;
 	RangeTblEntry *childrte;
@@ -1583,6 +1589,17 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 
 	Assert(parentrte->inh);
 
+	/*
+	 * Note down whether any partition key cols are being updated. Though it's
+	 * the root partitioned table's updatedCols we are interested in, we
+	 * instead use parentrte to get the updatedCols. This is convenient because
+	 * parentrte already has the root partrel's updatedCols translated to match
+	 * the attribute ordering of parentrel.
+	 */
+	if (!*is_partition_key_update)
+		*is_partition_key_update =
+			has_partition_attrs(parentrel, parentrte->updatedCols, NULL);
+
 	/* First expand the partitioned table itself. */
 	expand_single_inheritance_child(root, parentrte, parentRTindex, parentrel,
 									top_parentrc, parentrel,
@@ -1622,7 +1639,8 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 		if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 			expand_partitioned_rtentry(root, childrte, childRTindex,
 									   childrel, top_parentrc, lockmode,
-									   appinfos, partitioned_child_rels);
+									   appinfos, partitioned_child_rels,
+									   is_partition_key_update);
 
 		/* Close child relation, but keep locks */
 		heap_close(childrel, NoLock);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 02bbbc0..75288da 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3207,6 +3207,8 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partitioned_rels' is an integer list of RT indexes of non-leaf tables in
  *		the partition tree, if this is an UPDATE/DELETE to a partitioned table.
  *		Otherwise NIL.
+ * 'partKeyUpdated' is true if any partitioning columns are being updated,
+ *		either from the named relation or a descendant partitioned table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
  * 'subpaths' is a list of Path(s) producing source data (one per rel)
  * 'subroots' is a list of PlannerInfo structs (one per rel)
@@ -3220,6 +3222,7 @@ ModifyTablePath *
 create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool partKeyUpdated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
@@ -3287,6 +3290,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->canSetTag = canSetTag;
 	pathnode->nominalRelation = nominalRelation;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
+	pathnode->partKeyUpdated = partKeyUpdated;
 	pathnode->resultRelations = resultRelations;
 	pathnode->subpaths = subpaths;
 	pathnode->subroots = subroots;
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 2983cfa..ff49ecc 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -54,12 +54,16 @@ extern void check_new_partition_bound(char *relname, Relation parent,
 extern Oid	get_partition_parent(Oid relid);
 extern List *get_qual_from_partbound(Relation rel, Relation parent,
 						PartitionBoundSpec *spec);
-extern List *map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+extern List *map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
-
+extern void pull_child_partition_columns(Relation rel,
+							 Relation parent,
+							 Bitmapset **partcols);
+extern bool has_partition_attrs(Relation rel, Bitmapset *attnums,
+							bool *used_in_expr);
 extern Oid	get_default_oid_from_partdesc(PartitionDesc partdesc);
 extern Oid	get_default_partition_oid(Oid parentId);
 extern void update_default_partition_oid(Oid parentId, Oid defaultPartId);
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 8a7cedf..24d66dc 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -62,11 +62,14 @@ typedef struct PartitionDispatchData *PartitionDispatch;
  *								for every leaf partition in the partition tree.
  * num_partitions				Number of leaf partitions in the partition tree
  *								(= 'partitions' array length)
- * partition_tupconv_maps		Array of TupleConversionMap objects with one
+ * parentchild_tupconv_maps		Array of TupleConversionMap objects with one
  *								entry for every leaf partition (required to
  *								convert input tuple based on the root table's
  *								rowtype to a leaf partition's rowtype after
  *								tuple routing is done)
+ * subplan_partition_offsets	Array of indexes into the 'partitions' array,
+ *								ordered by UPDATE subplans; each element gives
+ *								the position of that subplan's result rel.
  * partition_tuple_slot			TupleTableSlot to be used to manipulate any
  *								given leaf partition's rowtype after that
  *								partition is chosen for insertion by
@@ -79,11 +82,15 @@ typedef struct PartitionTupleRouting
 	int			num_dispatch;
 	ResultRelInfo **partitions;
 	int			num_partitions;
-	TupleConversionMap **partition_tupconv_maps;
+	TupleConversionMap **parentchild_tupconv_maps;
+	int		   *subplan_partition_offsets;
 	TupleTableSlot *partition_tuple_slot;
+	TupleTableSlot *root_tuple_slot;
 } PartitionTupleRouting;
 
 extern void ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionTupleRouting **partition_tuple_routing);
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index b5578f5..5a385e2 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,9 +187,12 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
-extern void ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
 				   TupleTableSlot *slot, EState *estate);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 6b481b4..23f985e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -982,8 +982,10 @@ typedef struct ModifyTableState
 	/* controls transition table population for specified operation */
 	struct TransitionCaptureState *mt_oc_transition_capture;
 	/* controls transition table population for INSERT...ON CONFLICT UPDATE */
-	TupleConversionMap **mt_transition_tupconv_maps;
-	/* Per plan/partition tuple conversion */
+	TupleConversionMap **mt_childparent_tupconv_maps;
+	/* Per plan/partition map for tuple conversion from child to root */
+	bool		mt_is_tupconv_perpart;	/* Is the above map per-partition? */
+	/* Stores position of update result rels in leaf partitions */
 } ModifyTableState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 9b38d44..b36dafc 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -219,6 +219,7 @@ typedef struct ModifyTable
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		partKeyUpdated;	/* some part key in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	int			resultRelIndex; /* index of first resultRel in plan's list */
 	int			rootResultRelIndex; /* index of the partitioned table root */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 51df8e9..1448663 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1667,6 +1667,7 @@ typedef struct ModifyTablePath
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		partKeyUpdated;	/* some part key in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	List	   *subpaths;		/* Path(s) producing source data */
 	List	   *subroots;		/* per-target-table PlannerInfos */
@@ -2117,6 +2118,9 @@ typedef struct PartitionedChildRelInfo
 
 	Index		parent_relid;
 	List	   *child_rels;
+	bool		is_partition_key_update;	/* is the partition key of any of
+											 * the partitioned tables
+											 * updated? */
 } PartitionedChildRelInfo;
 
 /*
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e9ed16a..39ce47d 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -238,6 +238,7 @@ extern ModifyTablePath *create_modifytable_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool partKeyUpdated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index 2801bfd..9f0533c 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -57,7 +57,8 @@ extern Expr *preprocess_phv_expression(PlannerInfo *root, Expr *expr);
 
 extern bool plan_cluster_use_sort(Oid tableOid, Oid indexOid);
 
-extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti);
+extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   bool *is_partition_key_update);
 extern List *get_partitioned_child_rels_for_join(PlannerInfo *root,
 									Relids join_relids);
 
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index b69ceaa..aaf5d53 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,36 +198,371 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+---------------------------
+-- UPDATE with row movement
+---------------------------
+-- An update to a partition should check the partition bound constraint for the
+-- new tuple. If the partition key is updated, the row should be moved to the
+-- appropriate partition. Updatable views using partitions should enforce the
+-- check options for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+create table part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+alter table range_parted attach partition part_b_20_b_30 for values from ('b', 20) to ('b', 30);
+create table part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+alter table range_parted attach partition part_b_10_b_20 for values from ('b', 10) to ('b', 20);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) partition by range (abs(d));
+alter table part_c_100_200 drop column e, drop column c, drop column a;
+alter table part_c_100_200 add column c numeric, add column e varchar, add column a text;
+alter table part_c_100_200 drop column b;
+alter table part_c_100_200 add column b bigint;
+create table part_d_1_15 partition of part_c_100_200 for values from (1) to (15);
+create table part_d_15_20 partition of part_c_100_200 for values from (15) to (20);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+(6 rows)
+
+-- The subplans should appear in partition bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+             QUERY PLAN              
+-------------------------------------
+ Update on range_parted
+   Update on part_a_1_a_10
+   Update on part_a_10_a_20
+   Update on part_b_1_b_10
+   Update on part_c_1_100
+   Update on part_d_1_15
+   Update on part_d_15_20
+   Update on part_b_20_b_30
+   ->  Seq Scan on part_a_1_a_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_a_10_a_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_1_b_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_1_100
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_1_15
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_15_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_20_b_30
+         Filter: (c > '97'::numeric)
+(22 rows)
+
+-- fail (row movement happens only within the partition subtree):
+update part_c_100_200 set c = c - 20, d = c where c = 105;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (105, 85, null, b, 15).
+-- fail (no partition key update, so no attempt to move tuple, but "a = 'a'" violates partition constraint enforced by root partition)
+update part_b_10_b_20 set a = 'a';
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (null, 1, 96, 12, a).
+-- success; partition key update, no constraint violation
+update range_parted set d = d - 10 where d > 10;
+-- success; no partition key update, no constraint violation
+update range_parted set e = d;
+-- No row found:
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d | e 
+----------------+---+----+-----+---+---
+ part_a_10_a_20 | a | 10 | 200 | 1 | 1
+ part_a_1_a_10  | a |  1 |   1 | 1 | 1
+ part_d_1_15    | b | 12 | 116 | 1 | 1
+ part_d_1_15    | b | 13 | 117 | 2 | 2
+ part_d_1_15    | b | 15 | 125 | 6 | 6
+ part_d_1_15    | b | 17 | 125 | 9 | 9
+(6 rows)
+
+-- fail (row movement happens only within the partition subtree):
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_d_1_15" violates partition constraint
+DETAIL:  Failing row contains (2, 117, 2, b, 7).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_a_1_a_10 | a |  4 | 200 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+(6 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120, 1, 1).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120, 1, 1).
+-- succeeds, row movement, check option passes
+update upview set a = 'b', b = 15 where b = 4;
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+ part_d_1_15   | b | 15 | 199 | 1 | 1
+(6 rows)
+
+-- cleanup
+drop view upview;
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+ range_parted  | a | b  | c  | d  | e 
+---------------+---+----+----+----+---
+ (b,15,95,16,) | b | 15 | 95 | 16 | 
+ (b,17,95,19,) | b | 17 | 95 | 19 | 
+(2 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  95 | 16 | 
+ part_c_1_100   | b | 17 |  95 | 19 | 
+(6 rows)
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,110,1,), (b,13,98,2,), (b,15,106,16,), (b,17,106,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 13 |  98 |  2 | 
+ part_d_15_20   | b | 15 | 106 | 16 | 
+ part_d_15_20   | b | 17 | 106 | 19 | 
+ part_d_1_15    | b | 12 | 110 |  1 | 
+(6 rows)
+
+:init_range_parted;
+-- Enabling OLD TABLE capture for both DELETE as well as UPDATE stmt triggers
+-- should not cause DELETEd rows to be captured twice. Similar thing for
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,146,1,), (b,13,147,2,), (b,15,155,16,), (b,17,155,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_15_20   | b | 15 | 155 | 16 | 
+ part_d_15_20   | b | 17 | 155 | 19 | 
+ part_d_1_15    | b | 12 | 146 |  1 | 
+ part_d_1_15    | b | 13 | 147 |  2 | 
+(6 rows)
+
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d1_15 before update or insert on part_d_1_15
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d15_20 before update or insert on part_d_15_20
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 15 |  98 |  2 | 
+ part_d_15_20   | b | 17 | 106 | 16 | 
+ part_d_15_20   | b | 19 | 106 | 19 | 
+ part_d_1_15    | b | 15 | 110 |  1 | 
+(6 rows)
+
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_15_20   | b | 17 | 155 | 16 | 
+ part_d_15_20   | b | 19 | 155 | 19 | 
+ part_d_1_15    | b | 15 | 146 |  1 | 
+ part_d_1_15    | b | 16 | 147 |  2 | 
+(6 rows)
+
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_d1_15 ON part_d_1_15;
+drop trigger trig_d15_20 ON part_d_15_20;
+drop function func_parted_mod_b();
+-- statement triggers with update row movement
+---------------------------------------------------
+:init_range_parted;
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_1_15
+create trigger d1_delete_trig
+  after delete on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_update_trig
+  after update on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_insert_trig
+  after insert on part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+create trigger d15_delete_trig
+  after delete on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_update_trig
+  after update on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_insert_trig
+  after insert on part_d_15_20 for each statement execute procedure trigfunc();
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+NOTICE:  trigger = parent_update_trig fired on table range_parted during UPDATE
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 150 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  55 | 16 | 
+ part_c_1_100   | b | 17 |  55 | 19 | 
+(6 rows)
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger d1_delete_trig ON part_d_1_15;
+drop trigger d1_update_trig ON part_d_1_15;
+drop trigger d1_insert_trig ON part_d_1_15;
+drop trigger d15_delete_trig ON part_d_15_20;
+drop trigger d15_update_trig ON part_d_15_20;
+drop trigger d15_insert_trig ON part_d_15_20;
+drop table mintab;
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
-                                  Table "public.part_def"
- Column |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description 
---------+---------+-----------+----------+---------+----------+--------------+-------------
- a      | text    |           |          |         | extended |              | 
- b      | integer |           |          |         | plain    |              | 
+                                       Table "public.part_def"
+ Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description 
+--------+-------------------+-----------+----------+---------+----------+--------------+-------------
+ a      | text              |           |          |         | extended |              | 
+ b      | bigint            |           |          |         | plain    |              | 
+ c      | numeric           |           |          |         | main     |              | 
+ d      | integer           |           |          |         | plain    |              | 
+ e      | character varying |           |          |         | extended |              | 
 Partition of: range_parted DEFAULT
-Partition constraint: (NOT ((a IS NOT NULL) AND (b IS NOT NULL) AND (((a = 'a'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'a'::text) AND (b >= 10) AND (b < 20)) OR ((a = 'b'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'b'::text) AND (b >= 10) AND (b < 20)))))
+Partition constraint: (NOT ((a IS NOT NULL) AND (b IS NOT NULL) AND (((a = 'a'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'a'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'b'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '20'::bigint) AND (b < '30'::bigint)))))
 
 insert into range_parted values ('c', 9);
 -- ok
@@ -235,7 +570,55 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 ERROR:  new row for relation "part_def" violates partition constraint
-DETAIL:  Failing row contains (a, 9).
+DETAIL:  Failing row contains (a, 9, null, null, null).
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from non-default to default partition.
+-- Fail, default partition is not under part_a_10_a_20;
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+ERROR:  new row for relation "part_a_10_a_20" violates partition constraint
+DETAIL:  Failing row contains (ad, 10, 200, 1, null).
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+:show_data;
+ partname | a  | b  |  c  | d  | e 
+----------+----+----+-----+----+---
+ part_def | ad |  1 |   1 |  1 | 
+ part_def | ad | 10 | 200 |  1 | 
+ part_def | bd | 12 |  96 |  1 | 
+ part_def | bd | 13 |  97 |  2 | 
+ part_def | bd | 15 | 105 | 16 | 
+ part_def | bd | 17 | 105 | 19 | 
+ part_def | d  |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
 create table list_parted (
 	a text,
 	b int
@@ -250,6 +633,111 @@ ERROR:  new row for relation "list_default" violates partition constraint
 DETAIL:  Failing row contains (a, 10).
 -- ok
 update list_default set a = 'x' where a = 'd';
+drop table list_parted;
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+ERROR:  new row for relation "sub_part2" violates partition constraint
+DETAIL:  Failing row contains (2, 10, 2).
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Triggers can cause UPDATE row movement if it modified partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+drop trigger parted_mod_b ON sub_part1 ;
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+drop trigger trig_skip_delete ON sub_part1 ;
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 |  1 | 70
+ list_part1 | 2 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop table non_parted;
+drop function func_parted_mod_b();
 -- create custom operator class and hash function, for the same reason
 -- explained in alter_table.sql
 create or replace function dummy_hashint4(a int4, seed int8) returns int8 as
@@ -271,9 +759,8 @@ insert into hpart4 values (3, 4);
 update hpart1 set a = 3, b=4 where a = 1;
 ERROR:  new row for relation "hpart1" violates partition constraint
 DETAIL:  Failing row contains (3, 4).
+-- ok : row movement
 update hash_parted set b = b - 1 where b = 1;
-ERROR:  new row for relation "hpart1" violates partition constraint
-DETAIL:  Failing row contains (1, 0).
 -- ok
 update hash_parted set b = b + 8 where b = 1;
 -- cleanup
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 0c70d64..cfade17 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,25 +107,233 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+
+---------------------------
+-- UPDATE with row movement
+---------------------------
+
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+create table part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+alter table range_parted attach partition part_b_20_b_30 for values from ('b', 20) to ('b', 30);
+create table part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+alter table range_parted attach partition part_b_10_b_20 for values from ('b', 10) to ('b', 20);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) partition by range (abs(d));
+alter table part_c_100_200 drop column e, drop column c, drop column a;
+alter table part_c_100_200 add column c numeric, add column e varchar, add column a text;
+alter table part_c_100_200 drop column b;
+alter table part_c_100_200 add column b bigint;
+create table part_d_1_15 partition of part_c_100_200 for values from (1) to (15);
+create table part_d_15_20 partition of part_c_100_200 for values from (15) to (20);
+
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+
+create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_100_200 set c = c - 20, d = c where c = 105;
+-- fail (no partition key update, so no attempt to move tuple, but "a = 'a'" violates partition constraint enforced by root partition)
+update part_b_10_b_20 set a = 'a';
+-- success; partition key update, no constraint violation
+update range_parted set d = d - 10 where d > 10;
+-- success; no partition key update, no constraint violation
+update range_parted set e = d;
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+:show_data;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+:show_data;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+
+:show_data;
+
+-- cleanup
+drop view upview;
+
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+:show_data;
+
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+
+-- Enabling OLD TABLE capture for both DELETE as well as UPDATE stmt triggers
+-- should not cause DELETEd rows to be captured twice. Similar thing for
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d1_15 before update or insert on part_d_1_15
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d15_20 before update or insert on part_d_15_20
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_d1_15 ON part_d_1_15;
+drop trigger trig_d15_20 ON part_d_15_20;
+drop function func_parted_mod_b();
+
+
+-- statement triggers with update row movement
+---------------------------------------------------
+
+:init_range_parted;
+
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_d_1_15
+create trigger d1_delete_trig
+  after delete on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_update_trig
+  after update on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_insert_trig
+  after insert on part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+create trigger d15_delete_trig
+  after delete on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_update_trig
+  after update on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_insert_trig
+  after insert on part_d_15_20 for each statement execute procedure trigfunc();
+
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+:show_data;
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger d1_delete_trig ON part_d_1_15;
+drop trigger d1_update_trig ON part_d_1_15;
+drop trigger d1_insert_trig ON part_d_1_15;
+drop trigger d15_delete_trig ON part_d_15_20;
+drop trigger d15_update_trig ON part_d_15_20;
+drop trigger d15_insert_trig ON part_d_15_20;
+
+drop table mintab;
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
 
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
 insert into range_parted values ('c', 9);
@@ -134,6 +342,21 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 
+:show_data;
+
+-- Update row movement from non-default to default partition.
+-- Fail, default partition is not under part_a_10_a_20;
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+:show_data;
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+:show_data;
+
 create table list_parted (
 	a text,
 	b int
@@ -148,6 +371,84 @@ update list_default set a = 'a' where a = 'd';
 -- ok
 update list_default set a = 'x' where a = 'd';
 
+drop table list_parted;
+
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+
+
+-----------
+-- Triggers can cause UPDATE row movement if it modified partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger parted_mod_b ON sub_part1 ;
+
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger trig_skip_delete ON sub_part1 ;
+
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+drop table non_parted;
+
+drop function func_parted_mod_b();
+
 -- create custom operator class and hash function, for the same reason
 -- explained in alter_table.sql
 create or replace function dummy_hashint4(a int4, seed int8) returns int8 as
@@ -169,6 +470,7 @@ insert into hpart4 values (3, 4);
 
 -- fail
 update hpart1 set a = 3, b=4 where a = 1;
+-- ok : row movement
 update hash_parted set b = b - 1 where b = 1;
 -- ok
 update hash_parted set b = b + 8 where b = 1;
#207Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#206)
1 attachment(s)
Re: [HACKERS] UPDATE of partition key

While addressing Thomas's point about test scenarios not yet covered,
I observed the following ...

Suppose an UPDATE RLS policy with a WITH CHECK clause is defined on
the target table. Now in ExecUpdate(), the corresponding WCO qual gets
executed *before* the partition constraint check, as per existing
behaviour, and the qual succeeds. Then, because the partition key was
updated, the row is moved to another partition. Here, suppose there is
a BR INSERT trigger which modifies the row, such that the resultant
row would actually *not* pass the UPDATE RLS policy. But for this
partition, since the operation is an INSERT, only the INSERT RLS WCO
quals are executed.

So effectively, from a user's perspective, a row that an RLS WITH
CHECK policy was defined to reject gets updated successfully. This is
because the policy is not checked *after* a row trigger in the new
partition has been executed.

Attached is a test case that reproduces this issue.
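For reference, a minimal sketch of such a scenario could look like the
following. All object names here are illustrative (the attached
wco_rls_issue.sql is the actual reproducer); the point is that the BR
INSERT trigger on the destination partition rewrites the row *after*
the UPDATE policy's WITH CHECK clause has already been evaluated:

```sql
-- Illustrative sketch only; names are hypothetical.
create table prt (a int, b text) partition by range (a);
create table prt1 partition of prt for values from (1) to (10);
create table prt2 partition of prt for values from (10) to (20);

create user regress_rls_user;
grant all on prt to regress_rls_user;
alter table prt enable row level security;

-- UPDATE policy: updated rows must satisfy b <> 'blocked'.
create policy pol_sel on prt for select using (true);
create policy pol_ins on prt for insert with check (true);
create policy pol_upd on prt for update
  using (true) with check (b <> 'blocked');

-- BR INSERT trigger on the destination partition modifies the row so
-- that it would violate the UPDATE policy's WITH CHECK clause.
create function set_blocked() returns trigger language plpgsql as
$$ begin NEW.b = 'blocked'; return NEW; end $$;
create trigger prt2_br_ins before insert on prt2
  for each row execute procedure set_blocked();

insert into prt values (5, 'ok');
set role regress_rls_user;
-- Partition-key update moves the row from prt1 to prt2.  The UPDATE
-- WCO is checked before the move (b is still 'ok', so it passes); the
-- trigger then sets b = 'blocked', but only the INSERT WCO (always
-- true) is checked on the new partition, so the update succeeds.
update prt set a = 15 where a = 5;
reset role;
```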

I think, in the case of row movement, we should defer calling
ExecWithCheckOptions() until the row is being inserted using
ExecInsert(). And then in ExecInsert(), ExecWithCheckOptions() should
be called using WCO_RLS_UPDATE_CHECK rather than WCO_RLS_INSERT_CHECK
(I recall Amit Langote was of this opinion), as below:

--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -510,7 +510,9 @@ ExecInsert(ModifyTableState *mtstate,
  * we are looking for at this point.
  */
  if (resultRelInfo->ri_WithCheckOptions != NIL)
-     ExecWithCheckOptions(WCO_RLS_INSERT_CHECK,
+        ExecWithCheckOptions((mtstate->operation == CMD_UPDATE ?
+                             WCO_RLS_UPDATE_CHECK : WCO_RLS_INSERT_CHECK),
                              resultRelInfo, slot, estate);

It can be argued that, since we always execute INSERT row triggers for
rows inserted as part of update row movement, we should be consistent
and execute INSERT WCOs rather than UPDATE WCOs for such rows. But
note that the row triggers we execute are defined on the leaf
partitions, whereas the RLS policies being executed are defined on the
target partitioned table, not the leaf partition. Hence it makes sense
to execute them as per the original operation on the target table.
This is similar to why we execute UPDATE statement triggers even when
the row is eventually inserted into another partition: the UPDATE
statement trigger was defined on the target table, not the leaf
partition.

Barring any objections, I am going to send a revised patch that fixes
the above issue as described.

Thanks
-Amit Khandekar

Attachments:

wco_rls_issue.sql (application/octet-stream)
#208 Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#207)
2 attachment(s)
Re: [HACKERS] UPDATE of partition key

On 30 November 2017 at 18:56, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

While addressing Thomas's point about test scenarios not yet covered,
I observed the following ...

Suppose an UPDATE RLS policy with a WITH CHECK clause is defined on
the target table. In ExecUpdate(), the corresponding WCO qual gets
executed *before* the partition constraint check, as per existing
behaviour, and the qual succeeds. Then, because the partition key was
updated, the row is moved to another partition. Now suppose there is a
BR INSERT trigger on that partition which modifies the row, such that
the resultant row would *not* pass the UPDATE RLS policy. But since
the operation on that partition is an INSERT, only the INSERT RLS WCO
quals are executed.

So effectively, from a user's perspective, a row that an RLS WITH
CHECK policy was defined to reject gets updated successfully. This is
because the policy is not re-checked *after* a row trigger in the new
partition is executed.

Attached is a test case that reproduces this issue.

I think, in the case of row movement, we should defer calling
ExecWithCheckOptions() until the row is being inserted using
ExecInsert(). And then in ExecInsert(), ExecWithCheckOptions() should
be called with WCO_RLS_UPDATE_CHECK rather than WCO_RLS_INSERT_CHECK
(I recall Amit Langote was of this opinion), as below:

--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -510,7 +510,9 @@ ExecInsert(ModifyTableState *mtstate,
* we are looking for at this point.
*/
if (resultRelInfo->ri_WithCheckOptions != NIL)
-     ExecWithCheckOptions(WCO_RLS_INSERT_CHECK,
+        ExecWithCheckOptions((mtstate->operation == CMD_UPDATE ?
+                             WCO_RLS_UPDATE_CHECK : WCO_RLS_INSERT_CHECK),
resultRelInfo, slot, estate);

Attached is the v28 patch, which has the fix for this issue as
described above. In ExecUpdate(), if the partition constraint fails,
we skip ExecWithCheckOptions(), and it is later called in ExecInsert()
with WCO_RLS_UPDATE_CHECK.

I have added a few test scenarios for this in regress/sql/update.sql.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

encapsulate_partinfo_preparatory.patch (application/octet-stream)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 13eb9e3..61ead28 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -165,12 +165,9 @@ typedef struct CopyStateData
 	bool		volatile_defexprs;	/* is any of defexprs volatile? */
 	List	   *range_table;
 
-	PartitionDispatch *partition_dispatch_info;
-	int			num_dispatch;	/* Number of entries in the above array */
-	int			num_partitions; /* Number of members in the following arrays */
-	ResultRelInfo **partitions; /* Per partition result relation pointers */
-	TupleConversionMap **partition_tupconv_maps;
-	TupleTableSlot *partition_tuple_slot;
+	PartitionTupleRouting *partition_tuple_routing;
+	/* Tuple-routing support info */
+
 	TransitionCaptureState *transition_capture;
 	TupleConversionMap **transition_tupconv_maps;
 
@@ -2471,27 +2468,14 @@ CopyFrom(CopyState cstate)
 	 */
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo **partitions;
-		TupleConversionMap **partition_tupconv_maps;
-		TupleTableSlot *partition_tuple_slot;
-		int			num_parted,
-					num_partitions;
+		PartitionTupleRouting *ptr;
 
 		ExecSetupPartitionTupleRouting(cstate->rel,
 									   1,
 									   estate,
-									   &partition_dispatch_info,
-									   &partitions,
-									   &partition_tupconv_maps,
-									   &partition_tuple_slot,
-									   &num_parted, &num_partitions);
-		cstate->partition_dispatch_info = partition_dispatch_info;
-		cstate->num_dispatch = num_parted;
-		cstate->partitions = partitions;
-		cstate->num_partitions = num_partitions;
-		cstate->partition_tupconv_maps = partition_tupconv_maps;
-		cstate->partition_tuple_slot = partition_tuple_slot;
+									   &cstate->partition_tuple_routing);
+
+		ptr = cstate->partition_tuple_routing;
 
 		/*
 		 * If we are capturing transition tuples, they may need to be
@@ -2504,11 +2488,11 @@ CopyFrom(CopyState cstate)
 			int			i;
 
 			cstate->transition_tupconv_maps = (TupleConversionMap **)
-				palloc0(sizeof(TupleConversionMap *) * cstate->num_partitions);
-			for (i = 0; i < cstate->num_partitions; ++i)
+				palloc0(sizeof(TupleConversionMap *) * ptr->num_partitions);
+			for (i = 0; i < ptr->num_partitions; ++i)
 			{
 				cstate->transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(cstate->partitions[i]->ri_RelationDesc),
+					convert_tuples_by_name(RelationGetDescr(ptr->partitions[i]->ri_RelationDesc),
 										   RelationGetDescr(cstate->rel),
 										   gettext_noop("could not convert row type"));
 			}
@@ -2528,7 +2512,7 @@ CopyFrom(CopyState cstate)
 	if ((resultRelInfo->ri_TrigDesc != NULL &&
 		 (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
 		  resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
-		cstate->partition_dispatch_info != NULL ||
+		cstate->partition_tuple_routing != NULL ||
 		cstate->volatile_defexprs)
 	{
 		useHeapMultiInsert = false;
@@ -2603,10 +2587,11 @@ CopyFrom(CopyState cstate)
 		ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 
 		/* Determine the partition to heap_insert the tuple into */
-		if (cstate->partition_dispatch_info)
+		if (cstate->partition_tuple_routing)
 		{
 			int			leaf_part_index;
 			TupleConversionMap *map;
+			PartitionTupleRouting *ptr = cstate->partition_tuple_routing;
 
 			/*
 			 * Away we go ... If we end up not finding a partition after all,
@@ -2617,11 +2602,11 @@ CopyFrom(CopyState cstate)
 			 * partition, respectively.
 			 */
 			leaf_part_index = ExecFindPartition(resultRelInfo,
-												cstate->partition_dispatch_info,
+												ptr->partition_dispatch_info,
 												slot,
 												estate);
 			Assert(leaf_part_index >= 0 &&
-				   leaf_part_index < cstate->num_partitions);
+				   leaf_part_index < ptr->num_partitions);
 
 			/*
 			 * If this tuple is mapped to a partition that is not same as the
@@ -2639,7 +2624,7 @@ CopyFrom(CopyState cstate)
 			 * to the selected partition.
 			 */
 			saved_resultRelInfo = resultRelInfo;
-			resultRelInfo = cstate->partitions[leaf_part_index];
+			resultRelInfo = ptr->partitions[leaf_part_index];
 
 			/* We do not yet have a way to insert into a foreign partition */
 			if (resultRelInfo->ri_FdwRoutine)
@@ -2686,7 +2671,7 @@ CopyFrom(CopyState cstate)
 			 * We might need to convert from the parent rowtype to the
 			 * partition rowtype.
 			 */
-			map = cstate->partition_tupconv_maps[leaf_part_index];
+			map = ptr->partition_tupconv_maps[leaf_part_index];
 			if (map)
 			{
 				Relation	partrel = resultRelInfo->ri_RelationDesc;
@@ -2698,7 +2683,7 @@ CopyFrom(CopyState cstate)
 				 * point on.  Use a dedicated slot from this point on until
 				 * we're finished dealing with the partition.
 				 */
-				slot = cstate->partition_tuple_slot;
+				slot = ptr->partition_tuple_slot;
 				Assert(slot != NULL);
 				ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
 				ExecStoreTuple(tuple, slot, InvalidBuffer, true);
@@ -2850,8 +2835,9 @@ CopyFrom(CopyState cstate)
 	ExecCloseIndices(resultRelInfo);
 
 	/* Close all the partitioned tables, leaf partitions, and their indices */
-	if (cstate->partition_dispatch_info)
+	if (cstate->partition_tuple_routing)
 	{
+		PartitionTupleRouting *ptr = cstate->partition_tuple_routing;
 		int			i;
 
 		/*
@@ -2860,23 +2846,23 @@ CopyFrom(CopyState cstate)
 		 * the main target table of COPY that will be closed eventually by
 		 * DoCopy().  Also, tupslot is NULL for the root partitioned table.
 		 */
-		for (i = 1; i < cstate->num_dispatch; i++)
+		for (i = 1; i < ptr->num_dispatch; i++)
 		{
-			PartitionDispatch pd = cstate->partition_dispatch_info[i];
+			PartitionDispatch pd = ptr->partition_dispatch_info[i];
 
 			heap_close(pd->reldesc, NoLock);
 			ExecDropSingleTupleTableSlot(pd->tupslot);
 		}
-		for (i = 0; i < cstate->num_partitions; i++)
+		for (i = 0; i < ptr->num_partitions; i++)
 		{
-			ResultRelInfo *resultRelInfo = cstate->partitions[i];
+			ResultRelInfo *resultRelInfo = ptr->partitions[i];
 
 			ExecCloseIndices(resultRelInfo);
 			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 		}
 
 		/* Release the standalone partition tuple descriptor */
-		ExecDropSingleTupleTableSlot(cstate->partition_tuple_slot);
+		ExecDropSingleTupleTableSlot(ptr->partition_tuple_slot);
 	}
 
 	/* Close any trigger target relations */
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 59a0ca4..4b9f451 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -42,22 +42,9 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
  * tuple routing for partitioned tables
  *
  * Output arguments:
- * 'pd' receives an array of PartitionDispatch objects with one entry for
- *		every partitioned table in the partition tree
- * 'partitions' receives an array of ResultRelInfo* objects with one entry for
- *		every leaf partition in the partition tree
- * 'tup_conv_maps' receives an array of TupleConversionMap objects with one
- *		entry for every leaf partition (required to convert input tuple based
- *		on the root table's rowtype to a leaf partition's rowtype after tuple
- *		routing is done)
- * 'partition_tuple_slot' receives a standalone TupleTableSlot to be used
- *		to manipulate any given leaf partition's rowtype after that partition
- *		is chosen by tuple-routing.
- * 'num_parted' receives the number of partitioned tables in the partition
- *		tree (= the number of entries in the 'pd' output array)
- * 'num_partitions' receives the number of leaf partitions in the partition
- *		tree (= the number of entries in the 'partitions' and 'tup_conv_maps'
- *		output arrays
+ *
+ * 'partition_tuple_routing' encapsulates all the partition related information
+ *		required to do tuple routing.
  *
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
@@ -66,29 +53,30 @@ void
 ExecSetupPartitionTupleRouting(Relation rel,
 							   Index resultRTindex,
 							   EState *estate,
-							   PartitionDispatch **pd,
-							   ResultRelInfo ***partitions,
-							   TupleConversionMap ***tup_conv_maps,
-							   TupleTableSlot **partition_tuple_slot,
-							   int *num_parted, int *num_partitions)
+							   PartitionTupleRouting **partition_tuple_routing)
 {
 	TupleDesc	tupDesc = RelationGetDescr(rel);
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
 	ResultRelInfo *leaf_part_rri;
+	PartitionTupleRouting *ptr;
 
 	/*
 	 * Get the information about the partition tree after locking all the
 	 * partitions.
 	 */
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-	*pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
-	*num_partitions = list_length(leaf_parts);
-	*partitions = (ResultRelInfo **) palloc(*num_partitions *
+	ptr = *partition_tuple_routing =
+		(PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
+	ptr->partition_dispatch_info =
+		RelationGetPartitionDispatchInfo(rel, &ptr->num_dispatch, &leaf_parts);
+	ptr->num_partitions = list_length(leaf_parts);
+	ptr->partitions = (ResultRelInfo **) palloc(ptr->num_partitions *
 											sizeof(ResultRelInfo *));
-	*tup_conv_maps = (TupleConversionMap **) palloc0(*num_partitions *
-													 sizeof(TupleConversionMap *));
+	ptr->partition_tupconv_maps =
+		(TupleConversionMap **) palloc0(ptr->num_partitions *
+										sizeof(TupleConversionMap *));
 
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
@@ -96,9 +84,9 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	 * (such as ModifyTableState) and released when the node finishes
 	 * processing.
 	 */
-	*partition_tuple_slot = MakeTupleTableSlot();
+	ptr->partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = (ResultRelInfo *) palloc0(*num_partitions *
+	leaf_part_rri = (ResultRelInfo *) palloc0(ptr->num_partitions *
 											  sizeof(ResultRelInfo));
 	i = 0;
 	foreach(cell, leaf_parts)
@@ -118,7 +106,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		 * Save a tuple conversion map to convert a tuple routed to this
 		 * partition from the parent's type to the partition's.
 		 */
-		(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
+		ptr->partition_tupconv_maps[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
 													 gettext_noop("could not convert row type"));
 
 		InitResultRelInfo(leaf_part_rri,
@@ -144,7 +132,7 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		(*partitions)[i] = leaf_part_rri++;
+		ptr->partitions[i] = leaf_part_rri++;
 		i++;
 	}
 }
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 1e3ece9..16789fa 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -279,32 +279,33 @@ ExecInsert(ModifyTableState *mtstate,
 	resultRelInfo = estate->es_result_relation_info;
 
 	/* Determine the partition to heap_insert the tuple into */
-	if (mtstate->mt_partition_dispatch_info)
+	if (mtstate->mt_partition_tuple_routing)
 	{
 		int			leaf_part_index;
+		PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;
 		TupleConversionMap *map;
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
 		 * ExecFindPartition() does not return and errors out instead.
 		 * Otherwise, the returned value is to be used as an index into arrays
-		 * mt_partitions[] and mt_partition_tupconv_maps[] that will get us
+		 * ptr->partitions[] and ptr->partition_tupconv_maps[] that will get us
 		 * the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
 		leaf_part_index = ExecFindPartition(resultRelInfo,
-											mtstate->mt_partition_dispatch_info,
+											ptr->partition_dispatch_info,
 											slot,
 											estate);
 		Assert(leaf_part_index >= 0 &&
-			   leaf_part_index < mtstate->mt_num_partitions);
+			   leaf_part_index < ptr->num_partitions);
 
 		/*
 		 * Save the old ResultRelInfo and switch to the one corresponding to
 		 * the selected partition.
 		 */
 		saved_resultRelInfo = resultRelInfo;
-		resultRelInfo = mtstate->mt_partitions[leaf_part_index];
+		resultRelInfo = ptr->partitions[leaf_part_index];
 
 		/* We do not yet have a way to insert into a foreign partition */
 		if (resultRelInfo->ri_FdwRoutine)
@@ -352,7 +353,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = mtstate->mt_partition_tupconv_maps[leaf_part_index];
+		map = ptr->partition_tupconv_maps[leaf_part_index];
 		if (map)
 		{
 			Relation	partrel = resultRelInfo->ri_RelationDesc;
@@ -364,7 +365,7 @@ ExecInsert(ModifyTableState *mtstate,
 			 * on, until we're finished dealing with the partition. Use the
 			 * dedicated slot for that.
 			 */
-			slot = mtstate->mt_partition_tuple_slot;
+			slot = ptr->partition_tuple_slot;
 			Assert(slot != NULL);
 			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
 			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
@@ -1500,9 +1501,10 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 		mtstate->mt_oc_transition_capture != NULL)
 	{
 		int			numResultRelInfos;
+		PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;
 
-		numResultRelInfos = (mtstate->mt_partition_tuple_slot != NULL ?
-							 mtstate->mt_num_partitions :
+		numResultRelInfos = (ptr != NULL ?
+							 ptr->num_partitions :
 							 mtstate->mt_nplans);
 
 		/*
@@ -1515,13 +1517,13 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
 
 		/* Choose the right set of partitions */
-		if (mtstate->mt_partition_dispatch_info != NULL)
+		if (ptr != NULL)
 		{
 			/*
 			 * For tuple routing among partitions, we need TupleDescs based on
 			 * the partition routing table.
 			 */
-			ResultRelInfo **resultRelInfos = mtstate->mt_partitions;
+			ResultRelInfo **resultRelInfos = ptr->partitions;
 
 			for (i = 0; i < numResultRelInfos; ++i)
 			{
@@ -1833,6 +1835,8 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	PartitionTupleRouting *ptr = NULL;
+	int			num_partitions = 0;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -1946,27 +1950,13 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	if (operation == CMD_INSERT &&
 		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo **partitions;
-		TupleConversionMap **partition_tupconv_maps;
-		TupleTableSlot *partition_tuple_slot;
-		int			num_parted,
-					num_partitions;
-
 		ExecSetupPartitionTupleRouting(rel,
 									   node->nominalRelation,
 									   estate,
-									   &partition_dispatch_info,
-									   &partitions,
-									   &partition_tupconv_maps,
-									   &partition_tuple_slot,
-									   &num_parted, &num_partitions);
-		mtstate->mt_partition_dispatch_info = partition_dispatch_info;
-		mtstate->mt_num_dispatch = num_parted;
-		mtstate->mt_partitions = partitions;
-		mtstate->mt_num_partitions = num_partitions;
-		mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
-		mtstate->mt_partition_tuple_slot = partition_tuple_slot;
+									   &mtstate->mt_partition_tuple_routing);
+
+		ptr = mtstate->mt_partition_tuple_routing;
+		num_partitions = ptr->num_partitions;
 	}
 
 	/*
@@ -2009,7 +1999,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
 	 * cases are handled above.
 	 */
-	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
+	if (node->withCheckOptionLists != NIL && num_partitions > 0)
 	{
 		List	   *wcoList;
 		PlanState  *plan;
@@ -2026,14 +2016,14 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			   mtstate->mt_nplans == 1);
 		wcoList = linitial(node->withCheckOptionLists);
 		plan = mtstate->mt_plans[0];
-		for (i = 0; i < mtstate->mt_num_partitions; i++)
+		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *mapped_wcoList;
 			List	   *wcoExprs = NIL;
 			ListCell   *ll;
 
-			resultRelInfo = mtstate->mt_partitions[i];
+			resultRelInfo = ptr->partitions[i];
 			partrel = resultRelInfo->ri_RelationDesc;
 
 			/* varno = node->nominalRelation */
@@ -2102,12 +2092,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * are handled above.
 		 */
 		returningList = linitial(node->returningLists);
-		for (i = 0; i < mtstate->mt_num_partitions; i++)
+		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *rlist;
 
-			resultRelInfo = mtstate->mt_partitions[i];
+			resultRelInfo = ptr->partitions[i];
 			partrel = resultRelInfo->ri_RelationDesc;
 
 			/* varno = node->nominalRelation */
@@ -2376,29 +2366,33 @@ ExecEndModifyTable(ModifyTableState *node)
 	/*
 	 * Close all the partitioned tables, leaf partitions, and their indices
 	 *
-	 * Remember node->mt_partition_dispatch_info[0] corresponds to the root
+	 * Remember ptr->partition_dispatch_info[0] corresponds to the root
 	 * partitioned table, which we must not try to close, because it is the
 	 * main target table of the query that will be closed by ExecEndPlan().
 	 * Also, tupslot is NULL for the root partitioned table.
 	 */
-	for (i = 1; i < node->mt_num_dispatch; i++)
+	if (node->mt_partition_tuple_routing)
 	{
-		PartitionDispatch pd = node->mt_partition_dispatch_info[i];
+		PartitionTupleRouting *ptr = node->mt_partition_tuple_routing;
 
-		heap_close(pd->reldesc, NoLock);
-		ExecDropSingleTupleTableSlot(pd->tupslot);
-	}
-	for (i = 0; i < node->mt_num_partitions; i++)
-	{
-		ResultRelInfo *resultRelInfo = node->mt_partitions[i];
+		for (i = 1; i < ptr->num_dispatch; i++)
+		{
+			PartitionDispatch pd = ptr->partition_dispatch_info[i];
 
-		ExecCloseIndices(resultRelInfo);
-		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
-	}
+			heap_close(pd->reldesc, NoLock);
+			ExecDropSingleTupleTableSlot(pd->tupslot);
+		}
+		for (i = 0; i < ptr->num_partitions; i++)
+		{
+			ResultRelInfo *resultRelInfo = ptr->partitions[i];
+			ExecCloseIndices(resultRelInfo);
+			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
+		}
 
-	/* Release the standalone partition tuple descriptor, if any */
-	if (node->mt_partition_tuple_slot)
-		ExecDropSingleTupleTableSlot(node->mt_partition_tuple_slot);
+		/* Release the standalone partition tuple descriptor, if any */
+		if (ptr->partition_tuple_slot)
+			ExecDropSingleTupleTableSlot(ptr->partition_tuple_slot);
+	}
 
 	/*
 	 * Free the exprcontext
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 43ca990..8a7cedf 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -49,14 +49,44 @@ typedef struct PartitionDispatchData
 
 typedef struct PartitionDispatchData *PartitionDispatch;
 
+/*-----------------------
+ * PartitionTupleRouting - Encapsulates all information required to execute
+ * tuple-routing between partitions.
+ *
+ * partition_dispatch_info		Array of PartitionDispatch objects with one
+ *								entry for every partitioned table in the
+ *								partition tree.
+ * num_dispatch					number of partitioned tables in the partition
+ *								tree (= length of partition_dispatch_info[])
+ * partitions					Array of ResultRelInfo* objects with one entry
+ *								for every leaf partition in the partition tree.
+ * num_partitions				Number of leaf partitions in the partition tree
+ *								(= 'partitions' array length)
+ * partition_tupconv_maps		Array of TupleConversionMap objects with one
+ *								entry for every leaf partition (required to
+ *								convert input tuple based on the root table's
+ *								rowtype to a leaf partition's rowtype after
+ *								tuple routing is done)
+ * partition_tuple_slot			TupleTableSlot to be used to manipulate any
+ *								given leaf partition's rowtype after that
+ *								partition is chosen for insertion by
+ *								tuple-routing.
+ *-----------------------
+ */
+typedef struct PartitionTupleRouting
+{
+	PartitionDispatch *partition_dispatch_info;
+	int			num_dispatch;
+	ResultRelInfo **partitions;
+	int			num_partitions;
+	TupleConversionMap **partition_tupconv_maps;
+	TupleTableSlot *partition_tuple_slot;
+} PartitionTupleRouting;
+
 extern void ExecSetupPartitionTupleRouting(Relation rel,
 							   Index resultRTindex,
 							   EState *estate,
-							   PartitionDispatch **pd,
-							   ResultRelInfo ***partitions,
-							   TupleConversionMap ***tup_conv_maps,
-							   TupleTableSlot **partition_tuple_slot,
-							   int *num_parted, int *num_partitions);
+							   PartitionTupleRouting **partition_tuple_routing);
 extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e05bc04..6b481b4 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -976,15 +976,8 @@ typedef struct ModifyTableState
 	TupleTableSlot *mt_existing;	/* slot to store existing target tuple in */
 	List	   *mt_excludedtlist;	/* the excluded pseudo relation's tlist  */
 	TupleTableSlot *mt_conflproj;	/* CONFLICT ... SET ... projection target */
-	struct PartitionDispatchData **mt_partition_dispatch_info;
+	struct PartitionTupleRouting *mt_partition_tuple_routing;
 	/* Tuple-routing support info */
-	int			mt_num_dispatch;	/* Number of entries in the above array */
-	int			mt_num_partitions;	/* Number of members in the following
-									 * arrays */
-	ResultRelInfo **mt_partitions;	/* Per partition result relation pointers */
-	TupleConversionMap **mt_partition_tupconv_maps;
-	/* Per partition tuple conversion map */
-	TupleTableSlot *mt_partition_tuple_slot;
 	struct TransitionCaptureState *mt_transition_capture;
 	/* controls transition table population for specified operation */
 	struct TransitionCaptureState *mt_oc_transition_capture;
update-partition-key_v28.patch (application/octet-stream)
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 9f58326..45c7120 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3005,6 +3005,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition where this row satisfies its partition constraint.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3297,9 +3302,22 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</command> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</command> causes a row to move from one
+       partition to another, there is a chance that another concurrent
+       <command>UPDATE</command> or <command>DELETE</command> misses this row.
+       Suppose, session 1 is performing an <command>UPDATE</command> on a
+       partition key, and meanwhile a concurrent session 2 for which this row
+       is visible, performs an <command>UPDATE</command> or
+       <command>DELETE</command> operation on this row. Session 2 can silently
+       miss the row if the row is deleted from the partition due to session
+       1's activity.  In such case, session 2's
+       <command>UPDATE</command>/<command>DELETE</command>, being unaware of
+       the row movement, interprets that the row has just been deleted so there
+       is nothing to be done for this row. Whereas, in the usual case where the
+       table is not partitioned, or where there is no row movement, session 2
+       would have identified the newly updated row and carried
+       <command>UPDATE</command>/<command>DELETE</command> on this new row
+       version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index c0d0f71..3c665f0 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</command> and
+   <command>INSERT</command> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</command> or <command>DELETE</command> on the
+   same row may miss this row. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index bf5d3f9..aaffc4d 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -154,6 +154,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to move
+    to another partition, it will be performed as a <command>DELETE</command>
+    from the original partition followed by <command>INSERT</command> into the
+    new partition. In this case, all row-level <literal>BEFORE</literal>
+    <command>UPDATE</command> triggers and all row-level
+    <literal>BEFORE</literal> <command>DELETE</command> triggers are fired on
+    the original partition. Then all row-level <literal>BEFORE</literal>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</literal>
+    triggers are concerned, <literal>AFTER</literal> <command>DELETE</command>
+    and <literal>AFTER</literal> <command>INSERT</command> triggers are
+    applied; but <literal>AFTER</literal> <command>UPDATE</command> triggers
+    are not applied because the <command>UPDATE</command> has been converted to
+    a <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 2bf8117..4f34b03 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1441,7 +1441,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
 
 /*
  * map_partition_varattnos - maps varattno of any Vars in expr from the
- * parent attno to partition attno.
+ * attnos of 'from_rel' to the attnos of 'to_rel'.
+ * Either rel can be a leaf partition or a partitioned table.
  *
  * We must allow for cases where physical attnos of a partition can be
  * different from the parent's.
@@ -1454,8 +1455,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
  * are working on Lists, so it's less messy to do the casts internally.
  */
 List *
-map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row)
 {
 	bool		my_found_whole_row = false;
@@ -1464,14 +1465,14 @@ map_partition_varattnos(List *expr, int target_varno,
 	{
 		AttrNumber *part_attnos;
 
-		part_attnos = convert_tuples_by_name_map(RelationGetDescr(partrel),
-												 RelationGetDescr(parent),
+		part_attnos = convert_tuples_by_name_map(RelationGetDescr(to_rel),
+												 RelationGetDescr(from_rel),
 												 gettext_noop("could not convert row type"));
 		expr = (List *) map_variable_attnos((Node *) expr,
-											target_varno, 0,
+											fromrel_varno, 0,
 											part_attnos,
-											RelationGetDescr(parent)->natts,
-											RelationGetForm(partrel)->reltype,
+											RelationGetDescr(from_rel)->natts,
+											RelationGetForm(to_rel)->reltype,
 											&my_found_whole_row);
 	}
 
@@ -2594,6 +2595,69 @@ get_partition_for_tuple(Relation relation, Datum *values, bool *isnull)
 }
 
 /*
+ * Checks if any of the 'attnums' is a partition key attribute for rel
+ *
+ * Sets *used_in_expr if any of the 'attnums' is found to be referenced in some
+ * partition key expression.  It's possible for a column to be both used
+ * directly and as part of an expression; if that happens, *used_in_expr may
+ * end up as either true or false.  That's OK for current uses of this
+ * function, because *used_in_expr is only used to tailor the error message
+ * text.
+ */
+bool
+has_partition_attrs(Relation rel, Bitmapset *attnums, bool *used_in_expr)
+{
+	PartitionKey key;
+	int			partnatts;
+	List	   *partexprs;
+	ListCell   *partexprs_item;
+	int			i;
+
+	if (attnums == NULL || rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		return false;
+
+	key = RelationGetPartitionKey(rel);
+	partnatts = get_partition_natts(key);
+	partexprs = get_partition_exprs(key);
+
+	partexprs_item = list_head(partexprs);
+	for (i = 0; i < partnatts; i++)
+	{
+		AttrNumber	partattno = get_partition_col_attnum(key, i);
+
+		if (partattno != 0)
+		{
+			if (bms_is_member(partattno - FirstLowInvalidHeapAttributeNumber,
+							  attnums))
+			{
+				if (used_in_expr)
+					*used_in_expr = false;
+				return true;
+			}
+		}
+		else
+		{
+			/* Arbitrary expression */
+			Node	   *expr = (Node *) lfirst(partexprs_item);
+			Bitmapset  *expr_attrs = NULL;
+
+			/* Find all attributes referenced */
+			pull_varattnos(expr, 1, &expr_attrs);
+			partexprs_item = lnext(partexprs_item);
+
+			if (bms_overlap(attnums, expr_attrs))
+			{
+				if (used_in_expr)
+					*used_in_expr = true;
+				return true;
+			}
+		}
+	}
+
+	return false;
+}
+
+/*
  * qsort_partition_hbound_cmp
  *
  * We sort hash bounds by modulus, then by remainder.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 61ead28..322e326 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2471,6 +2471,8 @@ CopyFrom(CopyState cstate)
 		PartitionTupleRouting *ptr;
 
 		ExecSetupPartitionTupleRouting(cstate->rel,
+									   NULL,
+									   0,
 									   1,
 									   estate,
 									   &cstate->partition_tuple_routing);
@@ -2671,7 +2673,7 @@ CopyFrom(CopyState cstate)
 			 * We might need to convert from the parent rowtype to the
 			 * partition rowtype.
 			 */
-			map = ptr->partition_tupconv_maps[leaf_part_index];
+			map = ptr->parentchild_tupconv_maps[leaf_part_index];
 			if (map)
 			{
 				Relation	partrel = resultRelInfo->ri_RelationDesc;
@@ -2734,7 +2736,7 @@ CopyFrom(CopyState cstate)
 
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr || check_partition_constr)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index d979ce2..64c2185 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -468,7 +468,6 @@ static void RangeVarCallbackForDropRelation(const RangeVar *rel, Oid relOid,
 								Oid oldRelOid, void *arg);
 static void RangeVarCallbackForAlterRelation(const RangeVar *rv, Oid relid,
 								 Oid oldrelid, void *arg);
-static bool is_partition_attr(Relation rel, AttrNumber attnum, bool *used_in_expr);
 static PartitionSpec *transformPartitionSpec(Relation rel, PartitionSpec *partspec, char *strategy);
 static void ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs,
 					  List **partexprs, Oid *partopclass, Oid *partcollation, char strategy);
@@ -6492,68 +6491,6 @@ ATPrepDropColumn(List **wqueue, Relation rel, bool recurse, bool recursing,
 }
 
 /*
- * Checks if attnum is a partition attribute for rel
- *
- * Sets *used_in_expr if attnum is found to be referenced in some partition
- * key expression.  It's possible for a column to be both used directly and
- * as part of an expression; if that happens, *used_in_expr may end up as
- * either true or false.  That's OK for current uses of this function, because
- * *used_in_expr is only used to tailor the error message text.
- */
-static bool
-is_partition_attr(Relation rel, AttrNumber attnum, bool *used_in_expr)
-{
-	PartitionKey key;
-	int			partnatts;
-	List	   *partexprs;
-	ListCell   *partexprs_item;
-	int			i;
-
-	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
-		return false;
-
-	key = RelationGetPartitionKey(rel);
-	partnatts = get_partition_natts(key);
-	partexprs = get_partition_exprs(key);
-
-	partexprs_item = list_head(partexprs);
-	for (i = 0; i < partnatts; i++)
-	{
-		AttrNumber	partattno = get_partition_col_attnum(key, i);
-
-		if (partattno != 0)
-		{
-			if (attnum == partattno)
-			{
-				if (used_in_expr)
-					*used_in_expr = false;
-				return true;
-			}
-		}
-		else
-		{
-			/* Arbitrary expression */
-			Node	   *expr = (Node *) lfirst(partexprs_item);
-			Bitmapset  *expr_attrs = NULL;
-
-			/* Find all attributes referenced */
-			pull_varattnos(expr, 1, &expr_attrs);
-			partexprs_item = lnext(partexprs_item);
-
-			if (bms_is_member(attnum - FirstLowInvalidHeapAttributeNumber,
-							  expr_attrs))
-			{
-				if (used_in_expr)
-					*used_in_expr = true;
-				return true;
-			}
-		}
-	}
-
-	return false;
-}
-
-/*
  * Return value is the address of the dropped column.
  */
 static ObjectAddress
@@ -6613,7 +6550,9 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
 						colName)));
 
 	/* Don't drop columns used in the partition key */
-	if (is_partition_attr(rel, attnum, &is_expr))
+	if (has_partition_attrs(rel,
+							bms_make_singleton(attnum - FirstLowInvalidHeapAttributeNumber),
+							&is_expr))
 	{
 		if (!is_expr)
 			ereport(ERROR,
@@ -8837,7 +8776,9 @@ ATPrepAlterColumnType(List **wqueue,
 						colName)));
 
 	/* Don't alter columns used in the partition key */
-	if (is_partition_attr(rel, attnum, &is_expr))
+	if (has_partition_attrs(rel,
+							bms_make_singleton(attnum - FirstLowInvalidHeapAttributeNumber),
+							&is_expr))
 	{
 		if (!is_expr)
 			ereport(ERROR,
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 92ae382..73ec872 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2854,8 +2854,13 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 	{
 		HeapTuple	trigtuple;
 
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
+		/*
+		 * Note: if the UPDATE is converted into a DELETE+INSERT as part of
+		 * update-partition-key operation, then this function is also called
+		 * separately for DELETE and INSERT to capture transition table rows.
+		 * In such case, either old tuple or new tuple can be NULL.
+		 */
+		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
 			trigtuple = GetTupleForTrigger(estate,
 										   NULL,
 										   relinfo,
@@ -5414,7 +5419,12 @@ AfterTriggerPendingOnRel(Oid relid)
  *	triggers actually need to be queued.  It is also called after each row,
  *	even if there are no triggers for that event, if there are any AFTER
  *	STATEMENT triggers for the statement which use transition tables, so that
- *	the transition tuplestores can be built.
+ *	the transition tuplestores can be built.  Furthermore, if the transition
+ *	capture is happening for UPDATEd rows being moved to another partition due
+ *	to a partition-key change, then this function is called once when the row is
+ *	deleted (to capture OLD row), and once when the row is inserted to another
+ *	partition (to capture NEW row).  This is done separately because DELETE and
+ *	INSERT happen on different tables.
  *
  *	Transition tuplestores are built now, rather than when events are pulled
  *	off of the queue because AFTER ROW triggers are allowed to select from the
@@ -5463,12 +5473,25 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		bool		update_new_table = transition_capture->tcs_update_new_table;
+		bool		insert_new_table = transition_capture->tcs_insert_new_table;
 
-		if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_old_table))
+		/*
+		 * For INSERT events newtup should be non-NULL, for DELETE events
+		 * oldtup should be non-NULL, whereas for UPDATE events normally both
+		 * oldtup and newtup are non-NULL.  But for an UPDATE event fired to
+		 * capture transition tuples during partition-key row movement, oldtup
+		 * is NULL when the event is for the row being inserted, whereas newtup
+		 * is NULL when the event is for the row being deleted.
+		 */
+		Assert(!(event == TRIGGER_EVENT_DELETE && delete_old_table &&
+				 oldtup == NULL));
+		Assert(!(event == TRIGGER_EVENT_INSERT && insert_new_table &&
+				 newtup == NULL));
+
+		if (oldtup != NULL &&
+			((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_old_table)))
 		{
 			Tuplestorestate *old_tuplestore;
 
-			Assert(oldtup != NULL);
 			old_tuplestore = transition_capture->tcs_private->old_tuplestore;
 
 			if (map != NULL)
@@ -5481,12 +5504,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			else
 				tuplestore_puttuple(old_tuplestore, oldtup);
 		}
-		if ((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_new_table))
+		if (newtup != NULL &&
+			((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_new_table)))
 		{
 			Tuplestorestate *new_tuplestore;
 
-			Assert(newtup != NULL);
 			new_tuplestore = transition_capture->tcs_private->new_tuplestore;
 
 			if (original_insert_tuple != NULL)
@@ -5502,11 +5525,17 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				tuplestore_puttuple(new_tuplestore, newtup);
 		}
 
-		/* If transition tables are the only reason we're here, return. */
+		/*
+		 * If transition tables are the only reason we're here, return. As
+		 * mentioned above, we can also be here during update tuple routing in
+		 * the presence of transition tables, in which case this function is
+		 * called separately for oldtup and newtup, so either can be NULL, but
+		 * not both.
+		 */
 		if (trigdesc == NULL ||
 			(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
 			(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+			(event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
 			return;
 	}
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index dbaa47f..5ec92d5 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1849,16 +1849,12 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  *
  * Exported in executor.h for outside use.
+ * Returns true if the tuple satisfies the partition constraint, else false.
  */
-void
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
-	Relation	rel = resultRelInfo->ri_RelationDesc;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	Bitmapset  *modifiedCols;
-	Bitmapset  *insertedCols;
-	Bitmapset  *updatedCols;
 	ExprContext *econtext;
 
 	/*
@@ -1886,52 +1882,69 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
 	 */
-	if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+	return ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext);
+}
+
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
+	/*
+	 * Need to first convert the tuple to the root partitioned table's row
+	 * type. For details, check similar comments in ExecConstraints().
+	 */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
+									 gettext_noop("could not convert row type"));
+		if (map != NULL)
 		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
-										 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecSetSlotDescriptor(slot, tupdesc);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+			tuple = do_convert_tuple(tuple, map);
+			ExecSetSlotDescriptor(slot, tupdesc);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("new row for relation \"%s\" violates partition constraint",
-						RelationGetRelationName(orig_rel)),
-				 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+			 errmsg("new row for relation \"%s\" violates partition constraint",
+					RelationGetRelationName(orig_rel)),
+			 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if
+ * requested, checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1939,7 +1952,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2055,8 +2069,9 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 }
 
 
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 4b9f451..f0ed6ea 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -41,6 +41,13 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
  * ExecSetupPartitionTupleRouting - set up information needed during
  * tuple routing for partitioned tables
  *
+ * 'update_rri' contains the UPDATE per-subplan result rels. For the output
+ *		param 'partitions', we don't allocate new ResultRelInfo objects for
+ *		leaf partitions for which they are already available in 'update_rri'.
+ *
+ * 'num_update_rri' is the number of elements in the 'update_rri' array, or
+ *		zero for INSERT.
+ *
  * Output arguments:
  *
  * 'partition_tuple_routing' encapsulates all the partition related information
@@ -51,6 +58,8 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
  */
 void
 ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionTupleRouting **partition_tuple_routing)
@@ -59,7 +68,9 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
+	ResultRelInfo *leaf_part_arr = NULL;
+	int			update_rri_index = 0;
+	bool		is_update = (num_update_rri > 0);
 	PartitionTupleRouting *ptr;
 
 	/*
@@ -74,10 +85,48 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	ptr->num_partitions = list_length(leaf_parts);
 	ptr->partitions = (ResultRelInfo **) palloc(ptr->num_partitions *
 											sizeof(ResultRelInfo *));
-	ptr->partition_tupconv_maps =
+	ptr->parentchild_tupconv_maps =
 		(TupleConversionMap **) palloc0(ptr->num_partitions *
 										sizeof(TupleConversionMap *));
 
+	if (is_update)
+	{
+		/*
+		 * For updates, if a leaf partition is already present in the
+		 * per-subplan result rels, we re-use it rather than initialize a new
+		 * result rel. The per-subplan resultrels and the leaf partition
+		 * resultrels are in the same canonical order, so while scanning the
+		 * leaf partition oids we keep a cursor (update_rri_index) pointing to
+		 * the next per-subplan result rel to look for, starting at the first
+		 * one and advancing it each time a match is found.
+		 */
+		update_rri_index = 0;
+
+		/*
+		 * Prepare for generating the mapping from subplan result rels to leaf
+		 * partition position.
+		 */
+		ptr->subplan_partition_offsets = palloc(num_update_rri * sizeof(int));
+
+		/*
+		 * For UPDATEs, we require an additional tuple slot for storing
+		 * transient tuples that are converted to the root table descriptor.
+		 */
+		ptr->root_tuple_slot = MakeTupleTableSlot();
+	}
+	else
+	{
+		/*
+		 * For inserts, we need to create all new result rels, so avoid
+		 * repeated pallocs by allocating memory for all the result rels in
+		 * bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(ptr->num_partitions *
+												  sizeof(ResultRelInfo));
+	}
+
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
 	 * given partition's rowtype.  It is attached to the caller-specified node
@@ -86,37 +135,80 @@ ExecSetupPartitionTupleRouting(Relation rel,
 	 */
 	ptr->partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = (ResultRelInfo *) palloc0(ptr->num_partitions *
-											  sizeof(ResultRelInfo));
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (is_update)
+		{
+			/* Is this leaf partition present in the per-subplan result rels? */
+			if (update_rri_index < num_update_rri &&
+				RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
+			{
+				leaf_part_rri = &update_rri[update_rri_index];
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required when we convert the partition's tuple to be
+				 * compatible with the root partitioned table's tuple
+				 * descriptor.  When generating the per-subplan UPDATE result
+				 * rels, this was not set.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+
+				/*
+				 * Save the position of this update rel in the leaf partitions
+				 * array
+				 */
+				ptr->subplan_partition_offsets[update_rri_index] = i;
+
+				update_rri_index++;
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = leaf_part_arr + i;
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in *partitions are eventually
-		 * closed by the caller.
+		 * If we didn't open the partition rel, it means we haven't initialized
+		 * the result rel either.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above including the leaf
+			 * partitions. Note that each of the newly opened relations in
+			 * *partitions are eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri,
+							  partrel,
+							  resultRTindex,
+							  rel,
+							  estate->es_instrument);
+		}
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
 		 * Save a tuple conversion map to convert a tuple routed to this
 		 * partition from the parent's type to the partition's.
 		 */
-		ptr->partition_tupconv_maps[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
+		ptr->parentchild_tupconv_maps[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
 													 gettext_noop("could not convert row type"));
 
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  resultRTindex,
-						  rel,
-						  estate->es_instrument);
-
 		/*
-		 * Verify result relation is a valid target for INSERT.
+		 * Verify that the result relation is a valid target for an insert
+		 * operation. Even for UPDATEs we check this, because tuple routing
+		 * ultimately performs an insert into the destination partition.
 		 */
 		CheckValidResultRel(leaf_part_rri, CMD_INSERT);
 
@@ -132,9 +224,15 @@ ExecSetupPartitionTupleRouting(Relation rel,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		ptr->partitions[i] = leaf_part_rri++;
+		ptr->partitions[i] = leaf_part_rri;
 		i++;
 	}
+
+	/*
+	 * For UPDATE, we should have found all the per-subplan resultrels in the
+	 * leaf partitions.
+	 */
+	Assert(!is_update || update_rri_index == num_update_rri);
 }
 
 /*
@@ -165,8 +263,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 	/* start with the root partitioned table */
 	parent = pd[0];
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index fb538c0..e11f7cb 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -401,7 +401,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -466,7 +466,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 16789fa..8810729 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -46,6 +46,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -63,7 +64,16 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 EState *estate,
 					 bool canSetTag,
 					 TupleTableSlot **returning);
-
+static void ExecSetupChildParentMap(ModifyTableState *mtstate,
+						ResultRelInfo *rootRelInfo,
+						int numResultRelInfos, bool perleaf);
+static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
+													int whichplan);
+static HeapTuple ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+										   TupleConversionMap *map,
+										   HeapTuple tuple,
+										   TupleTableSlot *new_slot,
+										   TupleTableSlot **p_my_slot);
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
  * target relation's rowtype
@@ -241,6 +251,38 @@ ExecCheckTIDVisible(EState *estate,
 	ReleaseBuffer(buffer);
 }
 
+/*
+ * ConvertPartitionTupleSlot -- convenience function for tuple conversion using
+ * 'map'. The converted tuple is stored in 'new_slot', and *p_my_slot is
+ * updated to point to 'new_slot'. 'new_slot' typically should be one of the
+ * dedicated partition tuple slots. If 'map' is NULL, *p_my_slot is left
+ * unchanged.
+ *
+ * Returns the converted tuple, unless 'map' is NULL, in which case the
+ * original tuple is returned unmodified.
+ */
+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+						  TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot,
+						  TupleTableSlot **p_my_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor to match the converted tuple.
+	 */
+	*p_my_slot = new_slot;
+	Assert(new_slot != NULL);
+	ExecSetSlotDescriptor(new_slot, map->outdesc);
+	ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
 /* ----------------------------------------------------------------
  *		ExecInsert
  *
@@ -266,6 +308,9 @@ ExecInsert(ModifyTableState *mtstate,
 	Oid			newId;
 	List	   *recheckIndexes = NIL;
 	TupleTableSlot *result = NULL;
+	TransitionCaptureState *transition_capture;
+
+	transition_capture = mtstate->mt_transition_capture;
 
 	/*
 	 * get the heap tuple out of the tuple table slot, making sure we have a
@@ -283,14 +328,13 @@ ExecInsert(ModifyTableState *mtstate,
 	{
 		int			leaf_part_index;
 		PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;
-		TupleConversionMap *map;
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
 		 * ExecFindPartition() does not return and errors out instead.
 		 * Otherwise, the returned value is to be used as an index into arrays
-		 * ptr->partitions[] and ptr->partition_tupconv_maps[] that will get us
-		 * the ResultRelInfo and TupleConversionMap for the partition,
+		 * ptr->partitions[] and ptr->parentchild_tupconv_maps[] that will get
+		 * us the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
 		leaf_part_index = ExecFindPartition(resultRelInfo,
@@ -332,8 +376,10 @@ ExecInsert(ModifyTableState *mtstate,
 				 * back to tuplestore format.
 				 */
 				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+
+				Assert(mtstate->mt_is_tupconv_perpart == true);
 				mtstate->mt_transition_capture->tcs_map =
-					mtstate->mt_transition_tupconv_maps[leaf_part_index];
+					mtstate->mt_childparent_tupconv_maps[leaf_part_index];
 			}
 			else
 			{
@@ -346,30 +392,21 @@ ExecInsert(ModifyTableState *mtstate,
 			}
 		}
 		if (mtstate->mt_oc_transition_capture != NULL)
+		{
+			Assert(mtstate->mt_is_tupconv_perpart == true);
 			mtstate->mt_oc_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[leaf_part_index];
+				mtstate->mt_childparent_tupconv_maps[leaf_part_index];
+		}
 
 		/*
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = ptr->partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = ptr->partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(mtstate,
+										  ptr->parentchild_tupconv_maps[leaf_part_index],
+										  tuple,
+										  ptr->partition_tuple_slot,
+										  &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -450,6 +487,7 @@ ExecInsert(ModifyTableState *mtstate,
 	}
 	else
 	{
+		WCOKind		wco_kind;
+
 		/*
 		 * We always check the partition constraint, including when the tuple
 		 * got here via tuple-routing.  However we don't need to in the latter
@@ -467,14 +505,21 @@ ExecInsert(ModifyTableState *mtstate,
 		tuple->t_tableOid = RelationGetRelid(resultRelationDesc);
 
 		/*
-		 * Check any RLS INSERT WITH CHECK policies
+		 * Check any RLS WITH CHECK policies.
 		 *
+		 * Normally we should check INSERT policies. But if this insert is part
+		 * of an UPDATE that moves a row to another partition, we should check
+		 * UPDATE policies instead, because to the user the operation is an
+		 * UPDATE on the target table, not an INSERT into a child partition.
+		 */
+		wco_kind = (mtstate->operation == CMD_UPDATE) ?
+					WCO_RLS_UPDATE_CHECK : WCO_RLS_INSERT_CHECK;
+
+		/*
 		 * ExecWithCheckOptions() will skip any WCOs which are not of the kind
 		 * we are looking for at this point.
 		 */
 		if (resultRelInfo->ri_WithCheckOptions != NIL)
-			ExecWithCheckOptions(WCO_RLS_INSERT_CHECK,
-								 resultRelInfo, slot, estate);
+			ExecWithCheckOptions(wco_kind, resultRelInfo, slot, estate);
 
 		/*
 		 * No need though if the tuple has been routed, and a BR trigger
@@ -487,7 +532,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 		/* Check the constraints of the tuple */
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -623,9 +668,32 @@ ExecInsert(ModifyTableState *mtstate,
 		setLastTid(&(tuple->t_self));
 	}
 
+	/*
+	 * If this INSERT is part of a partition-key-UPDATE and we are capturing
+	 * transition tables, put this row into the transition NEW TABLE.
+	 * (Similarly, the deleted row goes into the OLD TABLE.)  We need to do
+	 * this separately for DELETE and INSERT because they happen on different
+	 * tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_new_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo, NULL,
+							 NULL,
+							 tuple,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * Now that we have already captured NEW TABLE row, any AR INSERT
+		 * trigger should not again capture it below. Arrange for the same.
+		 */
+		transition_capture = NULL;
+	}
+
 	/* AFTER ROW INSERT Triggers */
 	ExecARInsertTriggers(estate, resultRelInfo, tuple, recheckIndexes,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
 	list_free(recheckIndexes);
 
@@ -679,6 +747,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool *tuple_deleted,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -686,6 +756,12 @@ ExecDelete(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
+	TransitionCaptureState *transition_capture;
+
+	transition_capture = mtstate->mt_transition_capture;
+
+	if (tuple_deleted)
+		*tuple_deleted = false;
 
 	/*
 	 * get information on the (current) result relation
@@ -850,12 +926,39 @@ ldelete:;
 	if (canSetTag)
 		(estate->es_processed)++;
 
+	/* The delete has actually happened, so tell the caller */
+	if (tuple_deleted)
+		*tuple_deleted = true;
+
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition OLD TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_old_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo,
+							 tupleid,
+							 oldtuple,
+							 NULL,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * Now that the OLD TABLE row has been captured, the AR DELETE
+		 * trigger below must not capture it again; pass it a NULL state.
+		 */
+		transition_capture = NULL;
+	}
+
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -948,6 +1051,7 @@ ExecUpdate(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	TupleConversionMap *saved_tcs_map = NULL;
 
 	/*
 	 * abort the operation if not running transactions
@@ -1019,6 +1123,7 @@ ExecUpdate(ModifyTableState *mtstate,
 	else
 	{
 		LockTupleMode lockmode;
+		bool		partition_constraint_failed;
 
 		/*
 		 * Constraints might reference the tableoid column, so initialize
@@ -1034,22 +1139,142 @@ ExecUpdate(ModifyTableState *mtstate,
 		 * (We don't need to redo triggers, however.  If there are any BEFORE
 		 * triggers then trigger.c will have done heap_lock_tuple to lock the
 		 * correct tuple, so there's no need to do them again.)
-		 *
-		 * ExecWithCheckOptions() will skip any WCOs which are not of the kind
-		 * we are looking for at this point.
 		 */
 lreplace:;
-		if (resultRelInfo->ri_WithCheckOptions != NIL)
+
+		/*
+		 * If the partition constraint fails, this row might get moved to
+		 * another partition, in which case we should check the RLS CHECK
+		 * policy just before inserting into the new partition rather than
+		 * here, because a trigger on that partition might change the row
+		 * again.  So skip the WCO checks if the partition constraint fails.
+		 */
+		partition_constraint_failed =
+			resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate);
+
+		if (!partition_constraint_failed &&
+			resultRelInfo->ri_WithCheckOptions != NIL)
+		{
+			/*
+			 * ExecWithCheckOptions() will skip any WCOs which are not of the
+			 * kind we are looking for at this point.
+			 */
 			ExecWithCheckOptions(WCO_RLS_UPDATE_CHECK,
 								 resultRelInfo, slot, estate);
+		}
+
+		/*
+		 * If a partition check failed, try to move the row into the right
+		 * partition.
+		 */
+		if (partition_constraint_failed)
+		{
+			bool		tuple_deleted;
+			TupleTableSlot *ret_slot;
+			PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;
+			int			map_index;
+			TupleConversionMap *tupconv_map;
+
+			/*
+			 * When an UPDATE is run directly on a leaf partition, partition
+			 * tuple routing is not set up.  In that case, fail with a
+			 * partition constraint violation error.
+			 */
+			if (ptr == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for DELETE. We want to return rows
+			 * from INSERT.
+			 */
+			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
+					   &tuple_deleted, false, false);
+
+			/*
+			 * If for some reason the DELETE didn't happen (e.g. a trigger
+			 * prevented it, or the row was already deleted by ourselves or
+			 * concurrently by another transaction), we should skip the INSERT
+			 * as well; otherwise we would effectively insert one new row.
+			 *
+			 * For a normal UPDATE, the case where the tuple has been the
+			 * subject of a concurrent UPDATE or DELETE would be handled by
+			 * the EvalPlanQual machinery, but for an UPDATE that we've
+			 * translated into a DELETE from this partition and an INSERT into
+			 * some other partition, that's not available, because CTID chains
+			 * can't span relation boundaries.  We mimic the semantics to a
+			 * limited extent by skipping the INSERT if the DELETE fails to
+			 * find a tuple. This ensures that two concurrent attempts to
+			 * UPDATE the same tuple at the same time can't turn one tuple
+			 * into two, and that an UPDATE of a just-deleted tuple can't
+			 * resurrect it.
+			 */
+			if (!tuple_deleted)
+				return NULL;
+
+			/*
+			 * UPDATEs set the transition capture map only when a new subplan
+			 * is chosen.  But for INSERTs, it is set for each row. So after
+			 * the INSERT, we need to revert to the map created for UPDATE;
+			 * otherwise the next UPDATE would incorrectly use the one created
+			 * for INSERT.  So first save the one created for UPDATE.
+			 */
+			if (mtstate->mt_transition_capture)
+				saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
+
+			/*
+			 * resultRelInfo is one of the per-subplan resultRelInfos.  So we
+			 * should convert the tuple into root's tuple descriptor, since
+			 * ExecInsert() starts the search from root.  The tuple conversion
+			 * map list is in the order of mtstate->resultRelInfo[], so to
+			 * retrieve the one for this resultRel, we need to know the
+			 * position of the resultRel in mtstate->resultRelInfo[].
+			 */
+			map_index = resultRelInfo - mtstate->resultRelInfo;
+			Assert(map_index >= 0 && map_index < mtstate->mt_nplans);
+			tupconv_map = tupconv_map_for_subplan(mtstate, map_index);
+			tuple = ConvertPartitionTupleSlot(mtstate,
+											  tupconv_map,
+											  tuple,
+											  ptr->root_tuple_slot,
+											  &slot);
+
+
+			/*
+			 * For ExecInsert(), make it look like we are inserting into the
+			 * root.
+			 */
+			Assert(mtstate->rootResultRelInfo != NULL);
+			estate->es_result_relation_info = mtstate->rootResultRelInfo;
+
+			ret_slot = ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+
+			/*
+			 * Revert back the active result relation and the active transition
+			 * capture map that we changed above.
+			 */
+			estate->es_result_relation_info = resultRelInfo;
+			if (mtstate->mt_transition_capture)
+			{
+				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+				mtstate->mt_transition_capture->tcs_map = saved_tcs_map;
+			}
+			return ret_slot;
+		}
 
 		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We've already checked the partition constraint above; however, we
+		 * must still ensure the tuple passes all other constraints, so we will
+		 * call ExecConstraints() and have it validate all remaining checks.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1477,7 +1702,6 @@ static void
 ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 {
 	ResultRelInfo *targetRelInfo = getASTriggerResultRelInfo(mtstate);
-	int			i;
 
 	/* Check for transition tables on the directly targeted relation. */
 	mtstate->mt_transition_capture =
@@ -1507,55 +1731,142 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 							 ptr->num_partitions :
 							 mtstate->mt_nplans);
 
+		ExecSetupChildParentMap(mtstate, targetRelInfo, numResultRelInfos,
+								(ptr != NULL));
+
+		/*
+		 * Install the conversion map for the first plan for UPDATE and DELETE
+		 * operations.  It will be advanced each time we switch to the next
+		 * plan.  (INSERT operations set it every time, so we need not update
+		 * mtstate->mt_oc_transition_capture here.)
+		 */
+		if (mtstate->mt_transition_capture && mtstate->operation != CMD_INSERT)
+			mtstate->mt_transition_capture->tcs_map =
+				tupconv_map_for_subplan(mtstate, 0);
+	}
+}
+
+/*
+ * Initialize the child-to-root tuple conversion map array.
+ *
+ * This map array is required for two purposes:
+ * 1. For update-tuple-routing. We need to convert the tuple from the subplan
+ * result rel to the root partitioned table descriptor.
+ * 2. For capturing transition tables that are partitions. For UPDATEs, we need
+ * to convert the tuple from subplan result rel to target table descriptor,
+ * and for INSERTs, we need to convert the inserted tuple from leaf partition
+ * to the target table descriptor.
+ *
+ * The caller can request either a per-subplan map or per-leaf-partition map.
+ */
+static void
+ExecSetupChildParentMap(ModifyTableState *mtstate,
+						ResultRelInfo *rootRelInfo,
+						int numResultRelInfos, bool perleaf)
+{
+	TupleDesc	outdesc;
+	int			i;
+
+	/* First check if there is already one */
+	if (mtstate->mt_childparent_tupconv_maps)
+	{
 		/*
-		 * Build array of conversion maps from each child's TupleDesc to the
-		 * one used in the tuplestore.  The map pointers may be NULL when no
-		 * conversion is necessary, which is hopefully a common case for
-		 * partitions.
+		 * If a per-leaf map is required and a map already exists, that map
+		 * has to be per-leaf.  A per-subplan map cannot be accessed
+		 * leaf-partition-wise; a per-leaf map, on the other hand, can still
+		 * be accessed subplan-wise, because tupconv_map_for_subplan() can
+		 * translate a subplan index to a leaf-partition index using the
+		 * subplan_partition_offsets array.  So if callers might need to
+		 * access the map both leaf-partition-wise and subplan-wise, they
+		 * must ensure that the first call to this function passes
+		 * perleaf=true, so that the map created is per-leaf, not
+		 * per-subplan.
 		 */
-		mtstate->mt_transition_tupconv_maps = (TupleConversionMap **)
-			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+		Assert(!(perleaf && !mtstate->mt_is_tupconv_perpart));
+		return;
+	}
 
-		/* Choose the right set of partitions */
-		if (ptr != NULL)
-		{
-			/*
-			 * For tuple routing among partitions, we need TupleDescs based on
-			 * the partition routing table.
-			 */
-			ResultRelInfo **resultRelInfos = ptr->partitions;
+	/* Get tuple descriptor of the root partitioned table. */
+	outdesc = RelationGetDescr(rootRelInfo->ri_RelationDesc);
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
-		}
-		else
-		{
-			/* Otherwise we need the ResultRelInfo for each subplan. */
-			ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+	/*
+	 * Build array of conversion maps from each child's TupleDesc to the
+	 * one used in the tuplestore.  The map pointers may be NULL when no
+	 * conversion is necessary, which is hopefully a common case for
+	 * partitions.
+	 */
+	mtstate->mt_childparent_tupconv_maps = (TupleConversionMap **)
+		palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
+	/* Choose the right set of partitions */
+	if (perleaf)
+	{
+		/*
+		 * For tuple routing among partitions, we need TupleDescs based on the
+		 * partition routing table.
+		 */
+		ResultRelInfo **resultRelInfos;
+
+		Assert(mtstate->mt_partition_tuple_routing != NULL);
+		resultRelInfos = mtstate->mt_partition_tuple_routing->partitions;
+
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_childparent_tupconv_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
+									   outdesc,
+									   gettext_noop("could not convert row type"));
 		}
 
 		/*
-		 * Install the conversion map for the first plan for UPDATE and DELETE
-		 * operations.  It will be advanced each time we switch to the next
-		 * plan.  (INSERT operations set it every time, so we need not update
-		 * mtstate->mt_oc_transition_capture here.)
+		 * Save the info that the tuple conversion map is per-leaf, not
+		 * per-subplan.
 		 */
-		if (mtstate->mt_transition_capture)
-			mtstate->mt_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[0];
+		mtstate->mt_is_tupconv_perpart = true;
+	}
+	else
+	{
+		/* Otherwise we need the ResultRelInfo for each subplan. */
+		ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_childparent_tupconv_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
+									   outdesc,
+									   gettext_noop("could not convert row type"));
+		}
+	}
+
+}
+
+/*
+ * For a given subplan index, get the tuple conversion map.
+ */
+static TupleConversionMap *
+tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
+{
+	Assert(mtstate->mt_childparent_tupconv_maps != NULL);
+
+	/*
+	 * If the tuple conversion map array is per-partition, we first need to
+	 * the index into the partition array.
+	 */
+	if (mtstate->mt_is_tupconv_perpart)
+	{
+		int leaf_index;
+		PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;
+
+		Assert(ptr && ptr->subplan_partition_offsets != NULL);
+		leaf_index = ptr->subplan_partition_offsets[whichplan];
+
+		Assert(leaf_index >= 0 && leaf_index < ptr->num_partitions);
+		return mtstate->mt_childparent_tupconv_maps[leaf_index];
+	}
+	else
+	{
+		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+		return mtstate->mt_childparent_tupconv_maps[whichplan];
 	}
 }
 
@@ -1662,15 +1973,13 @@ ExecModifyTable(PlanState *pstate)
 				/* Prepare to convert transition tuples from this child. */
 				if (node->mt_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
 					node->mt_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						tupconv_map_for_subplan(node, node->mt_whichplan);
 				}
 				if (node->mt_oc_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
 					node->mt_oc_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						tupconv_map_for_subplan(node, node->mt_whichplan);
 				}
 				continue;
 			}
@@ -1787,7 +2096,8 @@ ExecModifyTable(PlanState *pstate)
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1832,9 +2142,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ResultRelInfo *resultRelInfo;
 	TupleDesc	tupDesc;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	bool		update_tuple_routing_needed = node->partKeyUpdated;
 	PartitionTupleRouting *ptr = NULL;
 	int			num_partitions = 0;
 
@@ -1909,6 +2222,16 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(resultRelInfo, mtstate->mt_onconflict != ONCONFLICT_NONE);
 
+		/*
+		 * If this is an UPDATE and a BEFORE UPDATE trigger is present, the
+		 * trigger itself might modify the partition-key values. So arrange for
+		 * tuple routing.
+		 */
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+			operation == CMD_UPDATE)
+			update_tuple_routing_needed = true;
+
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
 		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
@@ -1946,17 +2269,36 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/*
+	 * If it's not a partitioned table after all, UPDATE tuple routing should
+	 * not be attempted.
+	 */
+	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		update_tuple_routing_needed = false;
+
+	/*
+	 * Build state for tuple routing if it's an INSERT or if it's an UPDATE of
+	 * partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || update_tuple_routing_needed))
 	{
 		ExecSetupPartitionTupleRouting(rel,
+									   mtstate->resultRelInfo,
+									   (operation == CMD_UPDATE ? nplans : 0),
 									   node->nominalRelation,
 									   estate,
 									   &mtstate->mt_partition_tuple_routing);
 
 		ptr = mtstate->mt_partition_tuple_routing;
 		num_partitions = ptr->num_partitions;
+
+		/*
+		 * These are needed as reference objects for mapping partition
+		 * attnos in expressions such as WithCheckOptions and RETURNING.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	}
 
 	/*
@@ -1967,6 +2309,18 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		ExecSetupTransitionCaptureState(mtstate, estate);
 
 	/*
+	 * Construct mapping from each of the per-subplan partition attnos to the
+	 * root attno.  This is required when during update row movement the tuple
+	 * descriptor of a source partition does not match the root partitioned
+	 * table descriptor.  In such a case we need to convert tuples to the root
+	 * tuple descriptor, because the search for destination partition starts
+	 * from the root.  Skip this setup if it's not a partition key update.
+	 */
+	if (update_tuple_routing_needed)
+		ExecSetupChildParentMap(mtstate, getASTriggerResultRelInfo(mtstate),
+								mtstate->mt_nplans, false);
+
+	/*
 	 * Initialize any WITH CHECK OPTION constraints if needed.
 	 */
 	resultRelInfo = mtstate->resultRelInfo;
@@ -1996,26 +2350,29 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE row
+	 * movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *first_wcoList;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition.  For UPDATE, on the other hand, there is one WCO list
+		 * per plan.  In either case, use the WCO expression of the first
+		 * resultRelInfo as a reference to calculate attnos for the WCO
+		 * expression of each of the partitions.  We make a copy of the WCO
+		 * qual for each partition.  Note that, if there are SubPlans in
+		 * there, they all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
-			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
+		Assert(update_tuple_routing_needed ||
+			   (operation == CMD_INSERT &&
+				list_length(node->withCheckOptionLists) == 1 &&
+				mtstate->mt_nplans == 1));
+
+		first_wcoList = linitial(node->withCheckOptionLists);
 		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
@@ -2024,17 +2381,26 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			ListCell   *ll;
 
 			resultRelInfo = ptr->partitions[i];
+
+			/*
+			 * If this resultRelInfo is one of the UPDATE subplan result
+			 * rels, its WithCheckOptions have already been initialized
+			 * above, so skip it.
+			 */
+			if (resultRelInfo->ri_WithCheckOptions)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel, NULL);
+			mapped_wcoList = map_partition_varattnos(first_wcoList,
+													firstVarno,
+													partrel, firstResultRel,
+													NULL);
 			foreach(ll, mapped_wcoList)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
@@ -2051,7 +2417,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -2088,22 +2454,35 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *rlist;
 
 			resultRelInfo = ptr->partitions[i];
+
+			/*
+			 * If this resultRelInfo is one of the UPDATE subplan result
+			 * rels, its RETURNING projection has already been built
+			 * above, so skip it.
+			 */
+			if (resultRelInfo->ri_projectReturning)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel, NULL);
+			/*
+			 * Use the returning expression of the first resultRelInfo as a
+			 * reference to calculate attno's for the returning expression of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList,
+											firstVarno,
+											partrel, firstResultRel, NULL);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
@@ -2348,6 +2727,7 @@ void
 ExecEndModifyTable(ModifyTableState *node)
 {
 	int			i;
+	CmdType		operation = node->operation;
 
 	/*
 	 * Allow any FDWs to shut down
@@ -2385,11 +2765,24 @@ ExecEndModifyTable(ModifyTableState *node)
 		for (i = 0; i < ptr->num_partitions; i++)
 		{
 			ResultRelInfo *resultRelInfo = ptr->partitions[i];
+
+			/*
+			 * If this result rel is one of the subplan result rels, let
+			 * ExecEndPlan() close it. For INSERTs, this does not apply because
+			 * leaf partition result rels are always newly allocated.
+			 */
+			if (operation == CMD_UPDATE &&
+				resultRelInfo >= node->resultRelInfo &&
+				resultRelInfo < node->resultRelInfo + node->mt_nplans)
+				continue;
+
 			ExecCloseIndices(resultRelInfo);
 			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 		}
 
-		/* Release the standalone partition tuple descriptor, if any */
+		/* Release the standalone partition tuple descriptors, if any */
+		if (ptr->root_tuple_slot)
+			ExecDropSingleTupleTableSlot(ptr->root_tuple_slot);
 		if (ptr->partition_tuple_slot)
 			ExecDropSingleTupleTableSlot(ptr->partition_tuple_slot);
 	}
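To summarize the executor changes above: when ExecUpdate() finds that the partition constraint fails, it deletes the row from the old partition and re-inserts it through tuple routing from the root, and it skips the INSERT whenever the DELETE found no tuple (so a concurrent delete cannot turn one row into two). A standalone miniature of that control flow, where every type and helper is a simplified stand-in rather than the real executor API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of the row-movement flow; not the real API. */

typedef struct { int key; } Tuple;

/* A "partition" accepts keys in [lo, hi) and holds at most one row. */
typedef struct { int lo, hi; Tuple *row; } Partition;

static bool
partition_check(const Partition *p, const Tuple *t)
{
	return t->key >= p->lo && t->key < p->hi;
}

/* Mimics ExecDelete() with the new tuple_deleted out-parameter. */
static void
exec_delete(Partition *p, bool *tuple_deleted)
{
	*tuple_deleted = (p->row != NULL);	/* false if concurrently deleted */
	p->row = NULL;
}

/* Mimics ExecInsert() after re-routing the tuple from the root. */
static bool
exec_insert(Partition *parts, int nparts, Tuple *t)
{
	for (int i = 0; i < nparts; i++)
	{
		if (partition_check(&parts[i], t))
		{
			parts[i].row = t;
			return true;
		}
	}
	return false;				/* no partition found: error case */
}

/* UPDATE that may move the row; returns true iff the new row exists. */
static bool
exec_update(Partition *parts, int nparts, int which, Tuple *newrow)
{
	bool	tuple_deleted;

	if (partition_check(&parts[which], newrow))
	{
		parts[which].row = newrow;	/* ordinary in-place update */
		return true;
	}

	/* Partition constraint failed: DELETE from old, INSERT via routing. */
	exec_delete(&parts[which], &tuple_deleted);

	/* If the DELETE found nothing, skip the INSERT too. */
	if (!tuple_deleted)
		return false;

	return exec_insert(parts, nparts, newrow);
}
```

This mirrors only the skip-INSERT-if-not-deleted semantics discussed in the patch comments; the real code additionally handles triggers, transition capture, and tuple conversion.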
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index aff9a62..2682cf2 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -204,6 +204,7 @@ _copyModifyTable(const ModifyTable *from)
 	COPY_SCALAR_FIELD(canSetTag);
 	COPY_SCALAR_FIELD(nominalRelation);
 	COPY_NODE_FIELD(partitioned_rels);
+	COPY_SCALAR_FIELD(partKeyUpdated);
 	COPY_NODE_FIELD(resultRelations);
 	COPY_SCALAR_FIELD(resultRelIndex);
 	COPY_SCALAR_FIELD(rootResultRelIndex);
@@ -2261,6 +2262,7 @@ _copyPartitionedChildRelInfo(const PartitionedChildRelInfo *from)
 
 	COPY_SCALAR_FIELD(parent_relid);
 	COPY_NODE_FIELD(child_rels);
+	COPY_SCALAR_FIELD(is_partition_key_update);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 2e869a9..b4b7639 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -908,6 +908,7 @@ _equalPartitionedChildRelInfo(const PartitionedChildRelInfo *a, const Partitione
 {
 	COMPARE_SCALAR_FIELD(parent_relid);
 	COMPARE_NODE_FIELD(child_rels);
+	COMPARE_SCALAR_FIELD(is_partition_key_update);
 
 	return true;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index c97ee24..a5e71a2 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -372,6 +372,7 @@ _outModifyTable(StringInfo str, const ModifyTable *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(partKeyUpdated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_INT_FIELD(resultRelIndex);
 	WRITE_INT_FIELD(rootResultRelIndex);
@@ -2103,6 +2104,7 @@ _outModifyTablePath(StringInfo str, const ModifyTablePath *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(partKeyUpdated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_NODE_FIELD(subroots);
@@ -2525,6 +2527,7 @@ _outPartitionedChildRelInfo(StringInfo str, const PartitionedChildRelInfo *node)
 
 	WRITE_UINT_FIELD(parent_relid);
 	WRITE_NODE_FIELD(child_rels);
+	WRITE_BOOL_FIELD(is_partition_key_update);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 7eb67fc0..9542b94 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1568,6 +1568,7 @@ _readModifyTable(void)
 	READ_BOOL_FIELD(canSetTag);
 	READ_UINT_FIELD(nominalRelation);
 	READ_NODE_FIELD(partitioned_rels);
+	READ_BOOL_FIELD(partKeyUpdated);
 	READ_NODE_FIELD(resultRelations);
 	READ_INT_FIELD(resultRelIndex);
 	READ_INT_FIELD(rootResultRelIndex);
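The node copy/out/read changes above only transport the new fields; the indexing scheme they ultimately feed is tupconv_map_for_subplan() in the executor hunks earlier: when the child-to-root conversion-map array is stored per leaf partition, a subplan index must first be translated to a leaf index through subplan_partition_offsets. A standalone miniature of that two-level lookup, with all names being simplified stand-ins:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for TupleConversionMap; only identity matters here. */
typedef struct { const char *name; } Map;

typedef struct
{
	Map	  **maps;				/* per-subplan OR per-leaf array */
	bool	perleaf;			/* which layout maps[] uses */
	int	   *subplan_offsets;	/* subplan index -> leaf index */
	int		nmaps;
} MapState;

/* Mimics tupconv_map_for_subplan(): per-leaf arrays need an extra
 * indirection before indexing; per-subplan arrays are indexed directly. */
static Map *
map_for_subplan(const MapState *st, int whichplan)
{
	int		idx = whichplan;

	if (st->perleaf)
		idx = st->subplan_offsets[whichplan];

	assert(idx >= 0 && idx < st->nmaps);
	return st->maps[idx];
}
```

This is why the patch comment insists that the first call to ExecSetupChildParentMap() pass perleaf=true when both access patterns are needed: a per-leaf array can serve subplan-wise lookups through the offsets, but not vice versa.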
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 44f6b03..be34463 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1359,7 +1359,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 			case RTE_RELATION:
 				if (rte->relkind == RELKIND_PARTITIONED_TABLE)
 					partitioned_rels =
-						get_partitioned_child_rels(root, rel->relid);
+						get_partitioned_child_rels(root, rel->relid, NULL);
 				break;
 			case RTE_SUBQUERY:
 				build_partitioned_rels = true;
@@ -1397,7 +1397,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		{
 			List	   *cprels;
 
-			cprels = get_partitioned_child_rels(root, childrel->relid);
+			cprels = get_partitioned_child_rels(root, childrel->relid, NULL);
 			partitioned_rels = list_concat(partitioned_rels,
 										   list_copy(cprels));
 		}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index d445477..549821e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -278,6 +278,7 @@ static ProjectSet *make_project_set(List *tlist, Plan *subplan);
 static ModifyTable *make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool partKeyUpdated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
@@ -2371,6 +2372,7 @@ create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 							best_path->canSetTag,
 							best_path->nominalRelation,
 							best_path->partitioned_rels,
+							best_path->partKeyUpdated,
 							best_path->resultRelations,
 							subplans,
 							best_path->withCheckOptionLists,
@@ -6428,6 +6430,7 @@ static ModifyTable *
 make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool partKeyUpdated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam)
@@ -6454,6 +6457,7 @@ make_modifytable(PlannerInfo *root,
 	node->canSetTag = canSetTag;
 	node->nominalRelation = nominalRelation;
 	node->partitioned_rels = partitioned_rels;
+	node->partKeyUpdated = partKeyUpdated;
 	node->resultRelations = resultRelations;
 	node->resultRelIndex = -1;	/* will be set correctly in setrefs.c */
 	node->rootResultRelIndex = -1;	/* will be set correctly in setrefs.c */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index ef2eaea..ce26bbe 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1101,6 +1101,7 @@ inheritance_planner(PlannerInfo *root)
 	Query	   *parent_parse;
 	Bitmapset  *parent_relids = bms_make_singleton(top_parentRTindex);
 	PlannerInfo **parent_roots = NULL;
+	bool		partColsUpdated = false;
 
 	Assert(parse->commandType != CMD_INSERT);
 
@@ -1172,7 +1173,8 @@ inheritance_planner(PlannerInfo *root)
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		nominalRelation = top_parentRTindex;
-		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex);
+		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex,
+													  &partColsUpdated);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
 	}
@@ -1512,6 +1514,7 @@ inheritance_planner(PlannerInfo *root)
 									 parse->canSetTag,
 									 nominalRelation,
 									 partitioned_rels,
+									 partColsUpdated,
 									 resultRelations,
 									 subpaths,
 									 subroots,
@@ -2123,6 +2126,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 										parse->canSetTag,
 										parse->resultRelation,
 										NIL,
+										false,
 										list_make1_int(parse->resultRelation),
 										list_make1(path),
 										list_make1(root),
@@ -6152,17 +6156,22 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 /*
  * get_partitioned_child_rels
  *		Returns a list of the RT indexes of the partitioned child relations
- *		with rti as the root parent RT index.
+ *		with rti as the root parent RT index.  If is_partition_key_update is
+ *		non-NULL, *is_partition_key_update is set to true if any of the root
+ *		rte's updated columns is a partition key column.
  *
  * Note: This function might get called even for range table entries that
  * are not partitioned tables; in such a case, it will simply return NIL.
  */
 List *
-get_partitioned_child_rels(PlannerInfo *root, Index rti)
+get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   bool *is_partition_key_update)
 {
 	List	   *result = NIL;
 	ListCell   *l;
 
+	if (is_partition_key_update)
+		*is_partition_key_update = false;
+
 	foreach(l, root->pcinfo_list)
 	{
 		PartitionedChildRelInfo *pc = lfirst_node(PartitionedChildRelInfo, l);
@@ -6170,6 +6179,8 @@ get_partitioned_child_rels(PlannerInfo *root, Index rti)
 		if (pc->parent_relid == rti)
 		{
 			result = pc->child_rels;
+			if (is_partition_key_update)
+				*is_partition_key_update = pc->is_partition_key_update;
 			break;
 		}
 	}
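To make the revised get_partitioned_child_rels() contract easier to follow, here is a minimal Python sketch of the lookup logic (illustrative only, not PostgreSQL code; the struct and field names mirror the patch, everything else is assumed):

```python
# Sketch of the new out-parameter contract: return the child-rel list for
# the given root RT index and, as a second result, whether any partition
# key column is updated anywhere in that partition tree.  In the C code
# callers that don't care (allpaths.c) simply pass NULL.

class PartitionedChildRelInfo:
    def __init__(self, parent_relid, child_rels, is_partition_key_update):
        self.parent_relid = parent_relid
        self.child_rels = child_rels
        self.is_partition_key_update = is_partition_key_update

def get_partitioned_child_rels(pcinfo_list, rti):
    """Mirror of the C function: the flag defaults to False, so callers
    get a well-defined answer even when rti has no pcinfo entry."""
    result, key_updated = [], False
    for pc in pcinfo_list:
        if pc.parent_relid == rti:
            result = pc.child_rels
            key_updated = pc.is_partition_key_update
            break
    return result, key_updated

pcinfos = [PartitionedChildRelInfo(1, [1, 3, 5], True),
           PartitionedChildRelInfo(7, [7, 8], False)]
print(get_partitioned_child_rels(pcinfos, 1))   # ([1, 3, 5], True)
print(get_partitioned_child_rels(pcinfos, 9))   # ([], False)
```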
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index f620243..7babb35 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -105,7 +105,8 @@ static void expand_partitioned_rtentry(PlannerInfo *root,
 						   RangeTblEntry *parentrte,
 						   Index parentRTindex, Relation parentrel,
 						   PlanRowMark *top_parentrc, LOCKMODE lockmode,
-						   List **appinfos, List **partitioned_child_rels);
+						   List **appinfos, List **partitioned_child_rels,
+						   bool *is_partition_key_update);
 static void expand_single_inheritance_child(PlannerInfo *root,
 								RangeTblEntry *parentrte,
 								Index parentRTindex, Relation parentrel,
@@ -1466,16 +1467,19 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	if (RelationGetPartitionDesc(oldrelation) != NULL)
 	{
 		List	   *partitioned_child_rels = NIL;
+		bool		is_partition_key_update = false;
 
 		Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
 
 		/*
 		 * If this table has partitions, recursively expand them in the order
-		 * in which they appear in the PartitionDesc.
+		 * in which they appear in the PartitionDesc.  While at it, also note
+		 * whether any partition key columns are being updated.
 		 */
 		expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
 								   lockmode, &root->append_rel_list,
-								   &partitioned_child_rels);
+								   &partitioned_child_rels,
+								   &is_partition_key_update);
 
 		/*
 		 * We keep a list of objects in root, each of which maps a root
@@ -1492,6 +1496,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 			pcinfo = makeNode(PartitionedChildRelInfo);
 			pcinfo->parent_relid = rti;
 			pcinfo->child_rels = partitioned_child_rels;
+			pcinfo->is_partition_key_update = is_partition_key_update;
 			root->pcinfo_list = lappend(root->pcinfo_list, pcinfo);
 		}
 	}
@@ -1568,7 +1573,8 @@ static void
 expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 						   Index parentRTindex, Relation parentrel,
 						   PlanRowMark *top_parentrc, LOCKMODE lockmode,
-						   List **appinfos, List **partitioned_child_rels)
+						   List **appinfos, List **partitioned_child_rels,
+						   bool *is_partition_key_update)
 {
 	int			i;
 	RangeTblEntry *childrte;
@@ -1583,6 +1589,17 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 
 	Assert(parentrte->inh);
 
+	/*
+	 * Note down whether any partition key cols are being updated. Though it's
+	 * the root partitioned table's updatedCols we are interested in, we
+	 * instead use parentrte to get the updatedCols. This is convenient because
+	 * parentrte already has the root partrel's updatedCols translated to match
+	 * the attribute ordering of parentrel.
+	 */
+	if (!*is_partition_key_update)
+		*is_partition_key_update =
+			has_partition_attrs(parentrel, parentrte->updatedCols, NULL);
+
 	/* First expand the partitioned table itself. */
 	expand_single_inheritance_child(root, parentrte, parentRTindex, parentrel,
 									top_parentrc, parentrel,
@@ -1622,7 +1639,8 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 		if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 			expand_partitioned_rtentry(root, childrte, childRTindex,
 									   childrel, top_parentrc, lockmode,
-									   appinfos, partitioned_child_rels);
+									   appinfos, partitioned_child_rels,
+									   is_partition_key_update);
 
 		/* Close child relation, but keep locks */
 		heap_close(childrel, NoLock);
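How expand_partitioned_rtentry() accumulates is_partition_key_update across a nested partition tree can be sketched as follows (a hedged Python illustration, not the actual C code; the real function short-circuits via the `if (!*is_partition_key_update)` test, here written as a plain OR, and the tree shape below is an assumption):

```python
def has_partition_attrs(partkey_cols, updated_cols):
    # Stand-in for the C has_partition_attrs(): does any updated column
    # belong to this table's partition key?
    return bool(partkey_cols & updated_cols)

def expand_partitioned(node, updated_cols):
    """node = (partkey_cols, children); children lists only partitioned
    tables, matching the patch, which recurses only into
    RELKIND_PARTITIONED_TABLE children."""
    partkey_cols, children = node
    result = has_partition_attrs(partkey_cols, updated_cols)
    for child in children:
        result = expand_partitioned(child, updated_cols) or result
    return result

# range_parted partitions on (a, b); part_b_10_b_20 sub-partitions on (c)
tree = ({'a', 'b'}, [({'c'}, [])])
print(expand_partitioned(tree, {'c'}))   # True: a sub-partition's key is updated
print(expand_partitioned(tree, {'d'}))   # False: no partition key column updated
```

This is why updating `c` alone still counts as a partition-key UPDATE for the whole tree: the flag is set as soon as any level's key intersects the updated columns.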
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index bc0841b..965bd09 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3213,6 +3213,8 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partitioned_rels' is an integer list of RT indexes of non-leaf tables in
  *		the partition tree, if this is an UPDATE/DELETE to a partitioned table.
  *		Otherwise NIL.
+ * 'partKeyUpdated' is true if any partitioning columns are being updated,
+ *		either from the named relation or a descendant partitioned table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
  * 'subpaths' is a list of Path(s) producing source data (one per rel)
  * 'subroots' is a list of PlannerInfo structs (one per rel)
@@ -3226,6 +3228,7 @@ ModifyTablePath *
 create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool partKeyUpdated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
@@ -3293,6 +3296,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->canSetTag = canSetTag;
 	pathnode->nominalRelation = nominalRelation;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
+	pathnode->partKeyUpdated = partKeyUpdated;
 	pathnode->resultRelations = resultRelations;
 	pathnode->subpaths = subpaths;
 	pathnode->subroots = subroots;
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 2983cfa..ff49ecc 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -54,12 +54,16 @@ extern void check_new_partition_bound(char *relname, Relation parent,
 extern Oid	get_partition_parent(Oid relid);
 extern List *get_qual_from_partbound(Relation rel, Relation parent,
 						PartitionBoundSpec *spec);
-extern List *map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+extern List *map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
-
+extern void pull_child_partition_columns(Relation rel,
+							 Relation parent,
+							 Bitmapset **partcols);
+extern bool has_partition_attrs(Relation rel, Bitmapset *attnums,
+							bool *used_in_expr);
 extern Oid	get_default_oid_from_partdesc(PartitionDesc partdesc);
 extern Oid	get_default_partition_oid(Oid parentId);
 extern void update_default_partition_oid(Oid parentId, Oid defaultPartId);
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 8a7cedf..24d66dc 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -62,11 +62,14 @@ typedef struct PartitionDispatchData *PartitionDispatch;
  *								for every leaf partition in the partition tree.
  * num_partitions				Number of leaf partitions in the partition tree
  *								(= 'partitions' array length)
- * partition_tupconv_maps		Array of TupleConversionMap objects with one
+ * parentchild_tupconv_maps		Array of TupleConversionMap objects with one
  *								entry for every leaf partition (required to
  *								convert input tuple based on the root table's
  *								rowtype to a leaf partition's rowtype after
  *								tuple routing is done)
+ * subplan_partition_offsets	Integer array ordered by UPDATE subplans. Each
+ *								element contains the index of the corresponding
+ *								result rel in the 'partitions' array.
  * partition_tuple_slot			TupleTableSlot to be used to manipulate any
  *								given leaf partition's rowtype after that
  *								partition is chosen for insertion by
@@ -79,11 +82,15 @@ typedef struct PartitionTupleRouting
 	int			num_dispatch;
 	ResultRelInfo **partitions;
 	int			num_partitions;
-	TupleConversionMap **partition_tupconv_maps;
+	TupleConversionMap **parentchild_tupconv_maps;
+	int		   *subplan_partition_offsets;
 	TupleTableSlot *partition_tuple_slot;
+	TupleTableSlot *root_tuple_slot;
 } PartitionTupleRouting;
 
 extern void ExecSetupPartitionTupleRouting(Relation rel,
+							   ResultRelInfo *update_rri,
+							   int num_update_rri,
 							   Index resultRTindex,
 							   EState *estate,
 							   PartitionTupleRouting **partition_tuple_routing);
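A small illustration of what subplan_partition_offsets encodes (Python, purely illustrative; partition names are borrowed from the regression tests below, the rest is assumed): the i-th element gives the position in the 'partitions' array of the i-th UPDATE subplan's result relation, which lets tuple routing reuse the already-open per-subplan result rels rather than opening duplicates.

```python
# 'partitions' holds one entry per leaf partition, in partition-bound order.
partitions = ['part_a_1_a_10', 'part_a_10_a_20', 'part_c_1_100', 'part_d_1_15']

# Suppose the UPDATE produced subplans for only two leaves, in subplan order:
subplan_result_rels = ['part_a_10_a_20', 'part_d_1_15']

# The offsets array maps subplan index -> index into 'partitions'.
subplan_partition_offsets = [partitions.index(r) for r in subplan_result_rels]
print(subplan_partition_offsets)          # [1, 3]

# Routing a moved tuple to partitions[3] can then reuse subplan 1's
# result rel instead of building a fresh ResultRelInfo.
assert partitions[subplan_partition_offsets[1]] == 'part_d_1_15'
```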
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index b5578f5..5a385e2 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,9 +187,12 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
-extern void ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
 				   TupleTableSlot *slot, EState *estate);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 6b481b4..23f985e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -982,8 +982,10 @@ typedef struct ModifyTableState
 	/* controls transition table population for specified operation */
 	struct TransitionCaptureState *mt_oc_transition_capture;
 	/* controls transition table population for INSERT...ON CONFLICT UPDATE */
-	TupleConversionMap **mt_transition_tupconv_maps;
-	/* Per plan/partition tuple conversion */
+	TupleConversionMap **mt_childparent_tupconv_maps;
+	/* Per plan/partition map for tuple conversion from child to root */
+	bool		mt_is_tupconv_perpart;	/* is the above map per-partition? */
+	/* Stores position of update result rels in leaf partitions */
 } ModifyTableState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 9b38d44..b36dafc 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -219,6 +219,7 @@ typedef struct ModifyTable
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		partKeyUpdated;	/* some part key in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	int			resultRelIndex; /* index of first resultRel in plan's list */
 	int			rootResultRelIndex; /* index of the partitioned table root */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 51df8e9..1448663 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1667,6 +1667,7 @@ typedef struct ModifyTablePath
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		partKeyUpdated;	/* some part key in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	List	   *subpaths;		/* Path(s) producing source data */
 	List	   *subroots;		/* per-target-table PlannerInfos */
@@ -2117,6 +2118,9 @@ typedef struct PartitionedChildRelInfo
 
 	Index		parent_relid;
 	List	   *child_rels;
+	bool		is_partition_key_update;	/* is the partition key of any of
+											 * the partitioned tables
+											 * updated? */
 } PartitionedChildRelInfo;
 
 /*
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e9ed16a..39ce47d 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -238,6 +238,7 @@ extern ModifyTablePath *create_modifytable_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool partKeyUpdated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index 2801bfd..9f0533c 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -57,7 +57,8 @@ extern Expr *preprocess_phv_expression(PlannerInfo *root, Expr *expr);
 
 extern bool plan_cluster_use_sort(Oid tableOid, Oid indexOid);
 
-extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti);
+extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   bool *is_partition_key_update);
 extern List *get_partitioned_child_rels_for_join(PlannerInfo *root,
 									Relids join_relids);
 
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index b69ceaa..0dfd3a6 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,36 +198,441 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+---------------------------
+-- UPDATE with row movement
+---------------------------
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+create table part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+alter table range_parted attach partition part_b_20_b_30 for values from ('b', 20) to ('b', 30);
+create table part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+alter table range_parted attach partition part_b_10_b_20 for values from ('b', 10) to ('b', 20);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) partition by range (abs(d));
+alter table part_c_100_200 drop column e, drop column c, drop column a;
+alter table part_c_100_200 add column c numeric, add column e varchar, add column a text;
+alter table part_c_100_200 drop column b;
+alter table part_c_100_200 add column b bigint;
+create table part_d_1_15 partition of part_c_100_200 for values from (1) to (15);
+create table part_d_15_20 partition of part_c_100_200 for values from (15) to (20);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+(6 rows)
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+             QUERY PLAN              
+-------------------------------------
+ Update on range_parted
+   Update on part_a_1_a_10
+   Update on part_a_10_a_20
+   Update on part_b_1_b_10
+   Update on part_c_1_100
+   Update on part_d_1_15
+   Update on part_d_15_20
+   Update on part_b_20_b_30
+   ->  Seq Scan on part_a_1_a_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_a_10_a_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_1_b_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_1_100
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_1_15
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_15_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_20_b_30
+         Filter: (c > '97'::numeric)
+(22 rows)
+
+-- fail (row movement happens only within the partition subtree):
+update part_c_100_200 set c = c - 20, d = c where c = 105;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (105, 85, null, b, 15).
+-- fail (no partition key update, so no attempt to move tuple, but "a = 'a'" violates partition constraint enforced by root partition)
+update part_b_10_b_20 set a = 'a';
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (null, 1, 96, 12, a).
+-- success; partition key update, no constraint violation
+update range_parted set d = d - 10 where d > 10;
+-- success; no partition key update, no constraint violation
+update range_parted set e = d;
+-- No row found:
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d | e 
+----------------+---+----+-----+---+---
+ part_a_10_a_20 | a | 10 | 200 | 1 | 1
+ part_a_1_a_10  | a |  1 |   1 | 1 | 1
+ part_d_1_15    | b | 12 | 116 | 1 | 1
+ part_d_1_15    | b | 13 | 117 | 2 | 2
+ part_d_1_15    | b | 15 | 125 | 6 | 6
+ part_d_1_15    | b | 17 | 125 | 9 | 9
+(6 rows)
+
+-- fail (row movement happens only within the partition subtree):
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_d_1_15" violates partition constraint
+DETAIL:  Failing row contains (2, 117, 2, b, 7).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_a_1_a_10 | a |  4 | 200 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+(6 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120, 1, 1).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120, 1, 1).
+-- succeeds, row movement, check option passes
+update upview set a = 'b', b = 15 where b = 4;
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+ part_d_1_15   | b | 15 | 199 | 1 | 1
+(6 rows)
+
+-- cleanup
+drop view upview;
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+ range_parted  | a | b  | c  | d  | e 
+---------------+---+----+----+----+---
+ (b,15,95,16,) | b | 15 | 95 | 16 | 
+ (b,17,95,19,) | b | 17 | 95 | 19 | 
+(2 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  95 | 16 | 
+ part_c_1_100   | b | 17 |  95 | 19 | 
+(6 rows)
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,110,1,), (b,13,98,2,), (b,15,106,16,), (b,17,106,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 13 |  98 |  2 | 
+ part_d_15_20   | b | 15 | 106 | 16 | 
+ part_d_15_20   | b | 17 | 106 | 19 | 
+ part_d_1_15    | b | 12 | 110 |  1 | 
+(6 rows)
+
+:init_range_parted;
+-- Enabling OLD TABLE capture for both DELETE as well as UPDATE stmt triggers
+-- should not cause DELETEd rows to be captured twice. Similar thing for
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,146,1,), (b,13,147,2,), (b,15,155,16,), (b,17,155,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_15_20   | b | 15 | 155 | 16 | 
+ part_d_15_20   | b | 17 | 155 | 19 | 
+ part_d_1_15    | b | 12 | 146 |  1 | 
+ part_d_1_15    | b | 13 | 147 |  2 | 
+(6 rows)
+
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d1_15 before update or insert on part_d_1_15
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d15_20 before update or insert on part_d_15_20
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 15 |  98 |  2 | 
+ part_d_15_20   | b | 17 | 106 | 16 | 
+ part_d_15_20   | b | 19 | 106 | 19 | 
+ part_d_1_15    | b | 15 | 110 |  1 | 
+(6 rows)
+
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_15_20   | b | 17 | 155 | 16 | 
+ part_d_15_20   | b | 19 | 155 | 19 | 
+ part_d_1_15    | b | 15 | 146 |  1 | 
+ part_d_1_15    | b | 16 | 147 |  2 | 
+(6 rows)
+
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_d1_15 ON part_d_1_15;
+drop trigger trig_d15_20 ON part_d_15_20;
+drop function func_parted_mod_b();
+-- RLS policies with update-row-movement
+-----------------------------------------
+ALTER TABLE range_parted ENABLE ROW LEVEL SECURITY;
+create user regress_range_parted_user;
+GRANT ALL ON range_parted, mintab TO regress_range_parted_user;
+create policy seeall ON range_parted as PERMISSIVE for SELECT using ( true);
+create policy policy_range_parted on range_parted for UPDATE USING (true) WITH CHECK (c % 2 = 0);
+:init_range_parted;
+-- This should fail with RLS violation error while moving row from
+-- part_a_10_a_20 to part_d_1_15, because we are setting 'c' to an odd number.
+set session authorization regress_range_parted_user;
+update range_parted set a = 'b', c = 151 where a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy for table "range_parted"
+reset session authorization;
+-- Create a trigger on part_d_1_15
+create function func_d_1_15() returns trigger as $$
+begin
+   NEW.c = NEW.c + 1; -- Make even number odd, or vice versa
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_d_1_15 before insert on part_d_1_15
+   for each row execute procedure func_d_1_15();
+:init_range_parted;
+set session authorization regress_range_parted_user;
+-- Here, RLS checks should succeed while moving row from part_a_10_a_20 to
+-- part_d_1_15, because trigger makes 'c' value an even number.
+update range_parted set a = 'b', c = 151 where a = 'a' and c = 200;
+reset session authorization;
+:init_range_parted;
+set session authorization regress_range_parted_user;
+-- This should fail with RLS violation error because trigger makes 'c' value
+-- an odd number.
+update range_parted set a = 'b', c = 150 where a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy for table "range_parted"
+-- Cleanup
+reset session authorization;
+drop trigger trig_d_1_15 ON part_d_1_15;
+drop function func_d_1_15();
+-- Policy expression contains SubPlan
+reset session authorization;
+:init_range_parted;
+create policy policy_range_parted_subplan on range_parted AS RESTRICTIVE for UPDATE USING (true)
+    WITH CHECK ((SELECT range_parted.c <= c1 FROM mintab ));
+set session authorization regress_range_parted_user;
+-- Should fail because mintab has row with c1 = 120
+update range_parted set a = 'b', c = 122 where a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy "policy_range_parted_subplan" for table "range_parted"
+-- Should pass
+update range_parted set a = 'b', c = 120 where a = 'a' and c = 200;
+-- RLS policy expression contains whole row.
+reset session authorization;
+:init_range_parted;
+create policy policy_range_parted_wholerow on range_parted AS RESTRICTIVE for UPDATE USING (true)
+   WITH CHECK (range_parted = row('b', 10, 112, 1, NULL)::range_parted);
+set session authorization regress_range_parted_user;
+-- Should succeed the RLS check
+update range_parted set a = 'b', c = 112 where a = 'a' and c = 200;
+reset session authorization;
+:init_range_parted;
+set session authorization regress_range_parted_user;
+-- The whole row RLS check should fail
+update range_parted set a = 'b', c = 116 where a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy "policy_range_parted_wholerow" for table "range_parted"
+-- Cleanup
+reset session authorization;
+drop policy policy_range_parted ON range_parted;
+drop policy policy_range_parted_subplan ON range_parted;
+drop policy policy_range_parted_wholerow ON range_parted;
+revoke all ON range_parted, mintab FROM regress_range_parted_user ;
+drop user regress_range_parted_user;
+-- statement triggers with update row movement
+---------------------------------------------------
+:init_range_parted;
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_1_15
+create trigger d1_delete_trig
+  after delete on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_update_trig
+  after update on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_insert_trig
+  after insert on part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+create trigger d15_delete_trig
+  after delete on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_update_trig
+  after update on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_insert_trig
+  after insert on part_d_15_20 for each statement execute procedure trigfunc();
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+NOTICE:  trigger = parent_update_trig fired on table range_parted during UPDATE
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 150 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  55 | 16 | 
+ part_c_1_100   | b | 17 |  55 | 19 | 
+(6 rows)
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger d1_delete_trig ON part_d_1_15;
+drop trigger d1_update_trig ON part_d_1_15;
+drop trigger d1_insert_trig ON part_d_1_15;
+drop trigger d15_delete_trig ON part_d_15_20;
+drop trigger d15_update_trig ON part_d_15_20;
+drop trigger d15_insert_trig ON part_d_15_20;
+drop table mintab;
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
-                                  Table "public.part_def"
- Column |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description 
---------+---------+-----------+----------+---------+----------+--------------+-------------
- a      | text    |           |          |         | extended |              | 
- b      | integer |           |          |         | plain    |              | 
+                                       Table "public.part_def"
+ Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description 
+--------+-------------------+-----------+----------+---------+----------+--------------+-------------
+ a      | text              |           |          |         | extended |              | 
+ b      | bigint            |           |          |         | plain    |              | 
+ c      | numeric           |           |          |         | main     |              | 
+ d      | integer           |           |          |         | plain    |              | 
+ e      | character varying |           |          |         | extended |              | 
 Partition of: range_parted DEFAULT
-Partition constraint: (NOT ((a IS NOT NULL) AND (b IS NOT NULL) AND (((a = 'a'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'a'::text) AND (b >= 10) AND (b < 20)) OR ((a = 'b'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'b'::text) AND (b >= 10) AND (b < 20)))))
+Partition constraint: (NOT ((a IS NOT NULL) AND (b IS NOT NULL) AND (((a = 'a'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'a'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'b'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '20'::bigint) AND (b < '30'::bigint)))))
 
 insert into range_parted values ('c', 9);
 -- ok
@@ -235,7 +640,55 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 ERROR:  new row for relation "part_def" violates partition constraint
-DETAIL:  Failing row contains (a, 9).
+DETAIL:  Failing row contains (a, 9, null, null, null).
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from non-default to default partition.
+-- Fail, default partition is not under part_a_10_a_20;
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+ERROR:  new row for relation "part_a_10_a_20" violates partition constraint
+DETAIL:  Failing row contains (ad, 10, 200, 1, null).
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+:show_data;
+ partname | a  | b  |  c  | d  | e 
+----------+----+----+-----+----+---
+ part_def | ad |  1 |   1 |  1 | 
+ part_def | ad | 10 | 200 |  1 | 
+ part_def | bd | 12 |  96 |  1 | 
+ part_def | bd | 13 |  97 |  2 | 
+ part_def | bd | 15 | 105 | 16 | 
+ part_def | bd | 17 | 105 | 19 | 
+ part_def | d  |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
 create table list_parted (
 	a text,
 	b int
@@ -250,6 +703,111 @@ ERROR:  new row for relation "list_default" violates partition constraint
 DETAIL:  Failing row contains (a, 10).
 -- ok
 update list_default set a = 'x' where a = 'd';
+drop table list_parted;
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+ERROR:  new row for relation "sub_part2" violates partition constraint
+DETAIL:  Failing row contains (2, 10, 2).
+-- UPDATE that does not modify the partition key of the partitions chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Triggers can cause UPDATE row movement if they modify the partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+drop trigger parted_mod_b ON sub_part1 ;
+-- If a BR DELETE trigger prevents the DELETE from happening, we should also
+-- skip the INSERT when that DELETE is part of UPDATE => DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+drop trigger trig_skip_delete ON sub_part1 ;
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 |  1 | 70
+ list_part1 | 2 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop table non_parted;
+drop function func_parted_mod_b();
 -- create custom operator class and hash function, for the same reason
 -- explained in alter_table.sql
 create or replace function dummy_hashint4(a int4, seed int8) returns int8 as
@@ -271,9 +829,8 @@ insert into hpart4 values (3, 4);
 update hpart1 set a = 3, b=4 where a = 1;
 ERROR:  new row for relation "hpart1" violates partition constraint
 DETAIL:  Failing row contains (3, 4).
+-- ok : row movement
 update hash_parted set b = b - 1 where b = 1;
-ERROR:  new row for relation "hpart1" violates partition constraint
-DETAIL:  Failing row contains (1, 0).
 -- ok
 update hash_parted set b = b + 8 where b = 1;
 -- cleanup
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 0c70d64..53c6441 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,25 +107,311 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+
+---------------------------
+-- UPDATE with row movement
+---------------------------
+
+-- An update to a partition should check the partition bound constraint for the
+-- new tuple. If the partition key is updated, the row should be moved to the
+-- appropriate partition. Updatable views using partitions should enforce the
+-- check options for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+create table part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+alter table range_parted attach partition part_b_20_b_30 for values from ('b', 20) to ('b', 30);
+create table part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+alter table range_parted attach partition part_b_10_b_20 for values from ('b', 10) to ('b', 20);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) partition by range (abs(d));
+alter table part_c_100_200 drop column e, drop column c, drop column a;
+alter table part_c_100_200 add column c numeric, add column e varchar, add column a text;
+alter table part_c_100_200 drop column b;
+alter table part_c_100_200 add column b bigint;
+create table part_d_1_15 partition of part_c_100_200 for values from (1) to (15);
+create table part_d_15_20 partition of part_c_100_200 for values from (15) to (20);
+
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+
+create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+
+-- The subplans should be in partition bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_100_200 set c = c - 20, d = c where c = 105;
+-- fail (no partition key update, so no attempt to move tuple, but "a = 'a'" violates partition constraint enforced by root partition)
+update part_b_10_b_20 set a = 'a';
+-- success; partition key update, no constraint violation
+update range_parted set d = d - 10 where d > 10;
+-- success; no partition key update, no constraint violation
+update range_parted set e = d;
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+:show_data;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+:show_data;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement, check option passes
+update upview set a = 'b', b = 15 where b = 4;
+
+:show_data;
+
+-- cleanup
+drop view upview;
+
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+:show_data;
+
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+
+-- Enabling OLD TABLE capture for both DELETE and UPDATE statement triggers
+-- should not cause deleted rows to be captured twice. The same applies to
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d1_15 before update or insert on part_d_1_15
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d15_20 before update or insert on part_d_15_20
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_d1_15 ON part_d_1_15;
+drop trigger trig_d15_20 ON part_d_15_20;
+drop function func_parted_mod_b();
+
+-- RLS policies with update-row-movement
+-----------------------------------------
+
+ALTER TABLE range_parted ENABLE ROW LEVEL SECURITY;
+create user regress_range_parted_user;
+GRANT ALL ON range_parted, mintab TO regress_range_parted_user;
+create policy seeall ON range_parted as PERMISSIVE for SELECT using ( true);
+create policy policy_range_parted on range_parted for UPDATE USING (true) WITH CHECK (c % 2 = 0);
+
+:init_range_parted;
+-- This should fail with RLS violation error while moving the row from
+-- part_a_10_a_20 to part_d_1_15, because we are setting 'c' to an odd number.
+set session authorization regress_range_parted_user;
+update range_parted set a = 'b', c = 151 where a = 'a' and c = 200;
+
+reset session authorization;
+-- Create a trigger on part_d_1_15
+create function func_d_1_15() returns trigger as $$
+begin
+   NEW.c = NEW.c + 1; -- Make even number odd, or vice versa
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_d_1_15 before insert on part_d_1_15
+   for each row execute procedure func_d_1_15();
+
+:init_range_parted;
+set session authorization regress_range_parted_user;
+
+-- Here, RLS checks should succeed while moving the row from part_a_10_a_20 to
+-- part_d_1_15, because the trigger makes the 'c' value an even number.
+update range_parted set a = 'b', c = 151 where a = 'a' and c = 200;
+
+reset session authorization;
+:init_range_parted;
+set session authorization regress_range_parted_user;
+-- This should fail with RLS violation error because the trigger makes the 'c' value
+-- an odd number.
+update range_parted set a = 'b', c = 150 where a = 'a' and c = 200;
+
+-- Cleanup
+reset session authorization;
+drop trigger trig_d_1_15 ON part_d_1_15;
+drop function func_d_1_15();
+
+-- Policy expression contains SubPlan
+reset session authorization;
+:init_range_parted;
+create policy policy_range_parted_subplan on range_parted AS RESTRICTIVE for UPDATE USING (true)
+    WITH CHECK ((SELECT range_parted.c <= c1 FROM mintab ));
+set session authorization regress_range_parted_user;
+-- Should fail because mintab has row with c1 = 120
+update range_parted set a = 'b', c = 122 where a = 'a' and c = 200;
+-- Should pass
+update range_parted set a = 'b', c = 120 where a = 'a' and c = 200;
+
+-- RLS policy expression contains whole row.
+
+reset session authorization;
+:init_range_parted;
+create policy policy_range_parted_wholerow on range_parted AS RESTRICTIVE for UPDATE USING (true)
+   WITH CHECK (range_parted = row('b', 10, 112, 1, NULL)::range_parted);
+set session authorization regress_range_parted_user;
+-- Should succeed the RLS check
+update range_parted set a = 'b', c = 112 where a = 'a' and c = 200;
+reset session authorization;
+:init_range_parted;
+set session authorization regress_range_parted_user;
+-- The whole row RLS check should fail
+update range_parted set a = 'b', c = 116 where a = 'a' and c = 200;
+
+-- Cleanup
+reset session authorization;
+drop policy policy_range_parted ON range_parted;
+drop policy policy_range_parted_subplan ON range_parted;
+drop policy policy_range_parted_wholerow ON range_parted;
+revoke all ON range_parted, mintab FROM regress_range_parted_user ;
+drop user regress_range_parted_user;
+
+
+-- statement triggers with update row movement
+---------------------------------------------------
+
+:init_range_parted;
+
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_d_1_15
+create trigger d1_delete_trig
+  after delete on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_update_trig
+  after update on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_insert_trig
+  after insert on part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+create trigger d15_delete_trig
+  after delete on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_update_trig
+  after update on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_insert_trig
+  after insert on part_d_15_20 for each statement execute procedure trigfunc();
+
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+:show_data;
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger d1_delete_trig ON part_d_1_15;
+drop trigger d1_update_trig ON part_d_1_15;
+drop trigger d1_insert_trig ON part_d_1_15;
+drop trigger d15_delete_trig ON part_d_15_20;
+drop trigger d15_update_trig ON part_d_15_20;
+drop trigger d15_insert_trig ON part_d_15_20;
+
+drop table mintab;
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
 
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
 insert into range_parted values ('c', 9);
@@ -134,6 +420,21 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 
+:show_data;
+
+-- Update row movement from non-default to default partition.
+-- Fail, default partition is not under part_a_10_a_20;
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+:show_data;
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+:show_data;
+
 create table list_parted (
 	a text,
 	b int
@@ -148,6 +449,84 @@ update list_default set a = 'a' where a = 'd';
 -- ok
 update list_default set a = 'x' where a = 'd';
 
+drop table list_parted;
+
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+
+-- UPDATE that does not modify the partition key of the partitions chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+
+
+-----------
+-- Triggers can cause UPDATE row movement if they modify the partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger parted_mod_b ON sub_part1 ;
+
+-- If a BR DELETE trigger prevents the DELETE from happening, we should also
+-- skip the INSERT when that DELETE is part of UPDATE => DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger trig_skip_delete ON sub_part1 ;
+
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+drop table non_parted;
+
+drop function func_parted_mod_b();
+
 -- create custom operator class and hash function, for the same reason
 -- explained in alter_table.sql
 create or replace function dummy_hashint4(a int4, seed int8) returns int8 as
@@ -169,6 +548,7 @@ insert into hpart4 values (3, 4);
 
 -- fail
 update hpart1 set a = 3, b=4 where a = 1;
+-- ok : row movement
 update hash_parted set b = b - 1 where b = 1;
 -- ok
 update hash_parted set b = b + 8 where b = 1;
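
To summarize the behavior the tests above exercise, here is a toy model (plain Python, not PostgreSQL code; all class and function names are made up for illustration): an UPDATE whose new partition key no longer satisfies the current partition's bounds is executed as a DELETE from the old partition followed by an INSERT into the partition that accepts the new key.

```python
# Toy model of UPDATE row movement across range partitions: when the new
# partition key no longer fits the current partition's bounds, the UPDATE
# becomes a DELETE from the old partition plus an INSERT into the partition
# that accepts the new key.
class PartitionedTable:
    def __init__(self, bounds):
        # bounds: list of (lo, hi) pairs; each leaf holds keys in [lo, hi)
        self.parts = {b: [] for b in bounds}

    def _route(self, key):
        for (lo, hi), rows in self.parts.items():
            if lo <= key < hi:
                return rows
        raise ValueError("no partition found for key %r" % key)

    def insert(self, row):
        self._route(row["b"]).append(row)

    def update_b(self, match, new_b):
        for (lo, hi), rows in self.parts.items():
            for row in list(rows):   # snapshot: moved rows are not re-visited
                if match(row):
                    if lo <= new_b < hi:
                        row["b"] = new_b          # in-place UPDATE
                    else:
                        rows.remove(row)          # DELETE from old partition
                        row["b"] = new_b
                        self.insert(row)          # INSERT into new partition

t = PartitionedTable([(1, 10), (10, 20)])
t.insert({"b": 5})
t.update_b(lambda r: r["b"] == 5, 15)   # row moves from (1,10) to (10,20)
```

If no partition accepts the new key, `_route` raises an error, mirroring the partition constraint violation errors in the expected output above.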
#209 Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#208)
2 attachment(s)
Re: [HACKERS] UPDATE of partition key

On 1 December 2017 at 17:24, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Attached is the v28 patch, which has the fix for this issue as described
above. In ExecUpdate(), if the partition constraint fails, we skip
ExecWithCheckOptions(), and later in ExecInsert() it gets called with
WCO_RLS_UPDATE_CHECK.
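For anyone trying this out, here is a minimal, hypothetical illustration of
the row-movement behavior (the table and partition names below are made up
and are not part of the regression tests). With the patch applied, the
UPDATE is expected to relocate the row rather than raise a
partition-constraint error:

```sql
-- Hypothetical example: a range-partitioned table, then an UPDATE that
-- changes the partition key so the row no longer fits its partition.
create table demo (a int, b int) partition by range (a);
create table demo_p1 partition of demo for values from (1) to (10);
create table demo_p2 partition of demo for values from (10) to (20);
insert into demo values (5, 50);

-- Without the patch this fails with a partition-constraint violation;
-- with it, the row is deleted from demo_p1 and re-inserted into demo_p2.
update demo set a = 15 where b = 50;
select tableoid::regclass, * from demo;
```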

Amit Langote informed me off-list, along with suggestions for changes,
that my patch needs a rebase. Attached is the rebased version. I have
also bumped the patch version number (now v29), because this has
additional changes, again suggested by Amit L: since
ExecSetupPartitionTupleRouting() now has an mtstate parameter, there is
no need to pass update_rri and num_update_rri, as they can be retrieved
from mtstate.

The preparatory patch is rebased as well.

Thanks Amit Langote.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

encapsulate_partinfo_preparatory_rebased.patch (application/octet-stream)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 254be28..f1149ed 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -166,12 +166,9 @@ typedef struct CopyStateData
 	bool		volatile_defexprs;	/* is any of defexprs volatile? */
 	List	   *range_table;
 
-	PartitionDispatch *partition_dispatch_info;
-	int			num_dispatch;	/* Number of entries in the above array */
-	int			num_partitions; /* Number of members in the following arrays */
-	ResultRelInfo **partitions; /* Per partition result relation pointers */
-	TupleConversionMap **partition_tupconv_maps;
-	TupleTableSlot *partition_tuple_slot;
+	PartitionTupleRouting *partition_tuple_routing;
+	/* Tuple-routing support info */
+
 	TransitionCaptureState *transition_capture;
 	TupleConversionMap **transition_tupconv_maps;
 
@@ -2472,28 +2469,15 @@ CopyFrom(CopyState cstate)
 	 */
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo **partitions;
-		TupleConversionMap **partition_tupconv_maps;
-		TupleTableSlot *partition_tuple_slot;
-		int			num_parted,
-					num_partitions;
+		PartitionTupleRouting *ptr;
 
 		ExecSetupPartitionTupleRouting(NULL,
 									   cstate->rel,
 									   1,
 									   estate,
-									   &partition_dispatch_info,
-									   &partitions,
-									   &partition_tupconv_maps,
-									   &partition_tuple_slot,
-									   &num_parted, &num_partitions);
-		cstate->partition_dispatch_info = partition_dispatch_info;
-		cstate->num_dispatch = num_parted;
-		cstate->partitions = partitions;
-		cstate->num_partitions = num_partitions;
-		cstate->partition_tupconv_maps = partition_tupconv_maps;
-		cstate->partition_tuple_slot = partition_tuple_slot;
+									   &cstate->partition_tuple_routing);
+
+		ptr = cstate->partition_tuple_routing;
 
 		/*
 		 * If we are capturing transition tuples, they may need to be
@@ -2506,11 +2490,11 @@ CopyFrom(CopyState cstate)
 			int			i;
 
 			cstate->transition_tupconv_maps = (TupleConversionMap **)
-				palloc0(sizeof(TupleConversionMap *) * cstate->num_partitions);
-			for (i = 0; i < cstate->num_partitions; ++i)
+				palloc0(sizeof(TupleConversionMap *) * ptr->num_partitions);
+			for (i = 0; i < ptr->num_partitions; ++i)
 			{
 				cstate->transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(cstate->partitions[i]->ri_RelationDesc),
+					convert_tuples_by_name(RelationGetDescr(ptr->partitions[i]->ri_RelationDesc),
 										   RelationGetDescr(cstate->rel),
 										   gettext_noop("could not convert row type"));
 			}
@@ -2530,7 +2514,7 @@ CopyFrom(CopyState cstate)
 	if ((resultRelInfo->ri_TrigDesc != NULL &&
 		 (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
 		  resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
-		cstate->partition_dispatch_info != NULL ||
+		cstate->partition_tuple_routing != NULL ||
 		cstate->volatile_defexprs)
 	{
 		useHeapMultiInsert = false;
@@ -2605,10 +2589,11 @@ CopyFrom(CopyState cstate)
 		ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 
 		/* Determine the partition to heap_insert the tuple into */
-		if (cstate->partition_dispatch_info)
+		if (cstate->partition_tuple_routing)
 		{
 			int			leaf_part_index;
 			TupleConversionMap *map;
+			PartitionTupleRouting *ptr = cstate->partition_tuple_routing;
 
 			/*
 			 * Away we go ... If we end up not finding a partition after all,
@@ -2619,11 +2604,11 @@ CopyFrom(CopyState cstate)
 			 * partition, respectively.
 			 */
 			leaf_part_index = ExecFindPartition(resultRelInfo,
-												cstate->partition_dispatch_info,
+												ptr->partition_dispatch_info,
 												slot,
 												estate);
 			Assert(leaf_part_index >= 0 &&
-				   leaf_part_index < cstate->num_partitions);
+				   leaf_part_index < ptr->num_partitions);
 
 			/*
 			 * If this tuple is mapped to a partition that is not same as the
@@ -2641,7 +2626,7 @@ CopyFrom(CopyState cstate)
 			 * to the selected partition.
 			 */
 			saved_resultRelInfo = resultRelInfo;
-			resultRelInfo = cstate->partitions[leaf_part_index];
+			resultRelInfo = ptr->partitions[leaf_part_index];
 
 			/* We do not yet have a way to insert into a foreign partition */
 			if (resultRelInfo->ri_FdwRoutine)
@@ -2688,7 +2673,7 @@ CopyFrom(CopyState cstate)
 			 * We might need to convert from the parent rowtype to the
 			 * partition rowtype.
 			 */
-			map = cstate->partition_tupconv_maps[leaf_part_index];
+			map = ptr->partition_tupconv_maps[leaf_part_index];
 			if (map)
 			{
 				Relation	partrel = resultRelInfo->ri_RelationDesc;
@@ -2700,7 +2685,7 @@ CopyFrom(CopyState cstate)
 				 * point on.  Use a dedicated slot from this point on until
 				 * we're finished dealing with the partition.
 				 */
-				slot = cstate->partition_tuple_slot;
+				slot = ptr->partition_tuple_slot;
 				Assert(slot != NULL);
 				ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
 				ExecStoreTuple(tuple, slot, InvalidBuffer, true);
@@ -2852,8 +2837,9 @@ CopyFrom(CopyState cstate)
 	ExecCloseIndices(resultRelInfo);
 
 	/* Close all the partitioned tables, leaf partitions, and their indices */
-	if (cstate->partition_dispatch_info)
+	if (cstate->partition_tuple_routing)
 	{
+		PartitionTupleRouting *ptr = cstate->partition_tuple_routing;
 		int			i;
 
 		/*
@@ -2862,23 +2848,23 @@ CopyFrom(CopyState cstate)
 		 * the main target table of COPY that will be closed eventually by
 		 * DoCopy().  Also, tupslot is NULL for the root partitioned table.
 		 */
-		for (i = 1; i < cstate->num_dispatch; i++)
+		for (i = 1; i < ptr->num_dispatch; i++)
 		{
-			PartitionDispatch pd = cstate->partition_dispatch_info[i];
+			PartitionDispatch pd = ptr->partition_dispatch_info[i];
 
 			heap_close(pd->reldesc, NoLock);
 			ExecDropSingleTupleTableSlot(pd->tupslot);
 		}
-		for (i = 0; i < cstate->num_partitions; i++)
+		for (i = 0; i < ptr->num_partitions; i++)
 		{
-			ResultRelInfo *resultRelInfo = cstate->partitions[i];
+			ResultRelInfo *resultRelInfo = ptr->partitions[i];
 
 			ExecCloseIndices(resultRelInfo);
 			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 		}
 
 		/* Release the standalone partition tuple descriptor */
-		ExecDropSingleTupleTableSlot(cstate->partition_tuple_slot);
+		ExecDropSingleTupleTableSlot(ptr->partition_tuple_slot);
 	}
 
 	/* Close any trigger target relations */
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index d545af2..d10f525 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -42,22 +42,9 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
  * tuple routing for partitioned tables
  *
  * Output arguments:
- * 'pd' receives an array of PartitionDispatch objects with one entry for
- *		every partitioned table in the partition tree
- * 'partitions' receives an array of ResultRelInfo* objects with one entry for
- *		every leaf partition in the partition tree
- * 'tup_conv_maps' receives an array of TupleConversionMap objects with one
- *		entry for every leaf partition (required to convert input tuple based
- *		on the root table's rowtype to a leaf partition's rowtype after tuple
- *		routing is done)
- * 'partition_tuple_slot' receives a standalone TupleTableSlot to be used
- *		to manipulate any given leaf partition's rowtype after that partition
- *		is chosen by tuple-routing.
- * 'num_parted' receives the number of partitioned tables in the partition
- *		tree (= the number of entries in the 'pd' output array)
- * 'num_partitions' receives the number of leaf partitions in the partition
- *		tree (= the number of entries in the 'partitions' and 'tup_conv_maps'
- *		output arrays
+ *
+ * 'partition_tuple_routing' encapsulates all the partition related information
+ *		required to do tuple routing.
  *
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
@@ -67,29 +54,30 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel,
 							   Index resultRTindex,
 							   EState *estate,
-							   PartitionDispatch **pd,
-							   ResultRelInfo ***partitions,
-							   TupleConversionMap ***tup_conv_maps,
-							   TupleTableSlot **partition_tuple_slot,
-							   int *num_parted, int *num_partitions)
+							   PartitionTupleRouting **partition_tuple_routing)
 {
 	TupleDesc	tupDesc = RelationGetDescr(rel);
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
 	ResultRelInfo *leaf_part_rri;
+	PartitionTupleRouting *ptr;
 
 	/*
 	 * Get the information about the partition tree after locking all the
 	 * partitions.
 	 */
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-	*pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
-	*num_partitions = list_length(leaf_parts);
-	*partitions = (ResultRelInfo **) palloc(*num_partitions *
+	ptr = *partition_tuple_routing =
+		(PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
+	ptr->partition_dispatch_info =
+		RelationGetPartitionDispatchInfo(rel, &ptr->num_dispatch, &leaf_parts);
+	ptr->num_partitions = list_length(leaf_parts);
+	ptr->partitions = (ResultRelInfo **) palloc(ptr->num_partitions *
 											sizeof(ResultRelInfo *));
-	*tup_conv_maps = (TupleConversionMap **) palloc0(*num_partitions *
-													 sizeof(TupleConversionMap *));
+	ptr->partition_tupconv_maps =
+		(TupleConversionMap **) palloc0(ptr->num_partitions *
+										sizeof(TupleConversionMap *));
 
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
@@ -97,9 +85,9 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 	 * (such as ModifyTableState) and released when the node finishes
 	 * processing.
 	 */
-	*partition_tuple_slot = MakeTupleTableSlot();
+	ptr->partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = (ResultRelInfo *) palloc0(*num_partitions *
+	leaf_part_rri = (ResultRelInfo *) palloc0(ptr->num_partitions *
 											  sizeof(ResultRelInfo));
 	i = 0;
 	foreach(cell, leaf_parts)
@@ -119,7 +107,7 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 		 * Save a tuple conversion map to convert a tuple routed to this
 		 * partition from the parent's type to the partition's.
 		 */
-		(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
+		ptr->partition_tupconv_maps[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
 													 gettext_noop("could not convert row type"));
 
 		InitResultRelInfo(leaf_part_rri,
@@ -149,7 +137,7 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		(*partitions)[i] = leaf_part_rri++;
+		ptr->partitions[i] = leaf_part_rri++;
 		i++;
 	}
 }
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index afb83ed..d5f2cfb 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -279,32 +279,33 @@ ExecInsert(ModifyTableState *mtstate,
 	resultRelInfo = estate->es_result_relation_info;
 
 	/* Determine the partition to heap_insert the tuple into */
-	if (mtstate->mt_partition_dispatch_info)
+	if (mtstate->mt_partition_tuple_routing)
 	{
 		int			leaf_part_index;
+		PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;
 		TupleConversionMap *map;
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
 		 * ExecFindPartition() does not return and errors out instead.
 		 * Otherwise, the returned value is to be used as an index into arrays
-		 * mt_partitions[] and mt_partition_tupconv_maps[] that will get us
+		 * ptr->partitions[] and ptr->partition_tupconv_maps[] that will get us
 		 * the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
 		leaf_part_index = ExecFindPartition(resultRelInfo,
-											mtstate->mt_partition_dispatch_info,
+											ptr->partition_dispatch_info,
 											slot,
 											estate);
 		Assert(leaf_part_index >= 0 &&
-			   leaf_part_index < mtstate->mt_num_partitions);
+			   leaf_part_index < ptr->num_partitions);
 
 		/*
 		 * Save the old ResultRelInfo and switch to the one corresponding to
 		 * the selected partition.
 		 */
 		saved_resultRelInfo = resultRelInfo;
-		resultRelInfo = mtstate->mt_partitions[leaf_part_index];
+		resultRelInfo = ptr->partitions[leaf_part_index];
 
 		/* We do not yet have a way to insert into a foreign partition */
 		if (resultRelInfo->ri_FdwRoutine)
@@ -352,7 +353,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = mtstate->mt_partition_tupconv_maps[leaf_part_index];
+		map = ptr->partition_tupconv_maps[leaf_part_index];
 		if (map)
 		{
 			Relation	partrel = resultRelInfo->ri_RelationDesc;
@@ -364,7 +365,7 @@ ExecInsert(ModifyTableState *mtstate,
 			 * on, until we're finished dealing with the partition. Use the
 			 * dedicated slot for that.
 			 */
-			slot = mtstate->mt_partition_tuple_slot;
+			slot = ptr->partition_tuple_slot;
 			Assert(slot != NULL);
 			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
 			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
@@ -1500,9 +1501,10 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 		mtstate->mt_oc_transition_capture != NULL)
 	{
 		int			numResultRelInfos;
+		PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;
 
-		numResultRelInfos = (mtstate->mt_partition_tuple_slot != NULL ?
-							 mtstate->mt_num_partitions :
+		numResultRelInfos = (ptr != NULL ?
+							 ptr->num_partitions :
 							 mtstate->mt_nplans);
 
 		/*
@@ -1515,13 +1517,13 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
 
 		/* Choose the right set of partitions */
-		if (mtstate->mt_partition_dispatch_info != NULL)
+		if (ptr != NULL)
 		{
 			/*
 			 * For tuple routing among partitions, we need TupleDescs based on
 			 * the partition routing table.
 			 */
-			ResultRelInfo **resultRelInfos = mtstate->mt_partitions;
+			ResultRelInfo **resultRelInfos = ptr->partitions;
 
 			for (i = 0; i < numResultRelInfos; ++i)
 			{
@@ -1833,6 +1835,8 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	PartitionTupleRouting *ptr = NULL;
+	int			num_partitions = 0;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -1946,28 +1950,14 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	if (operation == CMD_INSERT &&
 		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo **partitions;
-		TupleConversionMap **partition_tupconv_maps;
-		TupleTableSlot *partition_tuple_slot;
-		int			num_parted,
-					num_partitions;
-
 		ExecSetupPartitionTupleRouting(mtstate,
 									   rel,
 									   node->nominalRelation,
 									   estate,
-									   &partition_dispatch_info,
-									   &partitions,
-									   &partition_tupconv_maps,
-									   &partition_tuple_slot,
-									   &num_parted, &num_partitions);
-		mtstate->mt_partition_dispatch_info = partition_dispatch_info;
-		mtstate->mt_num_dispatch = num_parted;
-		mtstate->mt_partitions = partitions;
-		mtstate->mt_num_partitions = num_partitions;
-		mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
-		mtstate->mt_partition_tuple_slot = partition_tuple_slot;
+									   &mtstate->mt_partition_tuple_routing);
+
+		ptr = mtstate->mt_partition_tuple_routing;
+		num_partitions = ptr->num_partitions;
 	}
 
 	/*
@@ -2010,7 +2000,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
 	 * cases are handled above.
 	 */
-	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
+	if (node->withCheckOptionLists != NIL && num_partitions > 0)
 	{
 		List	   *wcoList;
 		PlanState  *plan;
@@ -2027,14 +2017,14 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			   mtstate->mt_nplans == 1);
 		wcoList = linitial(node->withCheckOptionLists);
 		plan = mtstate->mt_plans[0];
-		for (i = 0; i < mtstate->mt_num_partitions; i++)
+		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *mapped_wcoList;
 			List	   *wcoExprs = NIL;
 			ListCell   *ll;
 
-			resultRelInfo = mtstate->mt_partitions[i];
+			resultRelInfo = ptr->partitions[i];
 			partrel = resultRelInfo->ri_RelationDesc;
 
 			/* varno = node->nominalRelation */
@@ -2103,12 +2093,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * are handled above.
 		 */
 		returningList = linitial(node->returningLists);
-		for (i = 0; i < mtstate->mt_num_partitions; i++)
+		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *rlist;
 
-			resultRelInfo = mtstate->mt_partitions[i];
+			resultRelInfo = ptr->partitions[i];
 			partrel = resultRelInfo->ri_RelationDesc;
 
 			/* varno = node->nominalRelation */
@@ -2377,29 +2367,33 @@ ExecEndModifyTable(ModifyTableState *node)
 	/*
 	 * Close all the partitioned tables, leaf partitions, and their indices
 	 *
-	 * Remember node->mt_partition_dispatch_info[0] corresponds to the root
+	 * Remember ptr->partition_dispatch_info[0] corresponds to the root
 	 * partitioned table, which we must not try to close, because it is the
 	 * main target table of the query that will be closed by ExecEndPlan().
 	 * Also, tupslot is NULL for the root partitioned table.
 	 */
-	for (i = 1; i < node->mt_num_dispatch; i++)
+	if (node->mt_partition_tuple_routing)
 	{
-		PartitionDispatch pd = node->mt_partition_dispatch_info[i];
+		PartitionTupleRouting *ptr = node->mt_partition_tuple_routing;
 
-		heap_close(pd->reldesc, NoLock);
-		ExecDropSingleTupleTableSlot(pd->tupslot);
-	}
-	for (i = 0; i < node->mt_num_partitions; i++)
-	{
-		ResultRelInfo *resultRelInfo = node->mt_partitions[i];
+		for (i = 1; i < ptr->num_dispatch; i++)
+		{
+			PartitionDispatch pd = ptr->partition_dispatch_info[i];
 
-		ExecCloseIndices(resultRelInfo);
-		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
-	}
+			heap_close(pd->reldesc, NoLock);
+			ExecDropSingleTupleTableSlot(pd->tupslot);
+		}
+		for (i = 0; i < ptr->num_partitions; i++)
+		{
+			ResultRelInfo *resultRelInfo = ptr->partitions[i];
+			ExecCloseIndices(resultRelInfo);
+			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
+		}
 
-	/* Release the standalone partition tuple descriptor, if any */
-	if (node->mt_partition_tuple_slot)
-		ExecDropSingleTupleTableSlot(node->mt_partition_tuple_slot);
+		/* Release the standalone partition tuple descriptor, if any */
+		if (ptr->partition_tuple_slot)
+			ExecDropSingleTupleTableSlot(ptr->partition_tuple_slot);
+	}
 
 	/*
 	 * Free the exprcontext
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 86a199d..364d89f 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -49,15 +49,45 @@ typedef struct PartitionDispatchData
 
 typedef struct PartitionDispatchData *PartitionDispatch;
 
+/*-----------------------
+ * PartitionTupleRouting - Encapsulates all information required to execute
+ * tuple-routing between partitions.
+ *
+ * partition_dispatch_info		Array of PartitionDispatch objects with one
+ *								entry for every partitioned table in the
+ *								partition tree.
+ * num_dispatch					number of partitioned tables in the partition
+ *								tree (= length of partition_dispatch_info[])
+ * partitions					Array of ResultRelInfo* objects with one entry
+ *								for every leaf partition in the partition tree.
+ * num_partitions				Number of leaf partitions in the partition tree
+ *								(= 'partitions' array length)
+ * partition_tupconv_maps		Array of TupleConversionMap objects with one
+ *								entry for every leaf partition (required to
+ *								convert input tuple based on the root table's
+ *								rowtype to a leaf partition's rowtype after
+ *								tuple routing is done)
+ * partition_tuple_slot			TupleTableSlot to be used to manipulate any
+ *								given leaf partition's rowtype after that
+ *								partition is chosen for insertion by
+ *								tuple-routing.
+ *-----------------------
+ */
+typedef struct PartitionTupleRouting
+{
+	PartitionDispatch *partition_dispatch_info;
+	int			num_dispatch;
+	ResultRelInfo **partitions;
+	int			num_partitions;
+	TupleConversionMap **partition_tupconv_maps;
+	TupleTableSlot *partition_tuple_slot;
+} PartitionTupleRouting;
+
 extern void ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel,
 							   Index resultRTindex,
 							   EState *estate,
-							   PartitionDispatch **pd,
-							   ResultRelInfo ***partitions,
-							   TupleConversionMap ***tup_conv_maps,
-							   TupleTableSlot **partition_tuple_slot,
-							   int *num_parted, int *num_partitions);
+							   PartitionTupleRouting **partition_tuple_routing);
 extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 1a35c5c..613872a 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -977,15 +977,8 @@ typedef struct ModifyTableState
 	TupleTableSlot *mt_existing;	/* slot to store existing target tuple in */
 	List	   *mt_excludedtlist;	/* the excluded pseudo relation's tlist  */
 	TupleTableSlot *mt_conflproj;	/* CONFLICT ... SET ... projection target */
-	struct PartitionDispatchData **mt_partition_dispatch_info;
+	struct PartitionTupleRouting *mt_partition_tuple_routing;
 	/* Tuple-routing support info */
-	int			mt_num_dispatch;	/* Number of entries in the above array */
-	int			mt_num_partitions;	/* Number of members in the following
-									 * arrays */
-	ResultRelInfo **mt_partitions;	/* Per partition result relation pointers */
-	TupleConversionMap **mt_partition_tupconv_maps;
-	/* Per partition tuple conversion map */
-	TupleTableSlot *mt_partition_tuple_slot;
 	struct TransitionCaptureState *mt_transition_capture;
 	/* controls transition table population for specified operation */
 	struct TransitionCaptureState *mt_oc_transition_capture;
update-partition-key_v29.patch (application/octet-stream)
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index b1167a4..9d21f9a 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3005,6 +3005,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition where this row satisfies its partition constraint.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3302,9 +3307,22 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</command> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</command> causes a row to move from one
+       partition to another, there is a chance that another concurrent
+       <command>UPDATE</command> or <command>DELETE</command> misses this row.
+       Suppose, session 1 is performing an <command>UPDATE</command> on a
+       partition key, and meanwhile a concurrent session 2 for which this row
+       is visible, performs an <command>UPDATE</command> or
+       <command>DELETE</command> operation on this row. Session 2 can silently
+       miss the row if the row is deleted from the partition due to session
+       1's activity.  In such case, session 2's
+       <command>UPDATE</command>/<command>DELETE</command>, being unaware of
+       the row movement, interprets that the row has just been deleted so there
+       is nothing to be done for this row. Whereas, in the usual case where the
+       table is not partitioned, or where there is no row movement, session 2
+       would have identified the newly updated row and carried
+       <command>UPDATE</command>/<command>DELETE</command> on this new row
+       version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index c0d0f71..3c665f0 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</command> and
+   <command>INSERT</command> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</command> or <command>DELETE</command> on the
+   same row may miss this row. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index bf5d3f9..aaffc4d 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -154,6 +154,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to move
+    to another partition, it will be performed as a <command>DELETE</command>
+    from the original partition followed by <command>INSERT</command> into the
+    new partition. In this case, all row-level <literal>BEFORE</literal>
+    <command>UPDATE</command> triggers and all row-level
+    <literal>BEFORE</literal> <command>DELETE</command> triggers are fired on
+    the original partition. Then all row-level <literal>BEFORE</literal>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</literal>
+    triggers are concerned, <literal>AFTER</literal> <command>DELETE</command>
+    and <literal>AFTER</literal> <command>INSERT</command> triggers are
+    applied; but <literal>AFTER</literal> <command>UPDATE</command> triggers
+    are not applied because the <command>UPDATE</command> has been converted to
+    a <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index ef156e4..9f98597 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1445,7 +1445,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
 
 /*
  * map_partition_varattnos - maps varattno of any Vars in expr from the
- * parent attno to partition attno.
+ * attno's of 'from_rel' partition to the attno's of 'to_rel' partition.
+ * The rels can be both leaf partition or a partitioned table.
  *
  * We must allow for cases where physical attnos of a partition can be
  * different from the parent's.
@@ -1458,8 +1459,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
  * are working on Lists, so it's less messy to do the casts internally.
  */
 List *
-map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row)
 {
 	bool		my_found_whole_row = false;
@@ -1468,14 +1469,14 @@ map_partition_varattnos(List *expr, int target_varno,
 	{
 		AttrNumber *part_attnos;
 
-		part_attnos = convert_tuples_by_name_map(RelationGetDescr(partrel),
-												 RelationGetDescr(parent),
+		part_attnos = convert_tuples_by_name_map(RelationGetDescr(to_rel),
+												 RelationGetDescr(from_rel),
 												 gettext_noop("could not convert row type"));
 		expr = (List *) map_variable_attnos((Node *) expr,
-											target_varno, 0,
+											fromrel_varno, 0,
 											part_attnos,
-											RelationGetDescr(parent)->natts,
-											RelationGetForm(partrel)->reltype,
+											RelationGetDescr(from_rel)->natts,
+											RelationGetForm(to_rel)->reltype,
 											&my_found_whole_row);
 	}
 
@@ -2598,6 +2599,69 @@ get_partition_for_tuple(Relation relation, Datum *values, bool *isnull)
 }
 
 /*
+ * Checks if any of the 'attnums' is a partition key attribute for rel
+ *
+ * Sets *used_in_expr if any of the 'attnums' is found to be referenced in some
+ * partition key expression.  It's possible for a column to be both used
+ * directly and as part of an expression; if that happens, *used_in_expr may
+ * end up as either true or false.  That's OK for current uses of this
+ * function, because *used_in_expr is only used to tailor the error message
+ * text.
+ */
+bool
+has_partition_attrs(Relation rel, Bitmapset *attnums, bool *used_in_expr)
+{
+	PartitionKey key;
+	int			partnatts;
+	List	   *partexprs;
+	ListCell   *partexprs_item;
+	int			i;
+
+	if (attnums == NULL || rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		return false;
+
+	key = RelationGetPartitionKey(rel);
+	partnatts = get_partition_natts(key);
+	partexprs = get_partition_exprs(key);
+
+	partexprs_item = list_head(partexprs);
+	for (i = 0; i < partnatts; i++)
+	{
+		AttrNumber	partattno = get_partition_col_attnum(key, i);
+
+		if (partattno != 0)
+		{
+			if (bms_is_member(partattno - FirstLowInvalidHeapAttributeNumber,
+							  attnums))
+			{
+				if (used_in_expr)
+					*used_in_expr = false;
+				return true;
+			}
+		}
+		else
+		{
+			/* Arbitrary expression */
+			Node	   *expr = (Node *) lfirst(partexprs_item);
+			Bitmapset  *expr_attrs = NULL;
+
+			/* Find all attributes referenced */
+			pull_varattnos(expr, 1, &expr_attrs);
+			partexprs_item = lnext(partexprs_item);
+
+			if (bms_overlap(attnums, expr_attrs))
+			{
+				if (used_in_expr)
+					*used_in_expr = true;
+				return true;
+			}
+		}
+	}
+
+	return false;
+}
+
+/*
  * qsort_partition_hbound_cmp
  *
  * We sort hash bounds by modulus, then by remainder.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index f1149ed..bb91651 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2673,7 +2673,7 @@ CopyFrom(CopyState cstate)
 			 * We might need to convert from the parent rowtype to the
 			 * partition rowtype.
 			 */
-			map = ptr->partition_tupconv_maps[leaf_part_index];
+			map = ptr->parentchild_tupconv_maps[leaf_part_index];
 			if (map)
 			{
 				Relation	partrel = resultRelInfo->ri_RelationDesc;
@@ -2736,7 +2736,7 @@ CopyFrom(CopyState cstate)
 
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr || check_partition_constr)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index d979ce2..64c2185 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -468,7 +468,6 @@ static void RangeVarCallbackForDropRelation(const RangeVar *rel, Oid relOid,
 								Oid oldRelOid, void *arg);
 static void RangeVarCallbackForAlterRelation(const RangeVar *rv, Oid relid,
 								 Oid oldrelid, void *arg);
-static bool is_partition_attr(Relation rel, AttrNumber attnum, bool *used_in_expr);
 static PartitionSpec *transformPartitionSpec(Relation rel, PartitionSpec *partspec, char *strategy);
 static void ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs,
 					  List **partexprs, Oid *partopclass, Oid *partcollation, char strategy);
@@ -6492,68 +6491,6 @@ ATPrepDropColumn(List **wqueue, Relation rel, bool recurse, bool recursing,
 }
 
 /*
- * Checks if attnum is a partition attribute for rel
- *
- * Sets *used_in_expr if attnum is found to be referenced in some partition
- * key expression.  It's possible for a column to be both used directly and
- * as part of an expression; if that happens, *used_in_expr may end up as
- * either true or false.  That's OK for current uses of this function, because
- * *used_in_expr is only used to tailor the error message text.
- */
-static bool
-is_partition_attr(Relation rel, AttrNumber attnum, bool *used_in_expr)
-{
-	PartitionKey key;
-	int			partnatts;
-	List	   *partexprs;
-	ListCell   *partexprs_item;
-	int			i;
-
-	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
-		return false;
-
-	key = RelationGetPartitionKey(rel);
-	partnatts = get_partition_natts(key);
-	partexprs = get_partition_exprs(key);
-
-	partexprs_item = list_head(partexprs);
-	for (i = 0; i < partnatts; i++)
-	{
-		AttrNumber	partattno = get_partition_col_attnum(key, i);
-
-		if (partattno != 0)
-		{
-			if (attnum == partattno)
-			{
-				if (used_in_expr)
-					*used_in_expr = false;
-				return true;
-			}
-		}
-		else
-		{
-			/* Arbitrary expression */
-			Node	   *expr = (Node *) lfirst(partexprs_item);
-			Bitmapset  *expr_attrs = NULL;
-
-			/* Find all attributes referenced */
-			pull_varattnos(expr, 1, &expr_attrs);
-			partexprs_item = lnext(partexprs_item);
-
-			if (bms_is_member(attnum - FirstLowInvalidHeapAttributeNumber,
-							  expr_attrs))
-			{
-				if (used_in_expr)
-					*used_in_expr = true;
-				return true;
-			}
-		}
-	}
-
-	return false;
-}
-
-/*
  * Return value is the address of the dropped column.
  */
 static ObjectAddress
@@ -6613,7 +6550,9 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
 						colName)));
 
 	/* Don't drop columns used in the partition key */
-	if (is_partition_attr(rel, attnum, &is_expr))
+	if (has_partition_attrs(rel,
+							bms_make_singleton(attnum - FirstLowInvalidHeapAttributeNumber),
+							&is_expr))
 	{
 		if (!is_expr)
 			ereport(ERROR,
@@ -8837,7 +8776,9 @@ ATPrepAlterColumnType(List **wqueue,
 						colName)));
 
 	/* Don't alter columns used in the partition key */
-	if (is_partition_attr(rel, attnum, &is_expr))
+	if (has_partition_attrs(rel,
+							bms_make_singleton(attnum - FirstLowInvalidHeapAttributeNumber),
+							&is_expr))
 	{
 		if (!is_expr)
 			ereport(ERROR,
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 92ae382..73ec872 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2854,8 +2854,13 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 	{
 		HeapTuple	trigtuple;
 
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
+		/*
+		 * Note: if the UPDATE is converted into a DELETE+INSERT as part of
+		 * update-partition-key operation, then this function is also called
+		 * separately for DELETE and INSERT to capture transition table rows.
+		 * In such case, either old tuple or new tuple can be NULL.
+		 */
+		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
 			trigtuple = GetTupleForTrigger(estate,
 										   NULL,
 										   relinfo,
@@ -5414,7 +5419,12 @@ AfterTriggerPendingOnRel(Oid relid)
  *	triggers actually need to be queued.  It is also called after each row,
  *	even if there are no triggers for that event, if there are any AFTER
  *	STATEMENT triggers for the statement which use transition tables, so that
- *	the transition tuplestores can be built.
+ *	the transition tuplestores can be built.  Furthermore, if the transition
+ *	capture is happening for UPDATEd rows being moved to another partition
+ *	due to a partition-key change, then this function is called once when the
+ *	row is deleted (to capture the OLD row), and once when the row is
+ *	inserted into another partition (to capture the NEW row).  This is done
+ *	separately because DELETE and
+ *	INSERT happen on different tables.
  *
  *	Transition tuplestores are built now, rather than when events are pulled
  *	off of the queue because AFTER ROW triggers are allowed to select from the
@@ -5463,12 +5473,25 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		bool		update_new_table = transition_capture->tcs_update_new_table;
 		bool		insert_new_table = transition_capture->tcs_insert_new_table;
 
-		if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_old_table))
+		/*
+		 * For INSERT events newtup should be non-NULL, for DELETE events
+		 * oldtup should be non-NULL, whereas for UPDATE events normally both
+		 * oldtup and newtup are non-NULL.  But for an UPDATE event fired to
+		 * capture transition tuples during UPDATE partition-key row
+		 * movement, oldtup is NULL when the event is for the row being
+		 * inserted, whereas newtup is NULL when the event is for the row
+		 * being deleted.
+		 */
+		Assert(!(event == TRIGGER_EVENT_DELETE && delete_old_table &&
+				 oldtup == NULL));
+		Assert(!(event == TRIGGER_EVENT_INSERT && insert_new_table &&
+				 newtup == NULL));
+
+		if (oldtup != NULL &&
+			((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_old_table)))
 		{
 			Tuplestorestate *old_tuplestore;
 
-			Assert(oldtup != NULL);
 			old_tuplestore = transition_capture->tcs_private->old_tuplestore;
 
 			if (map != NULL)
@@ -5481,12 +5504,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			else
 				tuplestore_puttuple(old_tuplestore, oldtup);
 		}
-		if ((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_new_table))
+		if (newtup != NULL &&
+			((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_new_table)))
 		{
 			Tuplestorestate *new_tuplestore;
 
-			Assert(newtup != NULL);
 			new_tuplestore = transition_capture->tcs_private->new_tuplestore;
 
 			if (original_insert_tuple != NULL)
@@ -5502,11 +5525,17 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				tuplestore_puttuple(new_tuplestore, newtup);
 		}
 
-		/* If transition tables are the only reason we're here, return. */
+		/*
+		 * If transition tables are the only reason we're here, return.  As
+		 * mentioned above, we can also be here during update tuple routing in
+		 * the presence of transition tables, in which case this function is
+		 * called separately for oldtup and newtup, so either one can be NULL,
+		 * but not both.
+		 */
 		if (trigdesc == NULL ||
 			(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
 			(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+			(event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
 			return;
 	}
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index dbaa47f..5ec92d5 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1849,16 +1849,12 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  *
  * Exported in executor.h for outside use.
+ * Returns true if it meets the partition constraint, else returns false.
  */
-void
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
-	Relation	rel = resultRelInfo->ri_RelationDesc;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	Bitmapset  *modifiedCols;
-	Bitmapset  *insertedCols;
-	Bitmapset  *updatedCols;
 	ExprContext *econtext;
 
 	/*
@@ -1886,52 +1882,69 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
 	 */
-	if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+	return ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext);
+}
+
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
+	/*
+	 * Need to first convert the tuple to the root partitioned table's row
+	 * type. For details, check similar comments in ExecConstraints().
+	 */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
+									 gettext_noop("could not convert row type"));
+		if (map != NULL)
 		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
-										 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecSetSlotDescriptor(slot, tupdesc);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+			tuple = do_convert_tuple(tuple, map);
+			ExecSetSlotDescriptor(slot, tupdesc);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("new row for relation \"%s\" violates partition constraint",
-						RelationGetRelationName(orig_rel)),
-				 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+			 errmsg("new row for relation \"%s\" violates partition constraint",
+					RelationGetRelationName(orig_rel)),
+			 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if
+ * requested, checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1939,7 +1952,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2055,8 +2069,9 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 }
 
 
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index d10f525..418d3ff 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -41,6 +41,13 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
  * ExecSetupPartitionTupleRouting - set up information needed during
  * tuple routing for partitioned tables
  *
+ * 'update_rri' contains the UPDATE per-subplan result rels. For the output
+ *		param 'partitions', we don't allocate new ResultRelInfo objects for
+ *		leaf partitions for which they are already available in 'update_rri'.
+ *
+ * 'num_update_rri' is the number of elements in the 'update_rri' array, or
+ *		zero for INSERT.
+ *
  * Output arguments:
  *
  * 'partition_tuple_routing' encapsulates all the partition related information
@@ -60,9 +67,23 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
+	ResultRelInfo *leaf_part_arr = NULL,
+				  *update_rri = NULL;
+	int			num_update_rri = 0,
+				update_rri_index = 0;
+	bool		is_update = false;
 	PartitionTupleRouting *ptr;
 
+	/* Initialization specific to update */
+	if (mtstate && mtstate->operation == CMD_UPDATE)
+	{
+		ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+
+		is_update = true;
+		update_rri = mtstate->resultRelInfo;
+		num_update_rri = list_length(node->plans);
+	}
+
 	/*
 	 * Get the information about the partition tree after locking all the
 	 * partitions.
@@ -75,10 +96,48 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 	ptr->num_partitions = list_length(leaf_parts);
 	ptr->partitions = (ResultRelInfo **) palloc(ptr->num_partitions *
 											sizeof(ResultRelInfo *));
-	ptr->partition_tupconv_maps =
+	ptr->parentchild_tupconv_maps =
 		(TupleConversionMap **) palloc0(ptr->num_partitions *
 										sizeof(TupleConversionMap *));
 
+	if (is_update)
+	{
+		/*
+		 * For updates, if a leaf partition is already present in the
+		 * per-subplan result rels, we re-use it rather than initializing a
+		 * new result rel.  The per-subplan result rels and the leaf partition
+		 * result rels are both in the same canonical order, so while scanning
+		 * the leaf partition oids we only need to track the next per-subplan
+		 * result rel to look for.  Start update_rri_index at the first
+		 * per-subplan result rel, and advance it each time we find a match.
+		 */
+		update_rri_index = 0;
+
+		/*
+		 * Prepare for generating the mapping from subplan result rels to leaf
+		 * partition position.
+		 */
+		ptr->subplan_partition_offsets = palloc(num_update_rri * sizeof(int));
+
+		/*
+		 * For UPDATEs, we require an additional tuple slot for storing
+		 * transient tuples that are converted to the root table descriptor.
+		 */
+		ptr->root_tuple_slot = MakeTupleTableSlot();
+	}
+	else
+	{
+		/*
+		 * For inserts, we need to create all new result rels, so avoid
+		 * repeated pallocs by allocating memory for all the result rels in
+		 * bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(ptr->num_partitions *
+												  sizeof(ResultRelInfo));
+	}
+
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
 	 * given partition's rowtype.  It is attached to the caller-specified node
@@ -87,37 +146,80 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 	 */
 	ptr->partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = (ResultRelInfo *) palloc0(ptr->num_partitions *
-											  sizeof(ResultRelInfo));
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (is_update)
+		{
+			/* Is this leaf partition present in the update resultrel? */
+			if (update_rri_index < num_update_rri &&
+				RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
+			{
+				leaf_part_rri = &update_rri[update_rri_index];
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required when we convert the partition's tuple to be
+				 * compatible with the root partitioned table's tuple
+				 * descriptor.  When generating the per-subplan UPDATE result
+				 * rels, this was not set.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+
+				/*
+				 * Save the position of this update rel in the leaf partitions
+				 * array
+				 */
+				ptr->subplan_partition_offsets[update_rri_index] = i;
+
+				update_rri_index++;
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = leaf_part_arr + i;
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in *partitions are eventually
-		 * closed by the caller.
+		 * If we didn't open the partition rel, it means we haven't initialized
+		 * the result rel either.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above including the leaf
+			 * partitions. Note that each of the newly opened relations in
+			 * *partitions are eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri,
+							  partrel,
+							  resultRTindex,
+							  rel,
+							  estate->es_instrument);
+		}
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
 		 * Save a tuple conversion map to convert a tuple routed to this
 		 * partition from the parent's type to the partition's.
 		 */
-		ptr->partition_tupconv_maps[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
+		ptr->parentchild_tupconv_maps[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
 													 gettext_noop("could not convert row type"));
 
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  resultRTindex,
-						  rel,
-						  estate->es_instrument);
-
 		/*
-		 * Verify result relation is a valid target for INSERT.
+		 * Verify result relation is a valid target for an insert operation.
+		 * Even for updates, this is done as part of tuple routing, so the
+		 * relation must still be checked as an insert target.
 		 */
 		CheckValidResultRel(leaf_part_rri, CMD_INSERT);
 
@@ -137,9 +239,15 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		ptr->partitions[i] = leaf_part_rri++;
+		ptr->partitions[i] = leaf_part_rri;
 		i++;
 	}
+
+	/*
+	 * For UPDATE, we should have found all the per-subplan resultrels in the
+	 * leaf partitions.
+	 */
+	Assert(!is_update || update_rri_index == num_update_rri);
 }
 
 /*
@@ -170,8 +278,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 	/* start with the root partitioned table */
 	parent = pd[0];
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index bd786a1..995c54c 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -401,7 +401,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -466,7 +466,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index d5f2cfb..246d759 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -46,6 +46,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -63,7 +64,16 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 EState *estate,
 					 bool canSetTag,
 					 TupleTableSlot **returning);
-
+static void ExecSetupChildParentMap(ModifyTableState *mtstate,
+						ResultRelInfo *rootRelInfo,
+						int numResultRelInfos, bool perleaf);
+static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
+													int whichplan);
+static HeapTuple ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+										   TupleConversionMap *map,
+										   HeapTuple tuple,
+										   TupleTableSlot *new_slot,
+										   TupleTableSlot **p_my_slot);
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
  * target relation's rowtype
@@ -241,6 +251,38 @@ ExecCheckTIDVisible(EState *estate,
 	ReleaseBuffer(buffer);
 }
 
+/*
+ * ConvertPartitionTupleSlot -- convenience function for tuple conversion using
+ * 'map'. The tuple, if converted, is stored in 'new_slot', and 'p_my_slot' is
+ * updated with the 'new_slot'. 'new_slot' typically should be one of the
+ * dedicated partition tuple slots. If map is NULL, keeps p_my_slot unchanged.
+ *
+ * Returns the converted tuple, unless map is NULL, in which case the
+ * original tuple is returned unmodified.
+ */
+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+						  TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot,
+						  TupleTableSlot **p_my_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor, as per converted tuple.
+	 */
+	*p_my_slot = new_slot;
+	Assert(new_slot != NULL);
+	ExecSetSlotDescriptor(new_slot, map->outdesc);
+	ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
 /* ----------------------------------------------------------------
  *		ExecInsert
  *
@@ -266,6 +308,9 @@ ExecInsert(ModifyTableState *mtstate,
 	Oid			newId;
 	List	   *recheckIndexes = NIL;
 	TupleTableSlot *result = NULL;
+	TransitionCaptureState *transition_capture;
+
+	transition_capture = mtstate->mt_transition_capture;
 
 	/*
 	 * get the heap tuple out of the tuple table slot, making sure we have a
@@ -283,14 +328,13 @@ ExecInsert(ModifyTableState *mtstate,
 	{
 		int			leaf_part_index;
 		PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;
-		TupleConversionMap *map;
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
 		 * ExecFindPartition() does not return and errors out instead.
 		 * Otherwise, the returned value is to be used as an index into arrays
-		 * ptr->partitions[] and ptr->partition_tupconv_maps[] that will get us
-		 * the ResultRelInfo and TupleConversionMap for the partition,
+		 * ptr->partitions[] and ptr->parentchild_tupconv_maps[] that will get
+		 * us the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
 		leaf_part_index = ExecFindPartition(resultRelInfo,
@@ -332,8 +376,10 @@ ExecInsert(ModifyTableState *mtstate,
 				 * back to tuplestore format.
 				 */
 				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+
+				Assert(mtstate->mt_is_tupconv_perpart);
 				mtstate->mt_transition_capture->tcs_map =
-					mtstate->mt_transition_tupconv_maps[leaf_part_index];
+					mtstate->mt_childparent_tupconv_maps[leaf_part_index];
 			}
 			else
 			{
@@ -346,30 +392,21 @@ ExecInsert(ModifyTableState *mtstate,
 			}
 		}
 		if (mtstate->mt_oc_transition_capture != NULL)
+		{
+			Assert(mtstate->mt_is_tupconv_perpart);
 			mtstate->mt_oc_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[leaf_part_index];
+				mtstate->mt_childparent_tupconv_maps[leaf_part_index];
+		}
 
 		/*
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = ptr->partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = ptr->partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(mtstate,
+										  ptr->parentchild_tupconv_maps[leaf_part_index],
+										  tuple,
+										  ptr->partition_tuple_slot,
+										  &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -450,6 +487,7 @@ ExecInsert(ModifyTableState *mtstate,
 	}
 	else
 	{
+		WCOKind		wco_kind;
+
 		/*
 		 * We always check the partition constraint, including when the tuple
 		 * got here via tuple-routing.  However we don't need to in the latter
@@ -467,14 +505,21 @@ ExecInsert(ModifyTableState *mtstate,
 		tuple->t_tableOid = RelationGetRelid(resultRelationDesc);
 
 		/*
-		 * Check any RLS INSERT WITH CHECK policies
+		 * Check any RLS WITH CHECK policies.
 		 *
+		 * Normally we should check INSERT policies. But if the insert is part
+		 * of update-row-movement, we should instead check UPDATE policies,
+		 * because we are executing policies defined on the target table, and
+		 * not those defined on the child partitions.
+		 */
+		wco_kind = (mtstate->operation == CMD_UPDATE) ?
+					WCO_RLS_UPDATE_CHECK : WCO_RLS_INSERT_CHECK;
+		/*
 		 * ExecWithCheckOptions() will skip any WCOs which are not of the kind
 		 * we are looking for at this point.
 		 */
 		if (resultRelInfo->ri_WithCheckOptions != NIL)
-			ExecWithCheckOptions(WCO_RLS_INSERT_CHECK,
-								 resultRelInfo, slot, estate);
+			ExecWithCheckOptions(wco_kind, resultRelInfo, slot, estate);
 
 		/*
 		 * No need though if the tuple has been routed, and a BR trigger
@@ -487,7 +532,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 		/* Check the constraints of the tuple */
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -623,9 +668,32 @@ ExecInsert(ModifyTableState *mtstate,
 		setLastTid(&(tuple->t_self));
 	}
 
+	/*
+	 * If this INSERT is part of a partition-key-UPDATE and we are capturing
+	 * transition tables, put this row into the transition NEW TABLE.
+	 * (Similarly, the deleted row is added to the OLD TABLE.)  We need to do
+	 * this separately for DELETE and INSERT because they happen on different
+	 * tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_new_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo, NULL,
+							 NULL,
+							 tuple,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * Now that we have captured the NEW TABLE row, prevent the AR INSERT
+		 * trigger below from capturing it again.
+		 */
+		transition_capture = NULL;
+	}
+
 	/* AFTER ROW INSERT Triggers */
 	ExecARInsertTriggers(estate, resultRelInfo, tuple, recheckIndexes,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
 	list_free(recheckIndexes);
 
@@ -679,6 +747,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool *tuple_deleted,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -686,6 +756,12 @@ ExecDelete(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
+	TransitionCaptureState *transition_capture;
+
+	transition_capture = mtstate->mt_transition_capture;
+
+	if (tuple_deleted)
+		*tuple_deleted = false;
 
 	/*
 	 * get information on the (current) result relation
@@ -850,12 +926,39 @@ ldelete:;
 	if (canSetTag)
 		(estate->es_processed)++;
 
+	/* The delete has actually happened, so inform that to the caller */
+	if (tuple_deleted)
+		*tuple_deleted = true;
+
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition OLD TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_old_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo,
+							 tupleid,
+							 oldtuple,
+							 NULL,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * Now that we have captured the OLD TABLE row, prevent the AR DELETE
+		 * trigger below from capturing it again.
+		 */
+		transition_capture = NULL;
+	}
+
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -948,6 +1051,7 @@ ExecUpdate(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	TupleConversionMap *saved_tcs_map = NULL;
 
 	/*
 	 * abort the operation if not running transactions
@@ -1019,6 +1123,7 @@ ExecUpdate(ModifyTableState *mtstate,
 	else
 	{
 		LockTupleMode lockmode;
+		bool		partition_constraint_failed;
 
 		/*
 		 * Constraints might reference the tableoid column, so initialize
@@ -1034,22 +1139,142 @@ ExecUpdate(ModifyTableState *mtstate,
 		 * (We don't need to redo triggers, however.  If there are any BEFORE
 		 * triggers then trigger.c will have done heap_lock_tuple to lock the
 		 * correct tuple, so there's no need to do them again.)
-		 *
-		 * ExecWithCheckOptions() will skip any WCOs which are not of the kind
-		 * we are looking for at this point.
 		 */
 lreplace:;
-		if (resultRelInfo->ri_WithCheckOptions != NIL)
+
+		/*
+		 * If the partition constraint fails, this row might get moved to
+		 * another partition, in which case we should check the RLS CHECK
+		 * policy just before inserting into the new partition rather than
+		 * here, since a trigger on the new partition might change the row
+		 * again.  So skip the WCO checks if the partition constraint fails.
+		 */
+		partition_constraint_failed =
+			resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate);
+
+		if (!partition_constraint_failed &&
+			resultRelInfo->ri_WithCheckOptions != NIL)
+		{
+			/*
+			 * ExecWithCheckOptions() will skip any WCOs which are not of the
+			 * kind we are looking for at this point.
+			 */
 			ExecWithCheckOptions(WCO_RLS_UPDATE_CHECK,
 								 resultRelInfo, slot, estate);
+		}
+
+		/*
+		 * If a partition check failed, try to move the row into the right
+		 * partition.
+		 */
+		if (partition_constraint_failed)
+		{
+			bool		tuple_deleted;
+			TupleTableSlot *ret_slot;
+			PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;
+			int			map_index;
+			TupleConversionMap *tupconv_map;
+
+			/*
+			 * When an UPDATE is run directly on a leaf partition, we will not
+			 * have partition tuple routing set up.  In that case, fail with a
+			 * partition constraint violation error.
+			 */
+			if (ptr == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for the DELETE; we want to return
+			 * rows from the INSERT instead.
+			 */
+			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
+					   &tuple_deleted, false, false);
+
+			/*
+			 * If the DELETE didn't happen for some reason (e.g. a trigger
+			 * prevented it, or the row was already deleted by this command or
+			 * concurrently by another transaction), then skip the INSERT as
+			 * well; otherwise, one extra row would effectively be inserted.
+			 *
+			 * For a normal UPDATE, the case where the tuple has been the
+			 * subject of a concurrent UPDATE or DELETE would be handled by
+			 * the EvalPlanQual machinery, but for an UPDATE that we've
+			 * translated into a DELETE from this partition and an INSERT into
+			 * some other partition, that's not available, because CTID chains
+			 * can't span relation boundaries.  We mimic the semantics to a
+			 * limited extent by skipping the INSERT if the DELETE fails to
+			 * find a tuple. This ensures that two concurrent attempts to
+			 * UPDATE the same tuple at the same time can't turn one tuple
+			 * into two, and that an UPDATE of a just-deleted tuple can't
+			 * resurrect it.
+			 */
+			if (!tuple_deleted)
+				return NULL;
+
+			/*
+			 * UPDATEs set the transition capture map only when a new subplan
+			 * is chosen, but INSERTs set it for each row.  So after the
+			 * INSERT, we need to revert to the map created for the UPDATE;
+			 * otherwise the next UPDATE will incorrectly use the one created
+			 * for the INSERT.  So first save the one created for the UPDATE.
+			 */
+			if (mtstate->mt_transition_capture)
+				saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
+
+			/*
+			 * resultRelInfo is one of the per-subplan resultRelInfos.  So we
+			 * should convert the tuple into the root's tuple descriptor, since
+			 * ExecInsert() starts the search from the root.  The tuple
+			 * conversion map list is in the order of mtstate->resultRelInfo[],
+			 * so to retrieve the one for this resultRel, we need to know the
+			 * position of the resultRel in mtstate->resultRelInfo[].
+			 */
+			map_index = resultRelInfo - mtstate->resultRelInfo;
+			Assert(map_index >= 0 && map_index < mtstate->mt_nplans);
+			tupconv_map = tupconv_map_for_subplan(mtstate, map_index);
+			tuple = ConvertPartitionTupleSlot(mtstate,
+											  tupconv_map,
+											  tuple,
+											  ptr->root_tuple_slot,
+											  &slot);
+
+			/*
+			 * For ExecInsert(), make it look like we are inserting into the
+			 * root.
+			 */
+			Assert(mtstate->rootResultRelInfo != NULL);
+			estate->es_result_relation_info = mtstate->rootResultRelInfo;
+
+			ret_slot = ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+
+			/*
+			 * Restore the active result relation and the active transition
+			 * capture map that we changed above.
+			 */
+			estate->es_result_relation_info = resultRelInfo;
+			if (mtstate->mt_transition_capture)
+			{
+				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+				mtstate->mt_transition_capture->tcs_map = saved_tcs_map;
+			}
+			return ret_slot;
+		}
 
 		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We've already checked the partition constraint above; however, we
+		 * must still ensure the tuple passes all other constraints, so we will
+		 * call ExecConstraints() and have it validate all remaining checks.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1477,7 +1702,6 @@ static void
 ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 {
 	ResultRelInfo *targetRelInfo = getASTriggerResultRelInfo(mtstate);
-	int			i;
 
 	/* Check for transition tables on the directly targeted relation. */
 	mtstate->mt_transition_capture =
@@ -1507,55 +1731,142 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 							 ptr->num_partitions :
 							 mtstate->mt_nplans);
 
+		ExecSetupChildParentMap(mtstate, targetRelInfo, numResultRelInfos,
+								(ptr != NULL));
+
+		/*
+		 * Install the conversion map for the first plan for UPDATE and DELETE
+		 * operations.  It will be advanced each time we switch to the next
+		 * plan.  (INSERT operations set it every time, so we need not update
+		 * mtstate->mt_oc_transition_capture here.)
+		 */
+		if (mtstate->mt_transition_capture && mtstate->operation != CMD_INSERT)
+			mtstate->mt_transition_capture->tcs_map =
+				tupconv_map_for_subplan(mtstate, 0);
+	}
+}
+
+/*
+ * Initialize the child-to-root tuple conversion map array.
+ *
+ * This map array is required for two purposes:
+ * 1. Update tuple routing: we need to convert the tuple from a subplan
+ * result rel to the root partitioned table's descriptor.
+ * 2. Capturing transition tables that are partitions: for UPDATEs, we need
+ * to convert the tuple from a subplan result rel to the target table's
+ * descriptor, and for INSERTs, from the leaf partition to the target
+ * table's descriptor.
+ *
+ * The caller can request either a per-subplan map or per-leaf-partition map.
+ */
+static void
+ExecSetupChildParentMap(ModifyTableState *mtstate,
+						ResultRelInfo *rootRelInfo,
+						int numResultRelInfos, bool perleaf)
+{
+	TupleDesc	outdesc;
+	int			i;
+
+	/* First check if there is already one */
+	if (mtstate->mt_childparent_tupconv_maps)
+	{
 		/*
-		 * Build array of conversion maps from each child's TupleDesc to the
-		 * one used in the tuplestore.  The map pointers may be NULL when no
-		 * conversion is necessary, which is hopefully a common case for
-		 * partitions.
+		 * If a per-leaf map is required and a map has already been created,
+		 * that existing map must be per-leaf.  If it were per-subplan, we
+		 * would not be able to access the maps leaf-partition-wise.  But if
+		 * the map is per-leaf, we can still access the maps subplan-wise via
+		 * the subplan_partition_offsets array, using
+		 * tupconv_map_for_subplan().  So if callers might need to access the
+		 * map both leaf-partition-wise and subplan-wise, they must make sure
+		 * that the first time this function is called, it is called with
+		 * perleaf=true, so that the map created is per-leaf rather than
+		 * per-subplan.
 		 */
-		mtstate->mt_transition_tupconv_maps = (TupleConversionMap **)
-			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+		Assert(!(perleaf && !mtstate->mt_is_tupconv_perpart));
+		return;
+	}
 
-		/* Choose the right set of partitions */
-		if (ptr != NULL)
-		{
-			/*
-			 * For tuple routing among partitions, we need TupleDescs based on
-			 * the partition routing table.
-			 */
-			ResultRelInfo **resultRelInfos = ptr->partitions;
+	/* Get tuple descriptor of the root partitioned table. */
+	outdesc = RelationGetDescr(rootRelInfo->ri_RelationDesc);
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
-		}
-		else
-		{
-			/* Otherwise we need the ResultRelInfo for each subplan. */
-			ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+	/*
+	 * Build array of conversion maps from each child's TupleDesc to the
+	 * one used in the tuplestore.  The map pointers may be NULL when no
+	 * conversion is necessary, which is hopefully a common case for
+	 * partitions.
+	 */
+	mtstate->mt_childparent_tupconv_maps = (TupleConversionMap **)
+		palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
+	/* Choose the right set of partitions */
+	if (perleaf)
+	{
+		/*
+		 * For tuple routing among partitions, we need TupleDescs based on the
+		 * partition routing table.
+		 */
+		ResultRelInfo **resultRelInfos;
+
+		Assert(mtstate->mt_partition_tuple_routing != NULL);
+		resultRelInfos = mtstate->mt_partition_tuple_routing->partitions;
+
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_childparent_tupconv_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
+									   outdesc,
+									   gettext_noop("could not convert row type"));
 		}
 
 		/*
-		 * Install the conversion map for the first plan for UPDATE and DELETE
-		 * operations.  It will be advanced each time we switch to the next
-		 * plan.  (INSERT operations set it every time, so we need not update
-		 * mtstate->mt_oc_transition_capture here.)
+		 * Remember that the tuple conversion map is per-leaf, not
+		 * per-subplan.
 		 */
-		if (mtstate->mt_transition_capture)
-			mtstate->mt_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[0];
+		mtstate->mt_is_tupconv_perpart = true;
+	}
+	else
+	{
+		/* Otherwise we need the ResultRelInfo for each subplan. */
+		ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_childparent_tupconv_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
+									   outdesc,
+									   gettext_noop("could not convert row type"));
+		}
+	}
+}
+
+/*
+ * For a given subplan index, get the tuple conversion map.
+ */
+static TupleConversionMap *
+tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
+{
+	Assert(mtstate->mt_childparent_tupconv_maps != NULL);
+
+	/*
+	 * If the tuple conversion map array is per-partition, we need to first get
+	 * the index into the partition array.
+	 */
+	if (mtstate->mt_is_tupconv_perpart)
+	{
+		int leaf_index;
+		PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;
+
+		Assert(ptr && ptr->subplan_partition_offsets != NULL);
+		leaf_index = ptr->subplan_partition_offsets[whichplan];
+
+		Assert(leaf_index >= 0 && leaf_index < ptr->num_partitions);
+		return mtstate->mt_childparent_tupconv_maps[leaf_index];
+	}
+	else
+	{
+		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+		return mtstate->mt_childparent_tupconv_maps[whichplan];
 	}
 }
 
@@ -1662,15 +1973,13 @@ ExecModifyTable(PlanState *pstate)
 				/* Prepare to convert transition tuples from this child. */
 				if (node->mt_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
 					node->mt_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						tupconv_map_for_subplan(node, node->mt_whichplan);
 				}
 				if (node->mt_oc_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
 					node->mt_oc_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						tupconv_map_for_subplan(node, node->mt_whichplan);
 				}
 				continue;
 			}
@@ -1787,7 +2096,8 @@ ExecModifyTable(PlanState *pstate)
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1832,9 +2142,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ResultRelInfo *resultRelInfo;
 	TupleDesc	tupDesc;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	bool		update_tuple_routing_needed = node->partKeyUpdated;
 	PartitionTupleRouting *ptr = NULL;
 	int			num_partitions = 0;
 
@@ -1909,6 +2222,16 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(resultRelInfo, mtstate->mt_onconflict != ONCONFLICT_NONE);
 
+		/*
+		 * If this is an UPDATE and a BEFORE UPDATE trigger is present, the
+		 * trigger itself might modify the partition-key values, so we must
+		 * arrange for tuple routing.
+		 */
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+			operation == CMD_UPDATE)
+			update_tuple_routing_needed = true;
+
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
 		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
@@ -1946,9 +2269,19 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/*
+	 * If it's not a partitioned table after all, UPDATE tuple routing should
+	 * not be attempted.
+	 */
+	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		update_tuple_routing_needed = false;
+
+	/*
+	 * Build state for tuple routing if it's an INSERT or if it's an UPDATE of
+	 * partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || update_tuple_routing_needed))
 	{
 		ExecSetupPartitionTupleRouting(mtstate,
 									   rel,
@@ -1958,6 +2291,13 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 
 		ptr = mtstate->mt_partition_tuple_routing;
 		num_partitions = ptr->num_partitions;
+
+		/*
+		 * These are required as reference objects for mapping partition
+		 * attnos in expressions such as WithCheckOptions and RETURNING.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	}
 
 	/*
@@ -1968,6 +2308,18 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		ExecSetupTransitionCaptureState(mtstate, estate);
 
 	/*
+	 * Construct a mapping from each per-subplan partition attno to the root
+	 * attno.  This is required when, during update row movement, a source
+	 * partition's tuple descriptor does not match the root partitioned
+	 * table's; the tuple must then be converted to the root descriptor,
+	 * because the search for the destination partition starts from the
+	 * root.  Skip this setup if it's not a partition key update.
+	 */
+	if (update_tuple_routing_needed)
+		ExecSetupChildParentMap(mtstate, getASTriggerResultRelInfo(mtstate),
+								mtstate->mt_nplans, false);
+
+	/*
 	 * Initialize any WITH CHECK OPTION constraints if needed.
 	 */
 	resultRelInfo = mtstate->resultRelInfo;
@@ -1997,26 +2349,29 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE row
+	 * movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *first_wcoList;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition.  For UPDATE, on the other hand, there are as many WCO
+		 * lists as there are plans.  In either case, use the WCO expression
+		 * of the first resultRelInfo as a reference to calculate the attnos
+		 * for the WCO expression of each partition.  We make a copy of the
+		 * WCO qual for each partition.  Note that, if there are SubPlans in
+		 * there, they all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
-			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
+		Assert(update_tuple_routing_needed ||
+			   (operation == CMD_INSERT &&
+				list_length(node->withCheckOptionLists) == 1 &&
+				mtstate->mt_nplans == 1));
+
+		first_wcoList = linitial(node->withCheckOptionLists);
 		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
@@ -2025,17 +2380,26 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			ListCell   *ll;
 
 			resultRelInfo = ptr->partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the update
+			 * result rels, that result rel would already have WithCheckOptions
+			 * initialized.
+			 */
+			if (resultRelInfo->ri_WithCheckOptions)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel, NULL);
+			mapped_wcoList = map_partition_varattnos(first_wcoList,
+													firstVarno,
+													partrel, firstResultRel,
+													NULL);
 			foreach(ll, mapped_wcoList)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
@@ -2052,7 +2416,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -2089,22 +2453,35 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *rlist;
 
 			resultRelInfo = ptr->partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the update
+			 * result rels, that result rel would already have a returningList
+			 * built.
+			 */
+			if (resultRelInfo->ri_projectReturning)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel, NULL);
+			/*
+			 * Use the returning expression of the first resultRelInfo as a
+			 * reference to calculate attno's for the returning expression of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList,
+											firstVarno,
+											partrel, firstResultRel, NULL);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
@@ -2349,6 +2726,7 @@ void
 ExecEndModifyTable(ModifyTableState *node)
 {
 	int			i;
+	CmdType		operation = node->operation;
 
 	/*
 	 * Allow any FDWs to shut down
@@ -2386,11 +2764,24 @@ ExecEndModifyTable(ModifyTableState *node)
 		for (i = 0; i < ptr->num_partitions; i++)
 		{
 			ResultRelInfo *resultRelInfo = ptr->partitions[i];
+
+			/*
+			 * If this result rel is one of the subplan result rels, let
+			 * ExecEndPlan() close it. For INSERTs, this does not apply because
+			 * leaf partition result rels are always newly allocated.
+			 */
+			if (operation == CMD_UPDATE &&
+				resultRelInfo >= node->resultRelInfo &&
+				resultRelInfo < node->resultRelInfo + node->mt_nplans)
+				continue;
+
 			ExecCloseIndices(resultRelInfo);
 			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 		}
 
-		/* Release the standalone partition tuple descriptor, if any */
+		/* Release the standalone partition tuple descriptors, if any */
+		if (ptr->root_tuple_slot)
+			ExecDropSingleTupleTableSlot(ptr->root_tuple_slot);
 		if (ptr->partition_tuple_slot)
 			ExecDropSingleTupleTableSlot(ptr->partition_tuple_slot);
 	}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index b1515dd..988ea00 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -204,6 +204,7 @@ _copyModifyTable(const ModifyTable *from)
 	COPY_SCALAR_FIELD(canSetTag);
 	COPY_SCALAR_FIELD(nominalRelation);
 	COPY_NODE_FIELD(partitioned_rels);
+	COPY_SCALAR_FIELD(partKeyUpdated);
 	COPY_NODE_FIELD(resultRelations);
 	COPY_SCALAR_FIELD(resultRelIndex);
 	COPY_SCALAR_FIELD(rootResultRelIndex);
@@ -2262,6 +2263,7 @@ _copyPartitionedChildRelInfo(const PartitionedChildRelInfo *from)
 
 	COPY_SCALAR_FIELD(parent_relid);
 	COPY_NODE_FIELD(child_rels);
+	COPY_SCALAR_FIELD(is_partition_key_update);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 2e869a9..b4b7639 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -908,6 +908,7 @@ _equalPartitionedChildRelInfo(const PartitionedChildRelInfo *a, const Partitione
 {
 	COMPARE_SCALAR_FIELD(parent_relid);
 	COMPARE_NODE_FIELD(child_rels);
+	COMPARE_SCALAR_FIELD(is_partition_key_update);
 
 	return true;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index b59a521..78a367d 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -372,6 +372,7 @@ _outModifyTable(StringInfo str, const ModifyTable *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(partKeyUpdated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_INT_FIELD(resultRelIndex);
 	WRITE_INT_FIELD(rootResultRelIndex);
@@ -2104,6 +2105,7 @@ _outModifyTablePath(StringInfo str, const ModifyTablePath *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(partKeyUpdated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_NODE_FIELD(subroots);
@@ -2526,6 +2528,7 @@ _outPartitionedChildRelInfo(StringInfo str, const PartitionedChildRelInfo *node)
 
 	WRITE_UINT_FIELD(parent_relid);
 	WRITE_NODE_FIELD(child_rels);
+	WRITE_BOOL_FIELD(is_partition_key_update);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 0d17ae8..e2c27e0 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1568,6 +1568,7 @@ _readModifyTable(void)
 	READ_BOOL_FIELD(canSetTag);
 	READ_UINT_FIELD(nominalRelation);
 	READ_NODE_FIELD(partitioned_rels);
+	READ_BOOL_FIELD(partKeyUpdated);
 	READ_NODE_FIELD(resultRelations);
 	READ_INT_FIELD(resultRelIndex);
 	READ_INT_FIELD(rootResultRelIndex);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 0e8463e..be0d162 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1364,7 +1364,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 			case RTE_RELATION:
 				if (rte->relkind == RELKIND_PARTITIONED_TABLE)
 					partitioned_rels =
-						get_partitioned_child_rels(root, rel->relid);
+						get_partitioned_child_rels(root, rel->relid, NULL);
 				break;
 			case RTE_SUBQUERY:
 				build_partitioned_rels = true;
@@ -1403,7 +1403,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		{
 			List	   *cprels;
 
-			cprels = get_partitioned_child_rels(root, childrel->relid);
+			cprels = get_partitioned_child_rels(root, childrel->relid, NULL);
 			partitioned_rels = list_concat(partitioned_rels,
 										   list_copy(cprels));
 		}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index f6c83d0..38c429d 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -279,6 +279,7 @@ static ProjectSet *make_project_set(List *tlist, Plan *subplan);
 static ModifyTable *make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool partKeyUpdated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
@@ -2373,6 +2374,7 @@ create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 							best_path->canSetTag,
 							best_path->nominalRelation,
 							best_path->partitioned_rels,
+							best_path->partKeyUpdated,
 							best_path->resultRelations,
 							subplans,
 							best_path->withCheckOptionLists,
@@ -6432,6 +6434,7 @@ static ModifyTable *
 make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool partKeyUpdated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam)
@@ -6458,6 +6461,7 @@ make_modifytable(PlannerInfo *root,
 	node->canSetTag = canSetTag;
 	node->nominalRelation = nominalRelation;
 	node->partitioned_rels = partitioned_rels;
+	node->partKeyUpdated = partKeyUpdated;
 	node->resultRelations = resultRelations;
 	node->resultRelIndex = -1;	/* will be set correctly in setrefs.c */
 	node->rootResultRelIndex = -1;	/* will be set correctly in setrefs.c */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e8bc15c..df3b599 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1101,6 +1101,7 @@ inheritance_planner(PlannerInfo *root)
 	Query	   *parent_parse;
 	Bitmapset  *parent_relids = bms_make_singleton(top_parentRTindex);
 	PlannerInfo **parent_roots = NULL;
+	bool		partColsUpdated = false;
 
 	Assert(parse->commandType != CMD_INSERT);
 
@@ -1172,7 +1173,8 @@ inheritance_planner(PlannerInfo *root)
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		nominalRelation = top_parentRTindex;
-		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex);
+		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex,
+													  &partColsUpdated);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
 	}
@@ -1512,6 +1514,7 @@ inheritance_planner(PlannerInfo *root)
 									 parse->canSetTag,
 									 nominalRelation,
 									 partitioned_rels,
+									 partColsUpdated,
 									 resultRelations,
 									 subpaths,
 									 subroots,
@@ -2123,6 +2126,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 										parse->canSetTag,
 										parse->resultRelation,
 										NIL,
+										false,
 										list_make1_int(parse->resultRelation),
 										list_make1(path),
 										list_make1(root),
@@ -6155,17 +6159,22 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 /*
  * get_partitioned_child_rels
  *		Returns a list of the RT indexes of the partitioned child relations
- *		with rti as the root parent RT index.
+ *		with rti as the root parent RT index.  Also sets *is_partition_key_update
+ *		to true if any of the root RTE's updated columns is a partition key.
  *
  * Note: This function might get called even for range table entries that
  * are not partitioned tables; in such a case, it will simply return NIL.
  */
 List *
-get_partitioned_child_rels(PlannerInfo *root, Index rti)
+get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   bool *is_partition_key_update)
 {
 	List	   *result = NIL;
 	ListCell   *l;
 
+	if (is_partition_key_update)
+		*is_partition_key_update = false;
+
 	foreach(l, root->pcinfo_list)
 	{
 		PartitionedChildRelInfo *pc = lfirst_node(PartitionedChildRelInfo, l);
@@ -6173,6 +6182,8 @@ get_partitioned_child_rels(PlannerInfo *root, Index rti)
 		if (pc->parent_relid == rti)
 		{
 			result = pc->child_rels;
+			if (is_partition_key_update)
+				*is_partition_key_update = pc->is_partition_key_update;
 			break;
 		}
 	}
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index a24e8ac..c6e1b9e 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -105,7 +105,8 @@ static void expand_partitioned_rtentry(PlannerInfo *root,
 						   RangeTblEntry *parentrte,
 						   Index parentRTindex, Relation parentrel,
 						   PlanRowMark *top_parentrc, LOCKMODE lockmode,
-						   List **appinfos, List **partitioned_child_rels);
+						   List **appinfos, List **partitioned_child_rels,
+						   bool *is_partition_key_update);
 static void expand_single_inheritance_child(PlannerInfo *root,
 								RangeTblEntry *parentrte,
 								Index parentRTindex, Relation parentrel,
@@ -1467,16 +1468,19 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	if (RelationGetPartitionDesc(oldrelation) != NULL)
 	{
 		List	   *partitioned_child_rels = NIL;
+		bool		is_partition_key_update = false;
 
 		Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
 
 		/*
 		 * If this table has partitions, recursively expand them in the order
-		 * in which they appear in the PartitionDesc.
+		 * in which they appear in the PartitionDesc.  While at it, also note
+		 * whether any partition key columns are being updated.
 		 */
 		expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
 								   lockmode, &root->append_rel_list,
-								   &partitioned_child_rels);
+								   &partitioned_child_rels,
+								   &is_partition_key_update);
 
 		/*
 		 * We keep a list of objects in root, each of which maps a root
@@ -1493,6 +1497,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 			pcinfo = makeNode(PartitionedChildRelInfo);
 			pcinfo->parent_relid = rti;
 			pcinfo->child_rels = partitioned_child_rels;
+			pcinfo->is_partition_key_update = is_partition_key_update;
 			root->pcinfo_list = lappend(root->pcinfo_list, pcinfo);
 		}
 	}
@@ -1569,7 +1574,8 @@ static void
 expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 						   Index parentRTindex, Relation parentrel,
 						   PlanRowMark *top_parentrc, LOCKMODE lockmode,
-						   List **appinfos, List **partitioned_child_rels)
+						   List **appinfos, List **partitioned_child_rels,
+						   bool *is_partition_key_update)
 {
 	int			i;
 	RangeTblEntry *childrte;
@@ -1584,6 +1590,17 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 
 	Assert(parentrte->inh);
 
+	/*
+	 * Note down whether any partition key columns are being updated.  Though
+	 * it's the root partitioned table's updatedCols we are interested in, we
+	 * instead use parentrte to get the updatedCols. This is convenient because
+	 * parentrte already has the root partrel's updatedCols translated to match
+	 * the attribute ordering of parentrel.
+	 */
+	if (!*is_partition_key_update)
+		*is_partition_key_update =
+			has_partition_attrs(parentrel, parentrte->updatedCols, NULL);
+
 	/* First expand the partitioned table itself. */
 	expand_single_inheritance_child(root, parentrte, parentRTindex, parentrel,
 									top_parentrc, parentrel,
@@ -1623,7 +1640,8 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 		if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 			expand_partitioned_rtentry(root, childrte, childRTindex,
 									   childrel, top_parentrc, lockmode,
-									   appinfos, partitioned_child_rels);
+									   appinfos, partitioned_child_rels,
+									   is_partition_key_update);
 
 		/* Close child relation, but keep locks */
 		heap_close(childrel, NoLock);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 54126fb..ea207fe 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3265,6 +3265,8 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partitioned_rels' is an integer list of RT indexes of non-leaf tables in
  *		the partition tree, if this is an UPDATE/DELETE to a partitioned table.
  *		Otherwise NIL.
+ * 'partKeyUpdated' is true if any partitioning columns are being updated,
+ *		either from the named relation or a descendant partitioned table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
  * 'subpaths' is a list of Path(s) producing source data (one per rel)
  * 'subroots' is a list of PlannerInfo structs (one per rel)
@@ -3278,6 +3280,7 @@ ModifyTablePath *
 create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool partKeyUpdated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
@@ -3345,6 +3348,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->canSetTag = canSetTag;
 	pathnode->nominalRelation = nominalRelation;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
+	pathnode->partKeyUpdated = partKeyUpdated;
 	pathnode->resultRelations = resultRelations;
 	pathnode->subpaths = subpaths;
 	pathnode->subroots = subroots;
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 2983cfa..ff49ecc 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -54,12 +54,16 @@ extern void check_new_partition_bound(char *relname, Relation parent,
 extern Oid	get_partition_parent(Oid relid);
 extern List *get_qual_from_partbound(Relation rel, Relation parent,
 						PartitionBoundSpec *spec);
-extern List *map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+extern List *map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
-
+extern void pull_child_partition_columns(Relation rel,
+							 Relation parent,
+							 Bitmapset **partcols);
+extern bool has_partition_attrs(Relation rel, Bitmapset *attnums,
+							bool *used_in_expr);
 extern Oid	get_default_oid_from_partdesc(PartitionDesc partdesc);
 extern Oid	get_default_partition_oid(Oid parentId);
 extern void update_default_partition_oid(Oid parentId, Oid defaultPartId);
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 364d89f..4d00a3e 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -62,11 +62,14 @@ typedef struct PartitionDispatchData *PartitionDispatch;
  *								for every leaf partition in the partition tree.
  * num_partitions				Number of leaf partitions in the partition tree
  *								(= 'partitions' array length)
- * partition_tupconv_maps		Array of TupleConversionMap objects with one
+ * parentchild_tupconv_maps		Array of TupleConversionMap objects with one
  *								entry for every leaf partition (required to
  *								convert input tuple based on the root table's
  *								rowtype to a leaf partition's rowtype after
  *								tuple routing is done)
+ * subplan_partition_offsets	int array, ordered by UPDATE subplans.  Each
+ *								element of this array contains the index of the
+ *								corresponding partition in the 'partitions' array.
  * partition_tuple_slot			TupleTableSlot to be used to manipulate any
  *								given leaf partition's rowtype after that
  *								partition is chosen for insertion by
@@ -79,8 +82,10 @@ typedef struct PartitionTupleRouting
 	int			num_dispatch;
 	ResultRelInfo **partitions;
 	int			num_partitions;
-	TupleConversionMap **partition_tupconv_maps;
+	TupleConversionMap **parentchild_tupconv_maps;
+	int		   *subplan_partition_offsets;
 	TupleTableSlot *partition_tuple_slot;
+	TupleTableSlot *root_tuple_slot;
 } PartitionTupleRouting;
 
 extern void ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index b5578f5..5a385e2 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,9 +187,12 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
-extern void ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
 				   TupleTableSlot *slot, EState *estate);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 613872a..e910567 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -983,8 +983,10 @@ typedef struct ModifyTableState
 	/* controls transition table population for specified operation */
 	struct TransitionCaptureState *mt_oc_transition_capture;
 	/* controls transition table population for INSERT...ON CONFLICT UPDATE */
-	TupleConversionMap **mt_transition_tupconv_maps;
-	/* Per plan/partition tuple conversion */
+	TupleConversionMap **mt_childparent_tupconv_maps;
+	/* Per plan/partition map for tuple conversion from child to root */
+	bool		mt_is_tupconv_perpart;	/* Is the above map per-partition? */
+	/* Stores position of update result rels in leaf partitions */
 } ModifyTableState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 02fb366..6fc368a 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -219,6 +219,7 @@ typedef struct ModifyTable
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		partKeyUpdated;	/* some part key in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	int			resultRelIndex; /* index of first resultRel in plan's list */
 	int			rootResultRelIndex; /* index of the partitioned table root */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 1108b6a..197e523 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1673,6 +1673,7 @@ typedef struct ModifyTablePath
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		partKeyUpdated;	/* some part key in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	List	   *subpaths;		/* Path(s) producing source data */
 	List	   *subroots;		/* per-target-table PlannerInfos */
@@ -2123,6 +2124,9 @@ typedef struct PartitionedChildRelInfo
 
 	Index		parent_relid;
 	List	   *child_rels;
+	bool		is_partition_key_update;	/* is the partition key of any of
+											 * the partitioned tables
+											 * updated? */
 } PartitionedChildRelInfo;
 
 /*
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 99f65b4..9b739ec 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -241,6 +241,7 @@ extern ModifyTablePath *create_modifytable_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool partColsUpdated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index 2801bfd..9f0533c 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -57,7 +57,8 @@ extern Expr *preprocess_phv_expression(PlannerInfo *root, Expr *expr);
 
 extern bool plan_cluster_use_sort(Oid tableOid, Oid indexOid);
 
-extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti);
+extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   bool *is_partition_key_update);
 extern List *get_partitioned_child_rels_for_join(PlannerInfo *root,
 									Relids join_relids);
 
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index b69ceaa..0dfd3a6 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,36 +198,441 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+---------------------------
+-- UPDATE with row movement
+---------------------------
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition.  Updatable views using partitions should enforce the check
+-- options for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+create table part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+alter table range_parted attach partition part_b_20_b_30 for values from ('b', 20) to ('b', 30);
+create table part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+alter table range_parted attach partition part_b_10_b_20 for values from ('b', 10) to ('b', 20);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) partition by range (abs(d));
+alter table part_c_100_200 drop column e, drop column c, drop column a;
+alter table part_c_100_200 add column c numeric, add column e varchar, add column a text;
+alter table part_c_100_200 drop column b;
+alter table part_c_100_200 add column b bigint;
+create table part_d_1_15 partition of part_c_100_200 for values from (1) to (15);
+create table part_d_15_20 partition of part_c_100_200 for values from (15) to (20);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+(6 rows)
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+             QUERY PLAN              
+-------------------------------------
+ Update on range_parted
+   Update on part_a_1_a_10
+   Update on part_a_10_a_20
+   Update on part_b_1_b_10
+   Update on part_c_1_100
+   Update on part_d_1_15
+   Update on part_d_15_20
+   Update on part_b_20_b_30
+   ->  Seq Scan on part_a_1_a_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_a_10_a_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_1_b_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_1_100
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_1_15
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_15_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_20_b_30
+         Filter: (c > '97'::numeric)
+(22 rows)
+
+-- fail (row movement happens only within the partition subtree):
+update part_c_100_200 set c = c - 20, d = c where c = 105;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (105, 85, null, b, 15).
+-- fail (no partition key update, so no attempt to move tuple, but "a = 'a'" violates partition constraint enforced by root partition)
+update part_b_10_b_20 set a = 'a';
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (null, 1, 96, 12, a).
+-- success; partition key update, no constraint violation
+update range_parted set d = d - 10 where d > 10;
+-- success; no partition key update, no constraint violation
+update range_parted set e = d;
+-- No row found:
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d | e 
+----------------+---+----+-----+---+---
+ part_a_10_a_20 | a | 10 | 200 | 1 | 1
+ part_a_1_a_10  | a |  1 |   1 | 1 | 1
+ part_d_1_15    | b | 12 | 116 | 1 | 1
+ part_d_1_15    | b | 13 | 117 | 2 | 2
+ part_d_1_15    | b | 15 | 125 | 6 | 6
+ part_d_1_15    | b | 17 | 125 | 9 | 9
+(6 rows)
+
+-- fail (row movement happens only within the partition subtree):
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_d_1_15" violates partition constraint
+DETAIL:  Failing row contains (2, 117, 2, b, 7).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_a_1_a_10 | a |  4 | 200 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+(6 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120, 1, 1).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120, 1, 1).
+-- succeeds, row movement, check option passes
+update upview set a = 'b', b = 15 where b = 4;
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+ part_d_1_15   | b | 15 | 199 | 1 | 1
+(6 rows)
+
+-- cleanup
+drop view upview;
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+ range_parted  | a | b  | c  | d  | e 
+---------------+---+----+----+----+---
+ (b,15,95,16,) | b | 15 | 95 | 16 | 
+ (b,17,95,19,) | b | 17 | 95 | 19 | 
+(2 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  95 | 16 | 
+ part_c_1_100   | b | 17 |  95 | 19 | 
+(6 rows)
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,110,1,), (b,13,98,2,), (b,15,106,16,), (b,17,106,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 13 |  98 |  2 | 
+ part_d_15_20   | b | 15 | 106 | 16 | 
+ part_d_15_20   | b | 17 | 106 | 19 | 
+ part_d_1_15    | b | 12 | 110 |  1 | 
+(6 rows)
+
+:init_range_parted;
+-- Enabling OLD TABLE capture for both DELETE as well as UPDATE stmt triggers
+-- should not cause DELETEd rows to be captured twice.  Similarly for
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,146,1,), (b,13,147,2,), (b,15,155,16,), (b,17,155,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_15_20   | b | 15 | 155 | 16 | 
+ part_d_15_20   | b | 17 | 155 | 19 | 
+ part_d_1_15    | b | 12 | 146 |  1 | 
+ part_d_1_15    | b | 13 | 147 |  2 | 
+(6 rows)
+
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+-- Install BR triggers on child partitions, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d1_15 before update or insert on part_d_1_15
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d15_20 before update or insert on part_d_15_20
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 15 |  98 |  2 | 
+ part_d_15_20   | b | 17 | 106 | 16 | 
+ part_d_15_20   | b | 19 | 106 | 19 | 
+ part_d_1_15    | b | 15 | 110 |  1 | 
+(6 rows)
+
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_15_20   | b | 17 | 155 | 16 | 
+ part_d_15_20   | b | 19 | 155 | 19 | 
+ part_d_1_15    | b | 15 | 146 |  1 | 
+ part_d_1_15    | b | 16 | 147 |  2 | 
+(6 rows)
+
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_d1_15 ON part_d_1_15;
+drop trigger trig_d15_20 ON part_d_15_20;
+drop function func_parted_mod_b();
+-- RLS policies with update-row-movement
+-----------------------------------------
+ALTER TABLE range_parted ENABLE ROW LEVEL SECURITY;
+create user regress_range_parted_user;
+GRANT ALL ON range_parted, mintab TO regress_range_parted_user;
+create policy seeall ON range_parted as PERMISSIVE for SELECT using ( true);
+create policy policy_range_parted on range_parted for UPDATE USING (true) WITH CHECK (c % 2 = 0);
+:init_range_parted;
+-- This should fail with RLS violation error while moving row from
+-- part_a_10_a_20 to part_d_1_15, because we are setting 'c' to an odd number.
+set session authorization regress_range_parted_user;
+update range_parted set a = 'b', c = 151 where a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy for table "range_parted"
+reset session authorization;
+-- Create a trigger on part_d_1_15
+create function func_d_1_15() returns trigger as $$
+begin
+   NEW.c = NEW.c + 1; -- Make even number odd, or vice versa
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_d_1_15 before insert on part_d_1_15
+   for each row execute procedure func_d_1_15();
+:init_range_parted;
+set session authorization regress_range_parted_user;
+-- Here, RLS checks should succeed while moving row from part_a_10_a_20 to
+-- part_d_1_15, because trigger makes 'c' value an even number.
+update range_parted set a = 'b', c = 151 where a = 'a' and c = 200;
+reset session authorization;
+:init_range_parted;
+set session authorization regress_range_parted_user;
+-- This should fail with RLS violation error because trigger makes 'c' value
+-- an odd number.
+update range_parted set a = 'b', c = 150 where a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy for table "range_parted"
+-- Cleanup
+reset session authorization;
+drop trigger trig_d_1_15 ON part_d_1_15;
+drop function func_d_1_15();
+-- Policy expression contains SubPlan
+reset session authorization;
+:init_range_parted;
+create policy policy_range_parted_subplan on range_parted AS RESTRICTIVE for UPDATE USING (true)
+    WITH CHECK ((SELECT range_parted.c <= c1 FROM mintab ));
+set session authorization regress_range_parted_user;
+-- Should fail because mintab has a row with c1 = 120
+update range_parted set a = 'b', c = 122 where a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy "policy_range_parted_subplan" for table "range_parted"
+-- Should pass
+update range_parted set a = 'b', c = 120 where a = 'a' and c = 200;
+-- RLS policy expression contains whole row.
+reset session authorization;
+:init_range_parted;
+create policy policy_range_parted_wholerow on range_parted AS RESTRICTIVE for UPDATE USING (true)
+   WITH CHECK (range_parted = row('b', 10, 112, 1, NULL)::range_parted);
+set session authorization regress_range_parted_user;
+-- Should succeed the RLS check
+update range_parted set a = 'b', c = 112 where a = 'a' and c = 200;
+reset session authorization;
+:init_range_parted;
+set session authorization regress_range_parted_user;
+-- The whole row RLS check should fail
+update range_parted set a = 'b', c = 116 where a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy "policy_range_parted_wholerow" for table "range_parted"
+-- Cleanup
+reset session authorization;
+drop policy policy_range_parted ON range_parted;
+drop policy policy_range_parted_subplan ON range_parted;
+drop policy policy_range_parted_wholerow ON range_parted;
+revoke all ON range_parted, mintab FROM regress_range_parted_user;
+drop user regress_range_parted_user;
+-- statement triggers with update row movement
+---------------------------------------------------
+:init_range_parted;
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_1_15
+create trigger d1_delete_trig
+  after delete on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_update_trig
+  after update on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_insert_trig
+  after insert on part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+create trigger d15_delete_trig
+  after delete on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_update_trig
+  after update on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_insert_trig
+  after insert on part_d_15_20 for each statement execute procedure trigfunc();
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+NOTICE:  trigger = parent_update_trig fired on table range_parted during UPDATE
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 150 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  55 | 16 | 
+ part_c_1_100   | b | 17 |  55 | 19 | 
+(6 rows)
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger d1_delete_trig ON part_d_1_15;
+drop trigger d1_update_trig ON part_d_1_15;
+drop trigger d1_insert_trig ON part_d_1_15;
+drop trigger d15_delete_trig ON part_d_15_20;
+drop trigger d15_update_trig ON part_d_15_20;
+drop trigger d15_insert_trig ON part_d_15_20;
+drop table mintab;
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
-                                  Table "public.part_def"
- Column |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description 
---------+---------+-----------+----------+---------+----------+--------------+-------------
- a      | text    |           |          |         | extended |              | 
- b      | integer |           |          |         | plain    |              | 
+                                       Table "public.part_def"
+ Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description 
+--------+-------------------+-----------+----------+---------+----------+--------------+-------------
+ a      | text              |           |          |         | extended |              | 
+ b      | bigint            |           |          |         | plain    |              | 
+ c      | numeric           |           |          |         | main     |              | 
+ d      | integer           |           |          |         | plain    |              | 
+ e      | character varying |           |          |         | extended |              | 
 Partition of: range_parted DEFAULT
-Partition constraint: (NOT ((a IS NOT NULL) AND (b IS NOT NULL) AND (((a = 'a'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'a'::text) AND (b >= 10) AND (b < 20)) OR ((a = 'b'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'b'::text) AND (b >= 10) AND (b < 20)))))
+Partition constraint: (NOT ((a IS NOT NULL) AND (b IS NOT NULL) AND (((a = 'a'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'a'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'b'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '20'::bigint) AND (b < '30'::bigint)))))
 
 insert into range_parted values ('c', 9);
 -- ok
@@ -235,7 +640,55 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 ERROR:  new row for relation "part_def" violates partition constraint
-DETAIL:  Failing row contains (a, 9).
+DETAIL:  Failing row contains (a, 9, null, null, null).
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from non-default to default partition.
+-- Fail, default partition is not under part_a_10_a_20;
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+ERROR:  new row for relation "part_a_10_a_20" violates partition constraint
+DETAIL:  Failing row contains (ad, 10, 200, 1, null).
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+:show_data;
+ partname | a  | b  |  c  | d  | e 
+----------+----+----+-----+----+---
+ part_def | ad |  1 |   1 |  1 | 
+ part_def | ad | 10 | 200 |  1 | 
+ part_def | bd | 12 |  96 |  1 | 
+ part_def | bd | 13 |  97 |  2 | 
+ part_def | bd | 15 | 105 | 16 | 
+ part_def | bd | 17 | 105 | 19 | 
+ part_def | d  |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
 create table list_parted (
 	a text,
 	b int
@@ -250,6 +703,111 @@ ERROR:  new row for relation "list_default" violates partition constraint
 DETAIL:  Failing row contains (a, 10).
 -- ok
 update list_default set a = 'x' where a = 'd';
+drop table list_parted;
+--------------
+-- UPDATE with partition key or non-partition columns,
+-- with different column ordering,
+-- and with triggers.
+--------------
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+-- Test partition constraint violation when an intermediate ancestor is used
+-- and the constraint is inherited from the upper root.
+update sub_parted set a = 2 where c = 10;
+ERROR:  new row for relation "sub_part2" violates partition constraint
+DETAIL:  Failing row contains (2, 10, 2).
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Triggers can cause UPDATE row movement if they modify the partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+drop trigger parted_mod_b ON sub_part1 ;
+-- If a BR DELETE trigger prevents the DELETE from happening, we should also
+-- skip the INSERT when that DELETE is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+drop trigger trig_skip_delete ON sub_part1 ;
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 |  1 | 70
+ list_part1 | 2 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop table non_parted;
+drop function func_parted_mod_b();
 -- create custom operator class and hash function, for the same reason
 -- explained in alter_table.sql
 create or replace function dummy_hashint4(a int4, seed int8) returns int8 as
@@ -271,9 +829,8 @@ insert into hpart4 values (3, 4);
 update hpart1 set a = 3, b=4 where a = 1;
 ERROR:  new row for relation "hpart1" violates partition constraint
 DETAIL:  Failing row contains (3, 4).
+-- ok : row movement
 update hash_parted set b = b - 1 where b = 1;
-ERROR:  new row for relation "hpart1" violates partition constraint
-DETAIL:  Failing row contains (1, 0).
 -- ok
 update hash_parted set b = b + 8 where b = 1;
 -- cleanup
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 0c70d64..53c6441 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,25 +107,311 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+
+---------------------------
+-- UPDATE with row movement
+---------------------------
+
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+create table part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+alter table range_parted attach partition part_b_20_b_30 for values from ('b', 20) to ('b', 30);
+create table part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+alter table range_parted attach partition part_b_10_b_20 for values from ('b', 10) to ('b', 20);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) partition by range (abs(d));
+alter table part_c_100_200 drop column e, drop column c, drop column a;
+alter table part_c_100_200 add column c numeric, add column e varchar, add column a text;
+alter table part_c_100_200 drop column b;
+alter table part_c_100_200 add column b bigint;
+create table part_d_1_15 partition of part_c_100_200 for values from (1) to (15);
+create table part_d_15_20 partition of part_c_100_200 for values from (15) to (20);
+
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+
+create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_100_200 set c = c - 20, d = c where c = 105;
+-- fail (no partition key update, so no attempt to move tuple, but "a = 'a'" violates partition constraint enforced by root partition)
+update part_b_10_b_20 set a = 'a';
+-- success; partition key update, no constraint violation
+update range_parted set d = d - 10 where d > 10;
+-- success; no partition key update, no constraint violation
+update range_parted set e = d;
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+:show_data;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+:show_data;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement, check option passes
+update upview set a = 'b', b = 15 where b = 4;
+
+:show_data;
+
+-- cleanup
+drop view upview;
+
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+:show_data;
+
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+
+-- Enabling OLD TABLE capture for both DELETE and UPDATE statement triggers
+-- should not cause DELETEd rows to be captured twice. The same applies to
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d1_15 before update or insert on part_d_1_15
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d15_20 before update or insert on part_d_15_20
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_d1_15 ON part_d_1_15;
+drop trigger trig_d15_20 ON part_d_15_20;
+drop function func_parted_mod_b();
+
+-- RLS policies with update-row-movement
+-----------------------------------------
+
+ALTER TABLE range_parted ENABLE ROW LEVEL SECURITY;
+create user regress_range_parted_user;
+GRANT ALL ON range_parted, mintab TO regress_range_parted_user;
+create policy seeall ON range_parted as PERMISSIVE for SELECT using ( true);
+create policy policy_range_parted on range_parted for UPDATE USING (true) WITH CHECK (c % 2 = 0);
+
+:init_range_parted;
+-- This should fail with an RLS violation error while moving the row from
+-- part_a_10_a_20 to part_d_1_15, because we are setting 'c' to an odd number.
+set session authorization regress_range_parted_user;
+update range_parted set a = 'b', c = 151 where a = 'a' and c = 200;
+
+reset session authorization;
+-- Create a trigger on part_d_1_15
+create function func_d_1_15() returns trigger as $$
+begin
+   NEW.c = NEW.c + 1; -- Make even number odd, or vice versa
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_d_1_15 before insert on part_d_1_15
+   for each row execute procedure func_d_1_15();
+
+:init_range_parted;
+set session authorization regress_range_parted_user;
+
+-- Here, RLS checks should succeed while moving row from part_a_10_a_20 to
+-- part_d_1_15, because trigger makes 'c' value an even number.
+update range_parted set a = 'b', c = 151 where a = 'a' and c = 200;
+
+reset session authorization;
+:init_range_parted;
+set session authorization regress_range_parted_user;
+-- This should fail with RLS violation error because trigger makes 'c' value
+-- an odd number.
+update range_parted set a = 'b', c = 150 where a = 'a' and c = 200;
+
+-- Cleanup
+reset session authorization;
+drop trigger trig_d_1_15 ON part_d_1_15;
+drop function func_d_1_15();
+
+-- Policy expression contains SubPlan
+reset session authorization;
+:init_range_parted;
+create policy policy_range_parted_subplan on range_parted AS RESTRICTIVE for UPDATE USING (true)
+    WITH CHECK ((SELECT range_parted.c <= c1 FROM mintab ));
+set session authorization regress_range_parted_user;
+-- Should fail because mintab has row with c1 = 120
+update range_parted set a = 'b', c = 122 where a = 'a' and c = 200;
+-- Should pass
+update range_parted set a = 'b', c = 120 where a = 'a' and c = 200;
+
+-- RLS policy expression contains whole row.
+
+reset session authorization;
+:init_range_parted;
+create policy policy_range_parted_wholerow on range_parted AS RESTRICTIVE for UPDATE USING (true)
+   WITH CHECK (range_parted = row('b', 10, 112, 1, NULL)::range_parted);
+set session authorization regress_range_parted_user;
+-- Should succeed the RLS check
+update range_parted set a = 'b', c = 112 where a = 'a' and c = 200;
+reset session authorization;
+:init_range_parted;
+set session authorization regress_range_parted_user;
+-- The whole row RLS check should fail
+update range_parted set a = 'b', c = 116 where a = 'a' and c = 200;
+
+-- Cleanup
+reset session authorization;
+drop policy policy_range_parted ON range_parted;
+drop policy policy_range_parted_subplan ON range_parted;
+drop policy policy_range_parted_wholerow ON range_parted;
+revoke all ON range_parted, mintab FROM regress_range_parted_user ;
+drop user regress_range_parted_user;
+
+
+-- statement triggers with update row movement
+---------------------------------------------------
+
+:init_range_parted;
+
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_d_1_15
+create trigger d1_delete_trig
+  after delete on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_update_trig
+  after update on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_insert_trig
+  after insert on part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+create trigger d15_delete_trig
+  after delete on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_update_trig
+  after update on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_insert_trig
+  after insert on part_d_15_20 for each statement execute procedure trigfunc();
+
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+:show_data;
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger d1_delete_trig ON part_d_1_15;
+drop trigger d1_update_trig ON part_d_1_15;
+drop trigger d1_insert_trig ON part_d_1_15;
+drop trigger d15_delete_trig ON part_d_15_20;
+drop trigger d15_update_trig ON part_d_15_20;
+drop trigger d15_insert_trig ON part_d_15_20;
+
+drop table mintab;
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
 
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
 insert into range_parted values ('c', 9);
@@ -134,6 +420,21 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 
+:show_data;
+
+-- Update row movement from non-default to default partition.
+-- Fail, default partition is not under part_a_10_a_20;
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+:show_data;
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+:show_data;
+
 create table list_parted (
 	a text,
 	b int
@@ -148,6 +449,84 @@ update list_default set a = 'a' where a = 'd';
 -- ok
 update list_default set a = 'x' where a = 'd';
 
+drop table list_parted;
+
+--------------
+-- UPDATE with partition key or non-partition columns,
+-- with different column ordering,
+-- and with triggers.
+--------------
+
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+
+-- Test partition constraint violation when an intermediate ancestor is used
+-- and the constraint is inherited from the upper root.
+update sub_parted set a = 2 where c = 10;
+
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+
+
+-----------
+-- Triggers can cause UPDATE row movement if they modify the partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger parted_mod_b ON sub_part1 ;
+
+-- If a BR DELETE trigger prevents the DELETE from happening, we should also
+-- skip the INSERT when that DELETE is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger trig_skip_delete ON sub_part1 ;
+
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+drop table non_parted;
+
+drop function func_parted_mod_b();
+
 -- create custom operator class and hash function, for the same reason
 -- explained in alter_table.sql
 create or replace function dummy_hashint4(a int4, seed int8) returns int8 as
@@ -169,6 +548,7 @@ insert into hpart4 values (3, 4);
 
 -- fail
 update hpart1 set a = 3, b=4 where a = 1;
+-- ok : row movement
 update hash_parted set b = b - 1 where b = 1;
 -- ok
 update hash_parted set b = b + 8 where b = 1;
#210Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Khandekar (#209)
Re: [HACKERS] UPDATE of partition key

Thanks for the updated patches, Amit.

Some review comments.

It looks like you forgot to remove the description of update_rri and
num_update_rri from the header comment of ExecSetupPartitionTupleRouting().

-
+extern void pull_child_partition_columns(Relation rel,
+                             Relation parent,
+                             Bitmapset **partcols);

It seems you forgot to remove this declaration in partition.h, because I
don't find it defined or used anywhere.

I think some of the changes that are currently part of the main patch are
better taken out into their own patches, because having those diffs appear
in the main patch is kind of distracting. Just like you now have a patch
that introduces a PartitionTupleRouting structure. I know that leads to
too many patches, but it helps to easily tell less substantial changes
from the substantial ones.

1. Patch to rename partition_tupconv_maps to parentchild_tupconv_maps.

2. Patch that introduces has_partition_attrs() in place of
is_partition_attr()

3. Patch to change the names of map_partition_varattnos() arguments

4. Patch that does the refactoring involving ExecConstrains(),
ExecPartitionCheck(), and the introduction of
ExecPartitionCheckEmitError()

Regarding ExecSetupChildParentMap(), it seems to me that it could simply
be declared as

static void ExecSetupChildParentMap(ModifyTableState *mtstate);

Looking at the places it's called from, it seems that you're just
extracting information from mtstate and passing it along for the rest of
its arguments.

mt_is_tupconv_perpart seems like it's unnecessary. Its function could be
fulfilled by inspecting the state of some other fields of
ModifyTableState. For example, in the case of an update (operation ==
CMD_UPDATE), if mt_partition_tuple_routing is non-NULL, then we can always
assume that mt_childparent_tupconv_maps has entries for all partitions.
If it's NULL, then there would be only entries for partitions that have
sub-plans.
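A toy encoding of that inference, with mocked-up types (the names below are
hypothetical stand-ins for illustration only, not the backend's real
definitions, which carry far more state):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for CmdType and ModifyTableState; only the two
 * fields the proposed rule inspects are modeled here. */
typedef enum { MOCK_CMD_INSERT, MOCK_CMD_UPDATE } MockCmdType;

typedef struct MockModifyTableState
{
    MockCmdType operation;
    void       *mt_partition_tuple_routing; /* non-NULL once routing is set up */
} MockModifyTableState;

/*
 * The rule suggested above: the conversion-map array has entries for all
 * partitions only for an UPDATE with tuple routing; otherwise it only
 * covers partitions that have subplans.
 */
static bool
tupconv_maps_cover_all_partitions(const MockModifyTableState *mtstate)
{
    return mtstate->operation == MOCK_CMD_UPDATE &&
           mtstate->mt_partition_tuple_routing != NULL;
}
```

If that inference holds at every call site, the extra boolean field carries
no information the state doesn't already encode.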

tupconv_map_for_subplan() looks like it could be done as a macro.

Thanks,
Amit

#211Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#209)
Re: [HACKERS] UPDATE of partition key

On Wed, Dec 13, 2017 at 5:18 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Amit Langote informed me off-list, - along with suggestions for
changes - that my patch needs a rebase. Attached is the rebased
version. I have also bumped the patch version number (now v29),
because this has additional changes, again suggested by Amit L:
Because ExecSetupPartitionTupleRouting() has mtstate parameter now,
no need to pass update_rri and num_update_rri, since they can be
retrieved from mtstate.

Also, the preparatory patch is also rebased.

Reviewing the preparatory patch:

+ PartitionTupleRouting *partition_tuple_routing;
+ /* Tuple-routing support info */

Something's wrong with the formatting here.

-    PartitionDispatch **pd,
-    ResultRelInfo ***partitions,
-    TupleConversionMap ***tup_conv_maps,
-    TupleTableSlot **partition_tuple_slot,
-    int *num_parted, int *num_partitions)
+    PartitionTupleRouting **partition_tuple_routing)

Since we're consolidating all of ExecSetupPartitionTupleRouting's
output parameters into a single structure, I think it might make more
sense to have it just return that value. I think it's only done with an
output parameter today because there are so many different things being
produced that we can't return them all.

+ PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;

This is just nitpicking, but I don't find "ptr" to be the greatest
variable name; it looks too much like "pointer". Maybe we could use
"routing" or "proute" or something.

It seems to me that we could improve things here by adding a function
ExecCleanupTupleRouting(PartitionTupleRouting *) which would do the
various heap_close(), ExecDropSingleTupleTableSlot(), and
ExecCloseIndices() operations which are currently performed in
CopyFrom() and, by separate code, in ExecEndModifyTable().
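Roughly the suggested shape, sketched standalone with toy stand-ins
(malloc/free here stand in for heap_open/heap_close and the slot
management; none of these names are the real ones):

```c
#include <assert.h>
#include <stdlib.h>

/* Invented stand-in: a routing struct owning several resources. */
typedef struct ToyRouting
{
	int		   *partitions;		/* stands in for the opened relations */
	int		   *tuple_slot;		/* stands in for the dedicated slot */
	int			num_partitions;
} ToyRouting;

static ToyRouting *
toy_setup_routing(int nparts)
{
	ToyRouting *r = malloc(sizeof(ToyRouting));

	r->num_partitions = nparts;
	r->partitions = calloc(nparts, sizeof(int));
	r->tuple_slot = malloc(sizeof(int));
	return r;
}

/*
 * One cleanup entry point, analogous to the suggested
 * ExecCleanupTupleRouting(): both CopyFrom() and ExecEndModifyTable()
 * would call this instead of each duplicating the teardown.
 */
static void
toy_cleanup_routing(ToyRouting *r)
{
	free(r->partitions);		/* the heap_close() loop in the real thing */
	free(r->tuple_slot);		/* ExecDropSingleTupleTableSlot() */
	free(r);
}
```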

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#212Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#211)
Re: [HACKERS] UPDATE of partition key

On Fri, Dec 15, 2017 at 7:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Reviewing the preparatory patch:

I started another review pass over the main patch, so here are some
comments about that. This is unfortunately not a complete review,
however.

- map = ptr->partition_tupconv_maps[leaf_part_index];
+ map = ptr->parentchild_tupconv_maps[leaf_part_index];

I don't think there's any reason to rename this. In previous patch
versions, you had multiple arrays of tuple conversion maps in this
structure, but the refactoring eliminated that.

Likewise, I'm not sure I get the point of mt_transition_tupconv_maps
-> mt_childparent_tupconv_maps. That seems like it could similarly be
left alone.

+ /*
+ * If transition tables are the only reason we're here, return. As
+ * mentioned above, we can also be here during update tuple routing in
+ * presence of transition tables, in which case this function is called
+ * separately for oldtup and newtup, so either can be NULL, not both.
+ */
  if (trigdesc == NULL ||
  (event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
  (event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
- (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+ (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+ (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))

I guess this is correct, but it seems awfully fragile. Can't we have
some more explicit signaling about whether we're only here for
transition tables, rather than deducing it based on exactly one of
oldtup and newtup being NULL?
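For illustration, the difference between deducing the situation and
signaling it explicitly, in a standalone toy (all names invented):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Tup
{
	int			id;
} Tup;

/* Fragile: deduce "only here for transition capture" from the arguments. */
static bool
only_for_transition_deduced(const Tup *oldtup, const Tup *newtup)
{
	return (oldtup == NULL) ^ (newtup == NULL);
}

/* Explicit: the caller states its intent directly via a flag. */
static bool
only_for_transition_explicit(bool transition_capture_only)
{
	return transition_capture_only;
}
```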

+ /* Initialization specific to update */
+ if (mtstate && mtstate->operation == CMD_UPDATE)
+ {
+ ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+
+ is_update = true;
+ update_rri = mtstate->resultRelInfo;
+ num_update_rri = list_length(node->plans);
+ }

I guess I don't see why we need a separate "if" block for this.
None of is_update, update_rri, or num_update_rri is used until we
get to the block that begins with "if (is_update)". Why not just
change that block to test "if (mtstate && mtstate->operation ==
CMD_UPDATE)" and put the rest of these initializations inside that
block?

+ int num_update_rri = 0,
+ update_rri_index = 0;
...
+ update_rri_index = 0;

It's already 0.

+ leaf_part_rri = &update_rri[update_rri_index];
...
+ leaf_part_rri = leaf_part_arr + i;

These are doing the same kind of thing, but using different styles. I
prefer the former style, so I'd change the second one to
&leaf_part_arr[i]. Alternatively, you could change the first one to
update_rri + update_rri_index. But it's strange to see the same
variable initialized in two different ways just a few lines apart.

+ if (!partrel)
+ {
+ /*
+ * We locked all the partitions above including the leaf
+ * partitions. Note that each of the newly opened relations in
+ * *partitions are eventually closed by the caller.
+ */
+ partrel = heap_open(leaf_oid, NoLock);
+ InitResultRelInfo(leaf_part_rri,
+   partrel,
+   resultRTindex,
+   rel,
+   estate->es_instrument);
+ }

Hmm, isn't there a problem here? Before, we opened all the relations
here and the caller closed them all. But now, we're only opening some
of them. If the caller closes them all, then they will be closing
some that we opened and some that we didn't. That seems quite bad,
because the reference counts that are incremented and decremented by
opening and closing should all end up at 0. Maybe I'm confused
because it seems like this would break in any scenario where even 1
relation was already opened and surely you must have tested that
case... but if there's some reason this works, I don't know what it
is, and the comment doesn't tell me.
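A toy model of the refcount concern (invented names; the counters
stand in for relcache reference counts bumped by heap_open and
dropped by heap_close):

```c
#include <assert.h>

/*
 * Every "open" must be paired with exactly one "close".  If the setup
 * function opens only some relations but the caller closes all of
 * them, the counts for the ones it never opened go negative.
 */
#define NPARTS 4

static int	refcount[NPARTS];

static void
toy_open(int i)
{
	refcount[i]++;
}

static void
toy_close(int i)
{
	refcount[i]--;
}
```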

+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+   TupleConversionMap *map,
+   HeapTuple tuple,
+   TupleTableSlot *new_slot,
+   TupleTableSlot **p_my_slot)

This function doesn't use the mtstate argument at all.

+ * (Similarly we need to add the deleted row in OLD TABLE). We need to do

The period should be before, not after, the closing parenthesis.

+ * Now that we have already captured NEW TABLE row, any AR INSERT
+ * trigger should not again capture it below. Arrange for the same.

A more American style would be something like "We've already captured
the NEW TABLE row, so make sure any AR INSERT trigger fired below
doesn't capture it again." (Similarly for the other case.)

+ /* The delete has actually happened, so inform that to the caller */
+ if (tuple_deleted)
+ *tuple_deleted = true;

In the US, we inform the caller, not inform that to the caller. In
other words, here the direct object of "inform" is the person or thing
getting the information (in this case, "the caller"), not the
information being conveyed (in this case, "that"). I realize your
usage is probably typical for your country...

+ Assert(mtstate->mt_is_tupconv_perpart == true);

We usually just Assert(thing_that_should_be_true), not
Assert(thing_that_should_be_true == true).

+ * In case this is part of update tuple routing, put this row into the
+ * transition OLD TABLE if we are capturing transition tables. We need to
+ * do this separately for DELETE and INSERT because they happen on
+ * different tables.

Maybe "...OLD table, but only if we are..."

Should it be capturing transition tables or capturing transition
tuples? I'm not sure.

+ * partition, in which case, we should check the RLS CHECK policy just

In the US, the second comma in this sentence is incorrect and should be removed.

+ * When an UPDATE is run with a leaf partition, we would not have
+ * partition tuple routing setup. In that case, fail with

run with -> run on
would not -> will not
setup -> set up

+ * deleted by another transaction), then we should skip INSERT as
+ * well, otherwise, there will be effectively one new row inserted.

skip INSERT -> skip the insert
well, otherwise -> well; otherwise

I would also change "there will be effectively one new row inserted"
to "an UPDATE could cause an increase in the total number of rows
across all partitions, which is clearly wrong".

+ /*
+ * UPDATEs set the transition capture map only when a new subplan
+ * is chosen.  But for INSERTs, it is set for each row. So after
+ * INSERT, we need to revert back to the map created for UPDATE;
+ * otherwise the next UPDATE will incorrectly use the one created
+ * for INESRT.  So first save the one created for UPDATE.
+ */
+ if (mtstate->mt_transition_capture)
+ saved_tcs_map = mtstate->mt_transition_capture->tcs_map;

UPDATEs -> Updates
INESRT -> INSERT

I wonder if there is some more elegant way to handle this problem.
Basically, the issue is that ExecInsert() is stomping on
mtstate->mt_transition_capture, and your solution is to save and
restore the value you want to have there. But maybe we could instead
find a way to get ExecInsert() not to stomp on that state in the first
place. It seems like the ON CONFLICT stuff handled that by adding a
second TransitionCaptureState pointer to ModifyTable, thus
mt_transition_capture and mt_oc_transition_capture. By that
precedent, we could add mt_utr_transition_capture or similar, and
maybe that's the way to go. It seems a bit unsatisfying, but so does
what you have now.
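For illustration, a toy version of the dedicated-field idea
(mt_utr_transition_capture here is hypothetical, as is everything
else in this sketch):

```c
#include <assert.h>
#include <stddef.h>

/* Invented stand-ins for the structures involved. */
typedef struct ToyCapture
{
	int			tcs_map;
} ToyCapture;

typedef struct ToyMTState
{
	ToyCapture *mt_transition_capture;	/* used by the UPDATE itself */
	ToyCapture *mt_utr_transition_capture;	/* hypothetical: dedicated to
											 * update-tuple-routing inserts */
} ToyMTState;

/*
 * With a dedicated capture state, the insert path stomps only on its
 * own field; the UPDATE's map survives untouched, so no save/restore
 * dance is needed.
 */
static void
toy_insert_during_update(ToyMTState *mt, int new_map)
{
	if (mt->mt_utr_transition_capture != NULL)
		mt->mt_utr_transition_capture->tcs_map = new_map;
}
```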

+ * 2. For capturing transition tables that are partitions. For UPDATEs, we need

This isn't worded well. A transition table is never a partition;
transition tables and partitions are two different kinds of things.

+ * If per-leaf map is required and the map is already created, that map
+ * has to be per-leaf. If that map is per-subplan, we won't be able to
+ * access the maps leaf-partition-wise. But if the map is per-leaf, we
+ * will be able to access the maps subplan-wise using the
+ * subplan_partition_offsets map using function
+ * tupconv_map_for_subplan().  So if the callers might need to access
+ * the map both leaf-partition-wise and subplan-wise, they should make
+ * sure that the first time this function is called, it should be
+ * called with perleaf=true so that the map created is per-leaf, not
+ * per-subplan.

This sounds complicated and fragile. It ends up meaning that
mt_childparent_tupconv_maps is sometimes indexed by subplan number and
sometimes by partition leaf index, which is extremely confusing and
likely to lead to coding errors, either in this patch or in future
ones. Would it be reasonable to just always do this by partition leaf
index, even if we don't strictly need that set of mappings?

That's all I've got for now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#213Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Langote (#210)
5 attachment(s)
Re: [HACKERS] UPDATE of partition key

On 14 December 2017 at 08:11, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

Forgot to remove the description of update_rri and num_update_rri in the
header comment of ExecSetupPartitionTupleRouting().

-
+extern void pull_child_partition_columns(Relation rel,
+                             Relation parent,
+                             Bitmapset **partcols);

It seems you forgot to remove this declaration in partition.h, because I
don't find it defined or used anywhere.

Done, both of the above. The attached v30 patch has these changes.

I think some of the changes that are currently part of the main patch are
better taken out into their own patches, because having those diffs appear
in the main patch is kind of distracting. Just like you now have a patch
that introduces a PartitionTupleRouting structure. I know that leads to
too many patches, but it helps to easily tell less substantial changes
from the substantial ones.

Done. Created patches as shown below:

1. Patch to rename partition_tupconv_maps to parentchild_tupconv_maps.

As per Robert's suggestion, reverted the renaming of this field.

2. Patch that introduces has_partition_attrs() in place of
is_partition_attr()

0002-Changed-is_partition_attr-to-has_partition_attrs.patch

3. Patch to change the names of map_partition_varattnos() arguments

0003-Renaming-parameters-of-map_partition_var_attnos.patch

4. Patch that does the refactoring involving ExecConstrains(),
ExecPartitionCheck(), and the introduction of
ExecPartitionCheckEmitError()

0004-Refactor-CheckConstraint-related-code.patch

The preparatory patches are to be applied in order of the patch
numbers, followed by the main patch update-partition-key_v30.patch

Regarding ExecSetupChildParentMap(), it seems to me that it could simply
be declared as

static void ExecSetupChildParentMap(ModifyTableState *mtstate);

Looking at the places from where it's called, it seems that you're just
extracting information from mtstate and passing the same for the rest of
its arguments.

Agreed. But the last parameter per_leaf might be necessary. I will
defer this until I address Robert's concern about the complexity of
the related code.

mt_is_tupconv_perpart seems like it's unnecessary. Its function could be
fulfilled by inspecting the state of some other fields of
ModifyTableState. For example, in the case of an update (operation ==
CMD_UPDATE), if mt_partition_tuple_routing is non-NULL, then we can always
assume that mt_childparent_tupconv_maps has entries for all partitions.
If it's NULL, then there would be only entries for partitions that have
sub-plans.

I think we'd better keep this field separate, for code clarity, to
avoid repeatedly evaluating multiple conditions, and to allow some
significant Asserts() that use this field.

tupconv_map_for_subplan() looks like it could be done as a macro.

Or maybe an inline function. I will again defer this, for the same
reason as the deferred item above about ExecSetupChildParentMap's
parameters.
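A standalone sketch of what the inline-function version might look
like, assuming the per-leaf maps array plus the
subplan_partition_offsets index described in the thread (types and
field names are simplified stand-ins, not the actual ones):

```c
#include <assert.h>

typedef struct ToyMap
{
	int			id;
} ToyMap;

typedef struct ToyState
{
	ToyMap	  **childparent_maps;	/* indexed by leaf partition */
	int		   *subplan_offsets;	/* subplan number -> leaf index */
} ToyState;

/*
 * Unlike a macro, a static inline function type-checks its arguments
 * and evaluates them exactly once.
 */
static inline ToyMap *
toy_map_for_subplan(ToyState *s, int whichplan)
{
	return s->childparent_maps[s->subplan_offsets[whichplan]];
}
```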

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

update-partition-key_v30.patch (application/octet-stream)
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index b1167a4..9d21f9a 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3005,6 +3005,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition where this row satisfies its partition constraint.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3302,9 +3307,22 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</command> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</command> causes a row to move from one
+       partition to another, there is a chance that another concurrent
+       <command>UPDATE</command> or <command>DELETE</command> misses this row.
+       Suppose, session 1 is performing an <command>UPDATE</command> on a
+       partition key, and meanwhile a concurrent session 2 for which this row
+       is visible, performs an <command>UPDATE</command> or
+       <command>DELETE</command> operation on this row. Session 2 can silently
+       miss the row if the row is deleted from the partition due to session
+       1's activity.  In such case, session 2's
+       <command>UPDATE</command>/<command>DELETE</command>, being unaware of
+       the row movement, interprets that the row has just been deleted so there
+       is nothing to be done for this row. Whereas, in the usual case where the
+       table is not partitioned, or where there is no row movement, session 2
+       would have identified the newly updated row and carried
+       <command>UPDATE</command>/<command>DELETE</command> on this new row
+       version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index c0d0f71..3c665f0 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</command> and
+   <command>INSERT</command> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</command> or <command>DELETE</command> on the
+   same row may miss this row. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index bf5d3f9..aaffc4d 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -154,6 +154,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to move
+    to another partition, it will be performed as a <command>DELETE</command>
+    from the original partition followed by <command>INSERT</command> into the
+    new partition. In this case, all row-level <literal>BEFORE</literal>
+    <command>UPDATE</command> triggers and all row-level
+    <literal>BEFORE</literal> <command>DELETE</command> triggers are fired on
+    the original partition. Then all row-level <literal>BEFORE</literal>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</literal>
+    triggers are concerned, <literal>AFTER</literal> <command>DELETE</command>
+    and <literal>AFTER</literal> <command>INSERT</command> triggers are
+    applied; but <literal>AFTER</literal> <command>UPDATE</command> triggers
+    are not applied because the <command>UPDATE</command> has been converted to
+    a <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 92ae382..73ec872 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2854,8 +2854,13 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 	{
 		HeapTuple	trigtuple;
 
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
+		/*
+		 * Note: if the UPDATE is converted into a DELETE+INSERT as part of
+		 * update-partition-key operation, then this function is also called
+		 * separately for DELETE and INSERT to capture transition table rows.
+		 * In such case, either old tuple or new tuple can be NULL.
+		 */
+		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
 			trigtuple = GetTupleForTrigger(estate,
 										   NULL,
 										   relinfo,
@@ -5414,7 +5419,12 @@ AfterTriggerPendingOnRel(Oid relid)
  *	triggers actually need to be queued.  It is also called after each row,
  *	even if there are no triggers for that event, if there are any AFTER
  *	STATEMENT triggers for the statement which use transition tables, so that
- *	the transition tuplestores can be built.
+ *	the transition tuplestores can be built.  Furthermore, if the transition
+ *	capture is happening for UPDATEd rows being moved to another partition due
+ *	partition-key change, then this function is called once when the row is
+ *	deleted (to capture OLD row), and once when the row is inserted to another
+ *	partition (to capture NEW row).  This is done separately because DELETE and
+ *	INSERT happen on different tables.
  *
  *	Transition tuplestores are built now, rather than when events are pulled
  *	off of the queue because AFTER ROW triggers are allowed to select from the
@@ -5463,12 +5473,25 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		bool		update_new_table = transition_capture->tcs_update_new_table;
 		bool		insert_new_table = transition_capture->tcs_insert_new_table;;
 
-		if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_old_table))
+		/*
+		 * For INSERT events newtup should be non-NULL, for DELETE events
+		 * oldtup should be non-NULL, whereas for UPDATE events normally both
+		 * oldtup and newtup are non-NULL.  But for UPDATE event fired for
+		 * capturing transition tuples during UPDATE partition-key row
+		 * movement, oldtup is NULL when the event is for row being inserted,
+		 * whereas newtup is NULL when the event is for row being deleted.
+		 */
+		Assert(!(event == TRIGGER_EVENT_DELETE && delete_old_table &&
+				 oldtup == NULL));
+		Assert(!(event == TRIGGER_EVENT_INSERT && insert_new_table &&
+				 newtup == NULL));
+
+		if (oldtup != NULL &&
+			((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_old_table)))
 		{
 			Tuplestorestate *old_tuplestore;
 
-			Assert(oldtup != NULL);
 			old_tuplestore = transition_capture->tcs_private->old_tuplestore;
 
 			if (map != NULL)
@@ -5481,12 +5504,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			else
 				tuplestore_puttuple(old_tuplestore, oldtup);
 		}
-		if ((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_new_table))
+		if (newtup != NULL &&
+			((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
+			(event == TRIGGER_EVENT_UPDATE && update_new_table)))
 		{
 			Tuplestorestate *new_tuplestore;
 
-			Assert(newtup != NULL);
 			new_tuplestore = transition_capture->tcs_private->new_tuplestore;
 
 			if (original_insert_tuple != NULL)
@@ -5502,11 +5525,17 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				tuplestore_puttuple(new_tuplestore, newtup);
 		}
 
-		/* If transition tables are the only reason we're here, return. */
+		/*
+		 * If transition tables are the only reason we're here, return. As
+		 * mentioned above, we can also be here during update tuple routing in
+		 * presence of transition tables, in which case this function is called
+		 * separately for oldtup and newtup, so either can be NULL, not both.
+		 */
 		if (trigdesc == NULL ||
 			(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
 			(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+			(event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
 			return;
 	}
 
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 58ec51c..20c7f34 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -60,9 +60,23 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
+	ResultRelInfo *leaf_part_arr = NULL,
+				  *update_rri = NULL;
+	int			num_update_rri = 0,
+				update_rri_index = 0;
+	bool		is_update = false;
 	PartitionTupleRouting *ptr;
 
+	/* Initialization specific to update */
+	if (mtstate && mtstate->operation == CMD_UPDATE)
+	{
+		ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+
+		is_update = true;
+		update_rri = mtstate->resultRelInfo;
+		num_update_rri = list_length(node->plans);
+	}
+
 	/*
 	 * Get the information about the partition tree after locking all the
 	 * partitions.
@@ -79,6 +93,44 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 		(TupleConversionMap **) palloc0(ptr->num_partitions *
 										sizeof(TupleConversionMap *));
 
+	if (is_update)
+	{
+		/*
+		 * For updates, if the leaf partition is already present in the
+		 * per-subplan result rels, we re-use that rather than initialize a new
+		 * result rel. The per-subplan resultrels and the resultrels of the
+		 * leaf partitions are both in the same canonical order. So while going
+		 * through the leaf partition oids, we need to keep track of the next
+		 * per-subplan result rel to be looked for in the leaf partition
+		 * resultrels. So, set update_rri_index to the first per-subplan result
+		 * rel, and then shift it as we find them one by one while scanning the
+		 * leaf partition oids.
+		 */
+		update_rri_index = 0;
+
+		/*
+		 * Prepare for generating the mapping from subplan result rels to leaf
+		 * partition position.
+		 */
+		ptr->subplan_partition_offsets = palloc(num_update_rri * sizeof(int));
+
+		/*
+		 * For UPDATEs, we require an additional tuple slot for storing
+		 * transient tuples that are converted to the root table descriptor.
+		 */
+		ptr->root_tuple_slot = MakeTupleTableSlot();
+	}
+	else
+	{
+		/*
+		 * For inserts, we need to create all new result rels, so avoid
+		 * repeated pallocs by allocating memory for all the result rels in
+		 * bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(ptr->num_partitions *
+												  sizeof(ResultRelInfo));
+	}
+
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
 	 * given partition's rowtype.  It is attached to the caller-specified node
@@ -87,20 +139,67 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 	 */
 	ptr->partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = (ResultRelInfo *) palloc0(ptr->num_partitions *
-											  sizeof(ResultRelInfo));
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (is_update)
+		{
+			/* Is this leaf partition present in the update resultrel? */
+			if (update_rri_index < num_update_rri &&
+				RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
+			{
+				leaf_part_rri = &update_rri[update_rri_index];
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required when we convert the partition's tuple to be
+				 * compatible with the root partitioned table's tuple
+				 * descriptor.  When generating the per-subplan UPDATE result
+				 * rels, this was not set.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+
+				/*
+				 * Save the position of this update rel in the leaf partitions
+				 * array
+				 */
+				ptr->subplan_partition_offsets[update_rri_index] = i;
+
+				update_rri_index++;
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = leaf_part_arr + i;
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in ptr->partitions are eventually
-		 * closed by the caller.
+		 * If we didn't open the partition rel, it means we haven't initialized
+		 * the result rel either.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above including the leaf
+			 * partitions. Note that each of the newly opened relations in
+			 * ptr->partitions are eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri,
+							  partrel,
+							  resultRTindex,
+							  rel,
+							  estate->es_instrument);
+		}
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
@@ -110,14 +209,10 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 		ptr->partition_tupconv_maps[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
 													 gettext_noop("could not convert row type"));
 
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  resultRTindex,
-						  rel,
-						  estate->es_instrument);
-
 		/*
-		 * Verify result relation is a valid target for INSERT.
+		 * Verify result relation is a valid target for insert operation. Even
+		 * for updates, we are doing this for tuple-routing, so again, we need
+		 * to check the validity for insert operation.
 		 */
 		CheckValidResultRel(leaf_part_rri, CMD_INSERT);
 
@@ -137,9 +232,15 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		ptr->partitions[i] = leaf_part_rri++;
+		ptr->partitions[i] = leaf_part_rri;
 		i++;
 	}
+
+	/*
+	 * For UPDATE, we should have found all the per-subplan resultrels in the
+	 * leaf partitions.
+	 */
+	Assert(!is_update || update_rri_index == num_update_rri);
 }
 
 /*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index deb0810..713a362 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -46,6 +46,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -63,7 +64,16 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 EState *estate,
 					 bool canSetTag,
 					 TupleTableSlot **returning);
-
+static void ExecSetupChildParentMap(ModifyTableState *mtstate,
+						ResultRelInfo *rootRelInfo,
+						int numResultRelInfos, bool perleaf);
+static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
+													int whichplan);
+static HeapTuple ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+										   TupleConversionMap *map,
+										   HeapTuple tuple,
+										   TupleTableSlot *new_slot,
+										   TupleTableSlot **p_old_slot);
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
  * target relation's rowtype
@@ -241,6 +251,38 @@ ExecCheckTIDVisible(EState *estate,
 	ReleaseBuffer(buffer);
 }
 
+/*
+ * ConvertPartitionTupleSlot -- convenience function for tuple conversion using
+ * 'map'. The tuple, if converted, is stored in 'new_slot', and 'p_my_slot' is
+ * updated with the 'new_slot'. 'new_slot' typically should be one of the
+ * dedicated partition tuple slots. If map is NULL, keeps p_my_slot unchanged.
+ *
+ * Returns the converted tuple, unless map is NULL, in which case original
+ * tuple is returned unmodified.
+ */
+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+						  TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot,
+						  TupleTableSlot **p_my_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor, as per converted tuple.
+	 */
+	*p_my_slot = new_slot;
+	Assert(new_slot != NULL);
+	ExecSetSlotDescriptor(new_slot, map->outdesc);
+	ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
 /* ----------------------------------------------------------------
  *		ExecInsert
  *
@@ -266,6 +308,9 @@ ExecInsert(ModifyTableState *mtstate,
 	Oid			newId;
 	List	   *recheckIndexes = NIL;
 	TupleTableSlot *result = NULL;
+	TransitionCaptureState *transition_capture;
+
+	transition_capture = mtstate->mt_transition_capture;
 
 	/*
 	 * get the heap tuple out of the tuple table slot, making sure we have a
@@ -283,7 +328,6 @@ ExecInsert(ModifyTableState *mtstate,
 	{
 		int			leaf_part_index;
 		PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;
-		TupleConversionMap *map;
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
@@ -332,8 +376,10 @@ ExecInsert(ModifyTableState *mtstate,
 				 * back to tuplestore format.
 				 */
 				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+
+				Assert(mtstate->mt_is_tupconv_perpart == true);
 				mtstate->mt_transition_capture->tcs_map =
-					mtstate->mt_transition_tupconv_maps[leaf_part_index];
+					mtstate->mt_childparent_tupconv_maps[leaf_part_index];
 			}
 			else
 			{
@@ -346,30 +392,21 @@ ExecInsert(ModifyTableState *mtstate,
 			}
 		}
 		if (mtstate->mt_oc_transition_capture != NULL)
+		{
+			Assert(mtstate->mt_is_tupconv_perpart == true);
 			mtstate->mt_oc_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[leaf_part_index];
+				mtstate->mt_childparent_tupconv_maps[leaf_part_index];
+		}
 
 		/*
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = ptr->partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = ptr->partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(mtstate,
+										  ptr->partition_tupconv_maps[leaf_part_index],
+										  tuple,
+										  ptr->partition_tuple_slot,
+										  &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -450,6 +487,7 @@ ExecInsert(ModifyTableState *mtstate,
 	}
 	else
 	{
+		WCOKind		wco_kind;
 		/*
 		 * We always check the partition constraint, including when the tuple
 		 * got here via tuple-routing.  However we don't need to in the latter
@@ -467,14 +505,21 @@ ExecInsert(ModifyTableState *mtstate,
 		tuple->t_tableOid = RelationGetRelid(resultRelationDesc);
 
 		/*
-		 * Check any RLS INSERT WITH CHECK policies
+		 * Check any RLS WITH CHECK policies.
 		 *
+		 * Normally we should check INSERT policies. But if this insert is
+		 * part of update row movement, we should instead check UPDATE
+		 * policies, because we are executing the policies defined on the
+		 * target table, not those defined on the child partitions.
+		 */
+		wco_kind = (mtstate->operation == CMD_UPDATE) ?
+					WCO_RLS_UPDATE_CHECK : WCO_RLS_INSERT_CHECK;
+		/*
 		 * ExecWithCheckOptions() will skip any WCOs which are not of the kind
 		 * we are looking for at this point.
 		 */
 		if (resultRelInfo->ri_WithCheckOptions != NIL)
-			ExecWithCheckOptions(WCO_RLS_INSERT_CHECK,
-								 resultRelInfo, slot, estate);
+			ExecWithCheckOptions(wco_kind, resultRelInfo, slot, estate);
 
 		/*
 		 * No need though if the tuple has been routed, and a BR trigger
@@ -623,9 +668,32 @@ ExecInsert(ModifyTableState *mtstate,
 		setLastTid(&(tuple->t_self));
 	}
 
+	/*
+	 * If this INSERT is part of a partition-key UPDATE and we are capturing
+	 * transition tables, put this row into the transition NEW TABLE.
+	 * (Similarly, ExecDelete() puts the deleted row into the OLD TABLE.)
+	 * We need to do this separately for DELETE and INSERT because they
+	 * happen on different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_new_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo, NULL,
+							 NULL,
+							 tuple,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * Now that the NEW TABLE row has been captured, the AR INSERT
+		 * triggers below must not capture it again, so clear transition_capture.
+		 */
+		transition_capture = NULL;
+	}
+
 	/* AFTER ROW INSERT Triggers */
 	ExecARInsertTriggers(estate, resultRelInfo, tuple, recheckIndexes,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
 	list_free(recheckIndexes);
 
@@ -679,6 +747,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool *tuple_deleted,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -686,6 +756,12 @@ ExecDelete(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
+	TransitionCaptureState *transition_capture;
+
+	transition_capture = mtstate->mt_transition_capture;
+
+	if (tuple_deleted)
+		*tuple_deleted = false;
 
 	/*
 	 * get information on the (current) result relation
@@ -850,12 +926,39 @@ ldelete:;
 	if (canSetTag)
 		(estate->es_processed)++;
 
+	/* Tell the caller that the delete actually happened. */
+	if (tuple_deleted)
+		*tuple_deleted = true;
+
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition OLD TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_old_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo,
+							 tupleid,
+							 oldtuple,
+							 NULL,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * Now that the OLD TABLE row has been captured, the AR DELETE
+		 * triggers below must not capture it again, so clear transition_capture.
+		 */
+		transition_capture = NULL;
+	}
+
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -948,6 +1051,7 @@ ExecUpdate(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	TupleConversionMap *saved_tcs_map = NULL;
 
 	/*
 	 * abort the operation if not running transactions
@@ -1019,6 +1123,7 @@ ExecUpdate(ModifyTableState *mtstate,
 	else
 	{
 		LockTupleMode lockmode;
+		bool		partition_constraint_failed;
 
 		/*
 		 * Constraints might reference the tableoid column, so initialize
@@ -1034,22 +1139,142 @@ ExecUpdate(ModifyTableState *mtstate,
 		 * (We don't need to redo triggers, however.  If there are any BEFORE
 		 * triggers then trigger.c will have done heap_lock_tuple to lock the
 		 * correct tuple, so there's no need to do them again.)
-		 *
-		 * ExecWithCheckOptions() will skip any WCOs which are not of the kind
-		 * we are looking for at this point.
 		 */
 lreplace:;
-		if (resultRelInfo->ri_WithCheckOptions != NIL)
+
+		/*
+		 * If the partition constraint fails, this row might get moved to
+		 * another partition, in which case we should check the RLS CHECK
+		 * policy just before inserting into the new partition rather than
+		 * here, because a trigger on that partition might change the row
+		 * again.  So skip the WCO checks if the partition constraint fails.
+		 */
+		partition_constraint_failed =
+			resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate);
+
+		if (!partition_constraint_failed &&
+			resultRelInfo->ri_WithCheckOptions != NIL)
+		{
+			/*
+			 * ExecWithCheckOptions() will skip any WCOs which are not of the
+			 * kind we are looking for at this point.
+			 */
 			ExecWithCheckOptions(WCO_RLS_UPDATE_CHECK,
 								 resultRelInfo, slot, estate);
+		}
+
+		/*
+		 * If a partition check failed, try to move the row into the right
+		 * partition.
+		 */
+		if (partition_constraint_failed)
+		{
+			bool		tuple_deleted;
+			TupleTableSlot *ret_slot;
+			PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;
+			int			map_index;
+			TupleConversionMap *tupconv_map;
+
+			/*
+			 * When an UPDATE is run directly on a leaf partition, we do not
+			 * have partition tuple routing set up.  In that case, fail with
+			 * the partition constraint violation error.
+			 */
+			if (ptr == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for the DELETE; we want to return
+			 * rows from the INSERT instead.
+			 */
+			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
+					   &tuple_deleted, false, false);
+
+			/*
+			 * If the DELETE didn't happen for some reason (e.g. a trigger
+			 * prevented it, or the row was deleted by this command itself or
+			 * concurrently by another transaction), then we should skip the
+			 * INSERT as well; otherwise we would effectively insert a new row.
+			 *
+			 * For a normal UPDATE, the case where the tuple has been the
+			 * subject of a concurrent UPDATE or DELETE would be handled by
+			 * the EvalPlanQual machinery, but for an UPDATE that we've
+			 * translated into a DELETE from this partition and an INSERT into
+			 * some other partition, that's not available, because CTID chains
+			 * can't span relation boundaries.  We mimic the semantics to a
+			 * limited extent by skipping the INSERT if the DELETE fails to
+			 * find a tuple. This ensures that two concurrent attempts to
+			 * UPDATE the same tuple at the same time can't turn one tuple
+			 * into two, and that an UPDATE of a just-deleted tuple can't
+			 * resurrect it.
+			 */
+			if (!tuple_deleted)
+				return NULL;
+
+			/*
+			 * UPDATEs set the transition capture map only when a new subplan
+			 * is chosen.  But for INSERTs, it is set for each row.  So after
+			 * the INSERT, we need to revert to the map created for UPDATE;
+			 * otherwise the next UPDATE will incorrectly use the one created
+			 * for INSERT.  So first save the one created for UPDATE.
+			 */
+			if (mtstate->mt_transition_capture)
+				saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
+
+			/*
+			 * resultRelInfo is one of the per-subplan resultRelInfos.  So we
+			 * should convert the tuple into root's tuple descriptor, since
+			 * ExecInsert() starts the search from root.  The tuple conversion
+			 * map list is in the order of mtstate->resultRelInfo[], so to
+			 * retrieve the one for this resultRel, we need to know the
+			 * position of the resultRel in mtstate->resultRelInfo[].
+			 */
+			map_index = resultRelInfo - mtstate->resultRelInfo;
+			Assert(map_index >= 0 && map_index < mtstate->mt_nplans);
+			tupconv_map = tupconv_map_for_subplan(mtstate, map_index);
+			tuple = ConvertPartitionTupleSlot(mtstate,
+											  tupconv_map,
+											  tuple,
+											  ptr->root_tuple_slot,
+											  &slot);
+
+
+			/*
+			 * For ExecInsert(), make it look like we are inserting into the
+			 * root.
+			 */
+			Assert(mtstate->rootResultRelInfo != NULL);
+			estate->es_result_relation_info = mtstate->rootResultRelInfo;
+
+			ret_slot = ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+
+			/*
+			 * Restore the active result relation and the active transition
+			 * capture map that we changed above.
+			 */
+			estate->es_result_relation_info = resultRelInfo;
+			if (mtstate->mt_transition_capture)
+			{
+				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+				mtstate->mt_transition_capture->tcs_map = saved_tcs_map;
+			}
+			return ret_slot;
+		}
 
 		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
 		 * We've already checked the partition constraint above, but the tuple
 		 * must still pass all other constraints, so call ExecConstraints()
 		 * to validate the remaining checks.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate, true);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1477,7 +1702,6 @@ static void
 ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 {
 	ResultRelInfo *targetRelInfo = getASTriggerResultRelInfo(mtstate);
-	int			i;
 
 	/* Check for transition tables on the directly targeted relation. */
 	mtstate->mt_transition_capture =
@@ -1507,55 +1731,142 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 							 ptr->num_partitions :
 							 mtstate->mt_nplans);
 
+		ExecSetupChildParentMap(mtstate, targetRelInfo, numResultRelInfos,
+								(ptr != NULL));
+
 		/*
-		 * Build array of conversion maps from each child's TupleDesc to the
-		 * one used in the tuplestore.  The map pointers may be NULL when no
-		 * conversion is necessary, which is hopefully a common case for
-		 * partitions.
+		 * Install the conversion map for the first plan for UPDATE and DELETE
+		 * operations.  It will be advanced each time we switch to the next
+		 * plan.  (INSERT operations set it every time, so we need not update
+		 * mtstate->mt_oc_transition_capture here.)
 		 */
-		mtstate->mt_transition_tupconv_maps = (TupleConversionMap **)
-			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+		if (mtstate->mt_transition_capture && mtstate->operation != CMD_INSERT)
+			mtstate->mt_transition_capture->tcs_map =
+				tupconv_map_for_subplan(mtstate, 0);
+	}
+}
 
-		/* Choose the right set of partitions */
-		if (ptr != NULL)
-		{
-			/*
-			 * For tuple routing among partitions, we need TupleDescs based on
-			 * the partition routing table.
-			 */
-			ResultRelInfo **resultRelInfos = ptr->partitions;
+/*
+ * Initialize the child-to-root tuple conversion map array.
+ *
+ * This map array is required for two purposes:
+ * 1. For update tuple routing, we need to convert the tuple from the subplan
+ * result rel to the root partitioned table descriptor.
+ * 2. For capturing transition tables that are partitions.  For UPDATEs, we
+ * need to convert the tuple from the subplan result rel to the target table
+ * descriptor, and for INSERTs, we need to convert the inserted tuple from the
+ * leaf partition to the target table descriptor.
+ *
+ * The caller can request either a per-subplan map or per-leaf-partition map.
+ */
+static void
+ExecSetupChildParentMap(ModifyTableState *mtstate,
+						ResultRelInfo *rootRelInfo,
+						int numResultRelInfos, bool perleaf)
+{
+	TupleDesc	outdesc;
+	int			i;
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
-		}
-		else
-		{
-			/* Otherwise we need the ResultRelInfo for each subplan. */
-			ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+	/* First check if there is already one */
+	if (mtstate->mt_childparent_tupconv_maps)
+	{
+		/*
+		 * If a per-leaf map is requested and a map has already been created,
+		 * that existing map must be per-leaf.  If it were per-subplan, we
+		 * would not be able to access the maps leaf-partition-wise.  A
+		 * per-leaf map, on the other hand, can also be accessed subplan-wise
+		 * through the subplan_partition_offsets array, using
+		 * tupconv_map_for_subplan().  So if a caller might need to access
+		 * the maps both leaf-partition-wise and subplan-wise, it must make
+		 * sure that the first call to this function is made with
+		 * perleaf=true, so that the map created is per-leaf, not
+		 * per-subplan.
+		 */
+		Assert(!perleaf || mtstate->mt_is_tupconv_perpart);
+		return;
+	}
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
+	/* Get tuple descriptor of the root partitioned table. */
+	outdesc = RelationGetDescr(rootRelInfo->ri_RelationDesc);
+
+	/*
+	 * Build array of conversion maps from each child's TupleDesc to the
+	 * one used in the tuplestore.  The map pointers may be NULL when no
+	 * conversion is necessary, which is hopefully a common case for
+	 * partitions.
+	 */
+	mtstate->mt_childparent_tupconv_maps = (TupleConversionMap **)
+		palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+
+	/* Choose the right set of partitions */
+	if (perleaf)
+	{
+		/*
+		 * For tuple routing among partitions, we need TupleDescs based on the
+		 * partition routing table.
+		 */
+		ResultRelInfo **resultRelInfos;
+
+		Assert(mtstate->mt_partition_tuple_routing != NULL);
+		resultRelInfos = mtstate->mt_partition_tuple_routing->partitions;
+
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_childparent_tupconv_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
+									   outdesc,
+									   gettext_noop("could not convert row type"));
 		}
 
 		/*
-		 * Install the conversion map for the first plan for UPDATE and DELETE
-		 * operations.  It will be advanced each time we switch to the next
-		 * plan.  (INSERT operations set it every time, so we need not update
-		 * mtstate->mt_oc_transition_capture here.)
+		 * Record that the tuple conversion map is per-leaf, not
+		 * per-subplan.
 		 */
-		if (mtstate->mt_transition_capture)
-			mtstate->mt_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[0];
+		mtstate->mt_is_tupconv_perpart = true;
+	}
+	else
+	{
+		/* Otherwise we need the ResultRelInfo for each subplan. */
+		ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_childparent_tupconv_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
+									   outdesc,
+									   gettext_noop("could not convert row type"));
+		}
+	}
+
+}
+
+/*
+ * For a given subplan index, get the tuple conversion map.
+ */
+static TupleConversionMap *
+tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
+{
+	Assert(mtstate->mt_childparent_tupconv_maps != NULL);
+
+	/*
+	 * If the tuple conversion map array is per-partition, we need to first get
+	 * the index into the partition array.
+	 */
+	if (mtstate->mt_is_tupconv_perpart)
+	{
+		int			leaf_index;
+		PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;
+
+		Assert(ptr && ptr->subplan_partition_offsets != NULL);
+		leaf_index = ptr->subplan_partition_offsets[whichplan];
+
+		Assert(leaf_index >= 0 && leaf_index < ptr->num_partitions);
+		return mtstate->mt_childparent_tupconv_maps[leaf_index];
+	}
+	else
+	{
+		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+		return mtstate->mt_childparent_tupconv_maps[whichplan];
 	}
 }
 
@@ -1662,15 +1973,13 @@ ExecModifyTable(PlanState *pstate)
 				/* Prepare to convert transition tuples from this child. */
 				if (node->mt_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
 					node->mt_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						tupconv_map_for_subplan(node, node->mt_whichplan);
 				}
 				if (node->mt_oc_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
 					node->mt_oc_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						tupconv_map_for_subplan(node, node->mt_whichplan);
 				}
 				continue;
 			}
@@ -1787,7 +2096,8 @@ ExecModifyTable(PlanState *pstate)
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1832,9 +2142,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ResultRelInfo *resultRelInfo;
 	TupleDesc	tupDesc;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	bool		update_tuple_routing_needed = node->partKeyUpdated;
 	PartitionTupleRouting *ptr = NULL;
 	int			num_partitions = 0;
 
@@ -1909,6 +2222,16 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(resultRelInfo, mtstate->mt_onconflict != ONCONFLICT_NONE);
 
+		/*
+		 * If this is an UPDATE and a BEFORE UPDATE trigger is present, the
+		 * trigger itself might modify the partition-key values, so arrange
+		 * for update tuple routing.
+		 */
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+			operation == CMD_UPDATE)
+			update_tuple_routing_needed = true;
+
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
 		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
@@ -1946,9 +2269,19 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/*
+	 * If it's not a partitioned table after all, UPDATE tuple routing should
+	 * not be attempted.
+	 */
+	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		update_tuple_routing_needed = false;
+
+	/*
+	 * Build state for tuple routing if it's an INSERT, or an UPDATE of the
+	 * partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || update_tuple_routing_needed))
 	{
 		ExecSetupPartitionTupleRouting(mtstate,
 									   rel,
@@ -1958,6 +2291,13 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 
 		ptr = mtstate->mt_partition_tuple_routing;
 		num_partitions = ptr->num_partitions;
+
+		/*
+		 * These are required as reference objects for mapping partition
+		 * attnos in expressions such as WithCheckOptions and RETURNING.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	}
 
 	/*
@@ -1968,6 +2308,18 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		ExecSetupTransitionCaptureState(mtstate, estate);
 
 	/*
+	 * Construct a mapping from each of the per-subplan partition attnos to
+	 * the root attno.  This is required when, during update row movement, the
+	 * tuple descriptor of a source partition does not match the root
+	 * partitioned table descriptor; we must then convert tuples to the root
+	 * tuple descriptor, because the search for the destination partition
+	 * starts from the root.  Skip this setup if it's not a partition-key update.
+	 */
+	if (update_tuple_routing_needed)
+		ExecSetupChildParentMap(mtstate, getASTriggerResultRelInfo(mtstate),
+								mtstate->mt_nplans, false);
+
+	/*
 	 * Initialize any WITH CHECK OPTION constraints if needed.
 	 */
 	resultRelInfo = mtstate->resultRelInfo;
@@ -1997,26 +2349,29 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE row
+	 * movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *first_wcoList;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition.  For UPDATE, on the other hand, there are as many WCO
+		 * lists as there are plans.  In either case, use the WCO expressions
+		 * of the first resultRelInfo as a reference to compute attnos for the
+		 * WCO expressions of each of the partitions.  We make a copy of the
+		 * WCO qual for each partition.  Note that, if there are SubPlans in
+		 * there, they all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
-			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
+		Assert(update_tuple_routing_needed ||
+			   (operation == CMD_INSERT &&
+				list_length(node->withCheckOptionLists) == 1 &&
+				mtstate->mt_nplans == 1));
+
+		first_wcoList = linitial(node->withCheckOptionLists);
 		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
@@ -2025,17 +2380,26 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			ListCell   *ll;
 
 			resultRelInfo = ptr->partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the update
+			 * result rels, that result rel would already have WithCheckOptions
+			 * initialized.
+			 */
+			if (resultRelInfo->ri_WithCheckOptions)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel, NULL);
+			mapped_wcoList = map_partition_varattnos(first_wcoList,
+													firstVarno,
+													partrel, firstResultRel,
+													NULL);
 			foreach(ll, mapped_wcoList)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
@@ -2052,7 +2416,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -2089,22 +2453,35 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *rlist;
 
 			resultRelInfo = ptr->partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the update
+			 * result rels, that result rel would already have a returningList
+			 * built.
+			 */
+			if (resultRelInfo->ri_projectReturning)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel, NULL);
+			/*
+			 * Use the RETURNING expressions of the first resultRelInfo as a
+			 * reference to compute attnos for the RETURNING expressions of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList,
+											firstVarno,
+											partrel, firstResultRel, NULL);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
@@ -2349,6 +2726,7 @@ void
 ExecEndModifyTable(ModifyTableState *node)
 {
 	int			i;
+	CmdType		operation = node->operation;
 
 	/*
 	 * Allow any FDWs to shut down
@@ -2386,11 +2764,24 @@ ExecEndModifyTable(ModifyTableState *node)
 		for (i = 0; i < ptr->num_partitions; i++)
 		{
 			ResultRelInfo *resultRelInfo = ptr->partitions[i];
+
+			/*
+			 * If this result rel is one of the subplan result rels, let
+			 * ExecEndPlan() close it. For INSERTs, this does not apply because
+			 * leaf partition result rels are always newly allocated.
+			 */
+			if (operation == CMD_UPDATE &&
+				resultRelInfo >= node->resultRelInfo &&
+				resultRelInfo < node->resultRelInfo + node->mt_nplans)
+				continue;
+
 			ExecCloseIndices(resultRelInfo);
 			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 		}
 
-		/* Release the standalone partition tuple descriptor, if any */
+		/* Release the standalone partition tuple descriptors, if any */
+		if (ptr->root_tuple_slot)
+			ExecDropSingleTupleTableSlot(ptr->root_tuple_slot);
 		if (ptr->partition_tuple_slot)
 			ExecDropSingleTupleTableSlot(ptr->partition_tuple_slot);
 	}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index b1515dd..988ea00 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -204,6 +204,7 @@ _copyModifyTable(const ModifyTable *from)
 	COPY_SCALAR_FIELD(canSetTag);
 	COPY_SCALAR_FIELD(nominalRelation);
 	COPY_NODE_FIELD(partitioned_rels);
+	COPY_SCALAR_FIELD(partKeyUpdated);
 	COPY_NODE_FIELD(resultRelations);
 	COPY_SCALAR_FIELD(resultRelIndex);
 	COPY_SCALAR_FIELD(rootResultRelIndex);
@@ -2262,6 +2263,7 @@ _copyPartitionedChildRelInfo(const PartitionedChildRelInfo *from)
 
 	COPY_SCALAR_FIELD(parent_relid);
 	COPY_NODE_FIELD(child_rels);
+	COPY_SCALAR_FIELD(is_partition_key_update);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 2e869a9..b4b7639 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -908,6 +908,7 @@ _equalPartitionedChildRelInfo(const PartitionedChildRelInfo *a, const Partitione
 {
 	COMPARE_SCALAR_FIELD(parent_relid);
 	COMPARE_NODE_FIELD(child_rels);
+	COMPARE_SCALAR_FIELD(is_partition_key_update);
 
 	return true;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index b59a521..78a367d 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -372,6 +372,7 @@ _outModifyTable(StringInfo str, const ModifyTable *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(partKeyUpdated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_INT_FIELD(resultRelIndex);
 	WRITE_INT_FIELD(rootResultRelIndex);
@@ -2104,6 +2105,7 @@ _outModifyTablePath(StringInfo str, const ModifyTablePath *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(partKeyUpdated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_NODE_FIELD(subroots);
@@ -2526,6 +2528,7 @@ _outPartitionedChildRelInfo(StringInfo str, const PartitionedChildRelInfo *node)
 
 	WRITE_UINT_FIELD(parent_relid);
 	WRITE_NODE_FIELD(child_rels);
+	WRITE_BOOL_FIELD(is_partition_key_update);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 0d17ae8..e2c27e0 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1568,6 +1568,7 @@ _readModifyTable(void)
 	READ_BOOL_FIELD(canSetTag);
 	READ_UINT_FIELD(nominalRelation);
 	READ_NODE_FIELD(partitioned_rels);
+	READ_BOOL_FIELD(partKeyUpdated);
 	READ_NODE_FIELD(resultRelations);
 	READ_INT_FIELD(resultRelIndex);
 	READ_INT_FIELD(rootResultRelIndex);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 0e8463e..be0d162 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1364,7 +1364,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 			case RTE_RELATION:
 				if (rte->relkind == RELKIND_PARTITIONED_TABLE)
 					partitioned_rels =
-						get_partitioned_child_rels(root, rel->relid);
+						get_partitioned_child_rels(root, rel->relid, NULL);
 				break;
 			case RTE_SUBQUERY:
 				build_partitioned_rels = true;
@@ -1403,7 +1403,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		{
 			List	   *cprels;
 
-			cprels = get_partitioned_child_rels(root, childrel->relid);
+			cprels = get_partitioned_child_rels(root, childrel->relid, NULL);
 			partitioned_rels = list_concat(partitioned_rels,
 										   list_copy(cprels));
 		}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index f6c83d0..38c429d 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -279,6 +279,7 @@ static ProjectSet *make_project_set(List *tlist, Plan *subplan);
 static ModifyTable *make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool partKeyUpdated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
@@ -2373,6 +2374,7 @@ create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 							best_path->canSetTag,
 							best_path->nominalRelation,
 							best_path->partitioned_rels,
+							best_path->partKeyUpdated,
 							best_path->resultRelations,
 							subplans,
 							best_path->withCheckOptionLists,
@@ -6432,6 +6434,7 @@ static ModifyTable *
 make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool partKeyUpdated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam)
@@ -6458,6 +6461,7 @@ make_modifytable(PlannerInfo *root,
 	node->canSetTag = canSetTag;
 	node->nominalRelation = nominalRelation;
 	node->partitioned_rels = partitioned_rels;
+	node->partKeyUpdated = partKeyUpdated;
 	node->resultRelations = resultRelations;
 	node->resultRelIndex = -1;	/* will be set correctly in setrefs.c */
 	node->rootResultRelIndex = -1;	/* will be set correctly in setrefs.c */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 382791f..8b37609 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1101,6 +1101,7 @@ inheritance_planner(PlannerInfo *root)
 	Query	   *parent_parse;
 	Bitmapset  *parent_relids = bms_make_singleton(top_parentRTindex);
 	PlannerInfo **parent_roots = NULL;
+	bool		partColsUpdated = false;
 
 	Assert(parse->commandType != CMD_INSERT);
 
@@ -1172,7 +1173,8 @@ inheritance_planner(PlannerInfo *root)
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		nominalRelation = top_parentRTindex;
-		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex);
+		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex,
+													  &partColsUpdated);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
 	}
@@ -1512,6 +1514,7 @@ inheritance_planner(PlannerInfo *root)
 									 parse->canSetTag,
 									 nominalRelation,
 									 partitioned_rels,
+									 partColsUpdated,
 									 resultRelations,
 									 subpaths,
 									 subroots,
@@ -2123,6 +2126,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 										parse->canSetTag,
 										parse->resultRelation,
 										NIL,
+										false,
 										list_make1_int(parse->resultRelation),
 										list_make1(path),
 										list_make1(root),
@@ -6155,17 +6159,22 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 /*
  * get_partitioned_child_rels
  *		Returns a list of the RT indexes of the partitioned child relations
- *		with rti as the root parent RT index.
+ *		with rti as the root parent RT index. Also sets is_partition_key_update
+ *		to true if any of the root rte's updated columns is a partition key.
  *
  * Note: This function might get called even for range table entries that
  * are not partitioned tables; in such a case, it will simply return NIL.
  */
 List *
-get_partitioned_child_rels(PlannerInfo *root, Index rti)
+get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   bool *is_partition_key_update)
 {
 	List	   *result = NIL;
 	ListCell   *l;
 
+	if (is_partition_key_update)
+		*is_partition_key_update = false;
+
 	foreach(l, root->pcinfo_list)
 	{
 		PartitionedChildRelInfo *pc = lfirst_node(PartitionedChildRelInfo, l);
@@ -6173,6 +6182,8 @@ get_partitioned_child_rels(PlannerInfo *root, Index rti)
 		if (pc->parent_relid == rti)
 		{
 			result = pc->child_rels;
+			if (is_partition_key_update)
+				*is_partition_key_update = pc->is_partition_key_update;
 			break;
 		}
 	}
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index a24e8ac..c6e1b9e 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -105,7 +105,8 @@ static void expand_partitioned_rtentry(PlannerInfo *root,
 						   RangeTblEntry *parentrte,
 						   Index parentRTindex, Relation parentrel,
 						   PlanRowMark *top_parentrc, LOCKMODE lockmode,
-						   List **appinfos, List **partitioned_child_rels);
+						   List **appinfos, List **partitioned_child_rels,
+						   bool *is_partition_key_update);
 static void expand_single_inheritance_child(PlannerInfo *root,
 								RangeTblEntry *parentrte,
 								Index parentRTindex, Relation parentrel,
@@ -1467,16 +1468,19 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	if (RelationGetPartitionDesc(oldrelation) != NULL)
 	{
 		List	   *partitioned_child_rels = NIL;
+		bool		is_partition_key_update = false;
 
 		Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
 
 		/*
 		 * If this table has partitions, recursively expand them in the order
-		 * in which they appear in the PartitionDesc.
+		 * in which they appear in the PartitionDesc.  While at it, also
+		 * note whether any partition key columns are being updated.
 		 */
 		expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
 								   lockmode, &root->append_rel_list,
-								   &partitioned_child_rels);
+								   &partitioned_child_rels,
+								   &is_partition_key_update);
 
 		/*
 		 * We keep a list of objects in root, each of which maps a root
@@ -1493,6 +1497,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 			pcinfo = makeNode(PartitionedChildRelInfo);
 			pcinfo->parent_relid = rti;
 			pcinfo->child_rels = partitioned_child_rels;
+			pcinfo->is_partition_key_update = is_partition_key_update;
 			root->pcinfo_list = lappend(root->pcinfo_list, pcinfo);
 		}
 	}
@@ -1569,7 +1574,8 @@ static void
 expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 						   Index parentRTindex, Relation parentrel,
 						   PlanRowMark *top_parentrc, LOCKMODE lockmode,
-						   List **appinfos, List **partitioned_child_rels)
+						   List **appinfos, List **partitioned_child_rels,
+						   bool *is_partition_key_update)
 {
 	int			i;
 	RangeTblEntry *childrte;
@@ -1584,6 +1590,17 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 
 	Assert(parentrte->inh);
 
+	/*
+	 * Note down whether any partition key cols are being updated. Though it's
+	 * the root partitioned table's updatedCols we are interested in, we
+	 * instead use parentrte to get the updatedCols. This is convenient because
+	 * parentrte already has the root partrel's updatedCols translated to match
+	 * the attribute ordering of parentrel.
+	 */
+	if (!*is_partition_key_update)
+		*is_partition_key_update =
+			has_partition_attrs(parentrel, parentrte->updatedCols, NULL);
+
 	/* First expand the partitioned table itself. */
 	expand_single_inheritance_child(root, parentrte, parentRTindex, parentrel,
 									top_parentrc, parentrel,
@@ -1623,7 +1640,8 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 		if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 			expand_partitioned_rtentry(root, childrte, childRTindex,
 									   childrel, top_parentrc, lockmode,
-									   appinfos, partitioned_child_rels);
+									   appinfos, partitioned_child_rels,
+									   is_partition_key_update);
 
 		/* Close child relation, but keep locks */
 		heap_close(childrel, NoLock);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 54126fb..ea207fe 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3265,6 +3265,8 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partitioned_rels' is an integer list of RT indexes of non-leaf tables in
  *		the partition tree, if this is an UPDATE/DELETE to a partitioned table.
  *		Otherwise NIL.
+ * 'partKeyUpdated' is true if any partitioning columns are being updated,
+ *		either from the named relation or a descendant partitioned table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
  * 'subpaths' is a list of Path(s) producing source data (one per rel)
  * 'subroots' is a list of PlannerInfo structs (one per rel)
@@ -3278,6 +3280,7 @@ ModifyTablePath *
 create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool partKeyUpdated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
@@ -3345,6 +3348,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->canSetTag = canSetTag;
 	pathnode->nominalRelation = nominalRelation;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
+	pathnode->partKeyUpdated = partKeyUpdated;
 	pathnode->resultRelations = resultRelations;
 	pathnode->subpaths = subpaths;
 	pathnode->subroots = subroots;
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 364d89f..d9fe0e4 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -67,6 +67,9 @@ typedef struct PartitionDispatchData *PartitionDispatch;
  *								convert input tuple based on the root table's
  *								rowtype to a leaf partition's rowtype after
  *								tuple routing is done)
 * subplan_partition_offsets	int array, ordered by UPDATE subplans. Each
 *								element of this array holds the index of the
 *								corresponding partition in the 'partitions' array.
  * partition_tuple_slot			TupleTableSlot to be used to manipulate any
  *								given leaf partition's rowtype after that
  *								partition is chosen for insertion by
@@ -80,7 +83,9 @@ typedef struct PartitionTupleRouting
 	ResultRelInfo **partitions;
 	int			num_partitions;
 	TupleConversionMap **partition_tupconv_maps;
+	int		   *subplan_partition_offsets;
 	TupleTableSlot *partition_tuple_slot;
+	TupleTableSlot *root_tuple_slot;
 } PartitionTupleRouting;
 
 extern void ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 613872a..6082f7b 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -983,8 +983,9 @@ typedef struct ModifyTableState
 	/* controls transition table population for specified operation */
 	struct TransitionCaptureState *mt_oc_transition_capture;
 	/* controls transition table population for INSERT...ON CONFLICT UPDATE */
-	TupleConversionMap **mt_transition_tupconv_maps;
-	/* Per plan/partition tuple conversion */
+	TupleConversionMap **mt_childparent_tupconv_maps;
+	/* Per plan/partition map for tuple conversion from child to root */
+	bool		mt_is_tupconv_perpart;	/* Is the above map per-partition? */
 } ModifyTableState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 02fb366..6fc368a 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -219,6 +219,7 @@ typedef struct ModifyTable
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		partKeyUpdated;	/* some part key in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	int			resultRelIndex; /* index of first resultRel in plan's list */
 	int			rootResultRelIndex; /* index of the partitioned table root */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 1108b6a..197e523 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1673,6 +1673,7 @@ typedef struct ModifyTablePath
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		partKeyUpdated;	/* some part key in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	List	   *subpaths;		/* Path(s) producing source data */
 	List	   *subroots;		/* per-target-table PlannerInfos */
@@ -2123,6 +2124,9 @@ typedef struct PartitionedChildRelInfo
 
 	Index		parent_relid;
 	List	   *child_rels;
+	bool		is_partition_key_update;	/* is the partition key of any of
+											 * the partitioned tables
+											 * updated? */
 } PartitionedChildRelInfo;
 
 /*
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 99f65b4..9b739ec 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -241,6 +241,7 @@ extern ModifyTablePath *create_modifytable_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool partKeyUpdated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index 2801bfd..9f0533c 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -57,7 +57,8 @@ extern Expr *preprocess_phv_expression(PlannerInfo *root, Expr *expr);
 
 extern bool plan_cluster_use_sort(Oid tableOid, Oid indexOid);
 
-extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti);
+extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   bool *is_partition_key_update);
 extern List *get_partitioned_child_rels_for_join(PlannerInfo *root,
 									Relids join_relids);
 
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index b69ceaa..0dfd3a6 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,36 +198,441 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+---------------------------
+-- UPDATE with row movement
+---------------------------
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+create table part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+alter table range_parted attach partition part_b_20_b_30 for values from ('b', 20) to ('b', 30);
+create table part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+alter table range_parted attach partition part_b_10_b_20 for values from ('b', 10) to ('b', 20);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) partition by range (abs(d));
+alter table part_c_100_200 drop column e, drop column c, drop column a;
+alter table part_c_100_200 add column c numeric, add column e varchar, add column a text;
+alter table part_c_100_200 drop column b;
+alter table part_c_100_200 add column b bigint;
+create table part_d_1_15 partition of part_c_100_200 for values from (1) to (15);
+create table part_d_15_20 partition of part_c_100_200 for values from (15) to (20);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+(6 rows)
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+             QUERY PLAN              
+-------------------------------------
+ Update on range_parted
+   Update on part_a_1_a_10
+   Update on part_a_10_a_20
+   Update on part_b_1_b_10
+   Update on part_c_1_100
+   Update on part_d_1_15
+   Update on part_d_15_20
+   Update on part_b_20_b_30
+   ->  Seq Scan on part_a_1_a_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_a_10_a_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_1_b_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_1_100
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_1_15
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_15_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_20_b_30
+         Filter: (c > '97'::numeric)
+(22 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_100_200 set c = c - 20, d = c where c = 105;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (105, 85, null, b, 15).
+-- fail (no partition key update, so no attempt to move tuple, but "a = 'a'" violates partition constraint enforced by root partition)
+update part_b_10_b_20 set a = 'a';
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (null, 1, 96, 12, a).
+-- success; partition key update, no constraint violation
+update range_parted set d = d - 10 where d > 10;
+-- success; no partition key update, no constraint violation
+update range_parted set e = d;
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d | e 
+----------------+---+----+-----+---+---
+ part_a_10_a_20 | a | 10 | 200 | 1 | 1
+ part_a_1_a_10  | a |  1 |   1 | 1 | 1
+ part_d_1_15    | b | 12 | 116 | 1 | 1
+ part_d_1_15    | b | 13 | 117 | 2 | 2
+ part_d_1_15    | b | 15 | 125 | 6 | 6
+ part_d_1_15    | b | 17 | 125 | 9 | 9
+(6 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_d_1_15" violates partition constraint
+DETAIL:  Failing row contains (2, 117, 2, b, 7).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_a_1_a_10 | a |  4 | 200 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+(6 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120, 1, 1).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120, 1, 1).
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+ part_d_1_15   | b | 15 | 199 | 1 | 1
+(6 rows)
+
+-- cleanup
+drop view upview;
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+ range_parted  | a | b  | c  | d  | e 
+---------------+---+----+----+----+---
+ (b,15,95,16,) | b | 15 | 95 | 16 | 
+ (b,17,95,19,) | b | 17 | 95 | 19 | 
+(2 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  95 | 16 | 
+ part_c_1_100   | b | 17 |  95 | 19 | 
+(6 rows)
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,110,1,), (b,13,98,2,), (b,15,106,16,), (b,17,106,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 13 |  98 |  2 | 
+ part_d_15_20   | b | 15 | 106 | 16 | 
+ part_d_15_20   | b | 17 | 106 | 19 | 
+ part_d_1_15    | b | 12 | 110 |  1 | 
+(6 rows)
+
+:init_range_parted;
+-- Enabling OLD TABLE capture for both DELETE and UPDATE statement triggers
+-- should not cause DELETEd rows to be captured twice. Likewise for
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,146,1,), (b,13,147,2,), (b,15,155,16,), (b,17,155,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_15_20   | b | 15 | 155 | 16 | 
+ part_d_15_20   | b | 17 | 155 | 19 | 
+ part_d_1_15    | b | 12 | 146 |  1 | 
+ part_d_1_15    | b | 13 | 147 |  2 | 
+(6 rows)
+
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d1_15 before update or insert on part_d_1_15
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d15_20 before update or insert on part_d_15_20
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 15 |  98 |  2 | 
+ part_d_15_20   | b | 17 | 106 | 16 | 
+ part_d_15_20   | b | 19 | 106 | 19 | 
+ part_d_1_15    | b | 15 | 110 |  1 | 
+(6 rows)
+
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_15_20   | b | 17 | 155 | 16 | 
+ part_d_15_20   | b | 19 | 155 | 19 | 
+ part_d_1_15    | b | 15 | 146 |  1 | 
+ part_d_1_15    | b | 16 | 147 |  2 | 
+(6 rows)
+
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_d1_15 ON part_d_1_15;
+drop trigger trig_d15_20 ON part_d_15_20;
+drop function func_parted_mod_b();
+-- RLS policies with update-row-movement
+-----------------------------------------
+ALTER TABLE range_parted ENABLE ROW LEVEL SECURITY;
+create user regress_range_parted_user;
+GRANT ALL ON range_parted, mintab TO regress_range_parted_user;
+create policy seeall ON range_parted as PERMISSIVE for SELECT using ( true);
+create policy policy_range_parted on range_parted for UPDATE USING (true) WITH CHECK (c % 2 = 0);
+:init_range_parted;
+-- This should fail with an RLS violation error while moving a row from
+-- part_a_10_a_20 to part_d_1_15, because we are setting 'c' to an odd number.
+set session authorization regress_range_parted_user;
+update range_parted set a = 'b', c = 151 where a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy for table "range_parted"
+reset session authorization;
+-- Create a trigger on part_d_1_15
+create function func_d_1_15() returns trigger as $$
+begin
+   NEW.c = NEW.c + 1; -- Make even number odd, or vice versa
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_d_1_15 before insert on part_d_1_15
+   for each row execute procedure func_d_1_15();
+:init_range_parted;
+set session authorization regress_range_parted_user;
+-- Here, RLS checks should succeed while moving a row from part_a_10_a_20 to
+-- part_d_1_15, because the trigger makes the 'c' value an even number.
+update range_parted set a = 'b', c = 151 where a = 'a' and c = 200;
+reset session authorization;
+:init_range_parted;
+set session authorization regress_range_parted_user;
+-- This should fail with an RLS violation error because the trigger makes the
+-- 'c' value an odd number.
+update range_parted set a = 'b', c = 150 where a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy for table "range_parted"
+-- Cleanup
+reset session authorization;
+drop trigger trig_d_1_15 ON part_d_1_15;
+drop function func_d_1_15();
+-- Policy expression contains SubPlan
+reset session authorization;
+:init_range_parted;
+create policy policy_range_parted_subplan on range_parted AS RESTRICTIVE for UPDATE USING (true)
+    WITH CHECK ((SELECT range_parted.c <= c1 FROM mintab ));
+set session authorization regress_range_parted_user;
+-- Should fail because mintab has row with c1 = 120
+update range_parted set a = 'b', c = 122 where a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy "policy_range_parted_subplan" for table "range_parted"
+-- Should pass
+update range_parted set a = 'b', c = 120 where a = 'a' and c = 200;
+-- RLS policy expression contains whole row.
+reset session authorization;
+:init_range_parted;
+create policy policy_range_parted_wholerow on range_parted AS RESTRICTIVE for UPDATE USING (true)
+   WITH CHECK (range_parted = row('b', 10, 112, 1, NULL)::range_parted);
+set session authorization regress_range_parted_user;
+-- Should succeed the RLS check
+update range_parted set a = 'b', c = 112 where a = 'a' and c = 200;
+reset session authorization;
+:init_range_parted;
+set session authorization regress_range_parted_user;
+-- The whole row RLS check should fail
+update range_parted set a = 'b', c = 116 where a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy "policy_range_parted_wholerow" for table "range_parted"
+-- Cleanup
+reset session authorization;
+drop policy policy_range_parted ON range_parted;
+drop policy policy_range_parted_subplan ON range_parted;
+drop policy policy_range_parted_wholerow ON range_parted;
+revoke all ON range_parted, mintab FROM regress_range_parted_user;
+drop user regress_range_parted_user;
+-- statement triggers with update row movement
+---------------------------------------------------
+:init_range_parted;
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_1_15
+create trigger d1_delete_trig
+  after delete on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_update_trig
+  after update on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_insert_trig
+  after insert on part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+create trigger d15_delete_trig
+  after delete on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_update_trig
+  after update on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_insert_trig
+  after insert on part_d_15_20 for each statement execute procedure trigfunc();
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+NOTICE:  trigger = parent_update_trig fired on table range_parted during UPDATE
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 150 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  55 | 16 | 
+ part_c_1_100   | b | 17 |  55 | 19 | 
+(6 rows)
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger d1_delete_trig ON part_d_1_15;
+drop trigger d1_update_trig ON part_d_1_15;
+drop trigger d1_insert_trig ON part_d_1_15;
+drop trigger d15_delete_trig ON part_d_15_20;
+drop trigger d15_update_trig ON part_d_15_20;
+drop trigger d15_insert_trig ON part_d_15_20;
+drop table mintab;
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
-                                  Table "public.part_def"
- Column |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description 
---------+---------+-----------+----------+---------+----------+--------------+-------------
- a      | text    |           |          |         | extended |              | 
- b      | integer |           |          |         | plain    |              | 
+                                       Table "public.part_def"
+ Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description 
+--------+-------------------+-----------+----------+---------+----------+--------------+-------------
+ a      | text              |           |          |         | extended |              | 
+ b      | bigint            |           |          |         | plain    |              | 
+ c      | numeric           |           |          |         | main     |              | 
+ d      | integer           |           |          |         | plain    |              | 
+ e      | character varying |           |          |         | extended |              | 
 Partition of: range_parted DEFAULT
-Partition constraint: (NOT ((a IS NOT NULL) AND (b IS NOT NULL) AND (((a = 'a'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'a'::text) AND (b >= 10) AND (b < 20)) OR ((a = 'b'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'b'::text) AND (b >= 10) AND (b < 20)))))
+Partition constraint: (NOT ((a IS NOT NULL) AND (b IS NOT NULL) AND (((a = 'a'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'a'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'b'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '20'::bigint) AND (b < '30'::bigint)))))
 
 insert into range_parted values ('c', 9);
 -- ok
@@ -235,7 +640,55 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 ERROR:  new row for relation "part_def" violates partition constraint
-DETAIL:  Failing row contains (a, 9).
+DETAIL:  Failing row contains (a, 9, null, null, null).
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from non-default to default partition.
+-- Fail, default partition is not under part_a_10_a_20.
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+ERROR:  new row for relation "part_a_10_a_20" violates partition constraint
+DETAIL:  Failing row contains (ad, 10, 200, 1, null).
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+:show_data;
+ partname | a  | b  |  c  | d  | e 
+----------+----+----+-----+----+---
+ part_def | ad |  1 |   1 |  1 | 
+ part_def | ad | 10 | 200 |  1 | 
+ part_def | bd | 12 |  96 |  1 | 
+ part_def | bd | 13 |  97 |  2 | 
+ part_def | bd | 15 | 105 | 16 | 
+ part_def | bd | 17 | 105 | 19 | 
+ part_def | d  |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
 create table list_parted (
 	a text,
 	b int
@@ -250,6 +703,111 @@ ERROR:  new row for relation "list_default" violates partition constraint
 DETAIL:  Failing row contains (a, 10).
 -- ok
 update list_default set a = 'x' where a = 'd';
+drop table list_parted;
+--------------
+-- UPDATE of partition key or non-partition columns,
+-- with different column orderings,
+-- and triggers.
+--------------
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+-- Test partition constraint violation when an intermediate ancestor is used
+-- and the constraint is inherited from the root.
+update sub_parted set a = 2 where c = 10;
+ERROR:  new row for relation "sub_part2" violates partition constraint
+DETAIL:  Failing row contains (2, 10, 2).
+-- An UPDATE that does not modify the partition key of the partitions chosen for the update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Triggers can cause UPDATE row movement if they modify the partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This changes a partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+drop trigger parted_mod_b ON sub_part1;
+-- If a BR DELETE trigger prevents the DELETE from happening, we should also
+-- skip the INSERT when that DELETE is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+drop trigger trig_skip_delete ON sub_part1;
+-- UPDATE of partition key with a FROM clause. If the join produces multiple
+-- output rows for the same row to be modified, we should tuple-route the row
+-- only once; no extra rows should be inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 |  1 | 70
+ list_part1 | 2 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop table non_parted;
+drop function func_parted_mod_b();
 -- create custom operator class and hash function, for the same reason
 -- explained in alter_table.sql
 create or replace function dummy_hashint4(a int4, seed int8) returns int8 as
@@ -271,9 +829,8 @@ insert into hpart4 values (3, 4);
 update hpart1 set a = 3, b=4 where a = 1;
 ERROR:  new row for relation "hpart1" violates partition constraint
 DETAIL:  Failing row contains (3, 4).
+-- ok: row movement
 update hash_parted set b = b - 1 where b = 1;
-ERROR:  new row for relation "hpart1" violates partition constraint
-DETAIL:  Failing row contains (1, 0).
 -- ok
 update hash_parted set b = b + 8 where b = 1;
 -- cleanup
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 0c70d64..53c6441 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,25 +107,311 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+
+---------------------------
+-- UPDATE with row movement
+---------------------------
+
+-- An update to a partition should check the partition bound constraint for the
+-- new tuple. If the partition key is updated, the row should be moved to the
+-- appropriate partition. Updatable views on partitioned tables should enforce
+-- the check options for rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+create table part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+alter table range_parted attach partition part_b_20_b_30 for values from ('b', 20) to ('b', 30);
+create table part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+alter table range_parted attach partition part_b_10_b_20 for values from ('b', 10) to ('b', 20);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+
+-- This tests a partition-key UPDATE on a partitioned table that does not have any child partitions.
+update part_b_10_b_20 set b = b - 6;
+
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) partition by range (abs(d));
+alter table part_c_100_200 drop column e, drop column c, drop column a;
+alter table part_c_100_200 add column c numeric, add column e varchar, add column a text;
+alter table part_c_100_200 drop column b;
+alter table part_c_100_200 add column b bigint;
+create table part_d_1_15 partition of part_c_100_200 for values from (1) to (15);
+create table part_d_15_20 partition of part_c_100_200 for values from (15) to (20);
+
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+
+create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+
+-- The subplans should appear in partition bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+
+-- fail (row movement happens only within the partition subtree):
+update part_c_100_200 set c = c - 20, d = c where c = 105;
+-- fail (no partition-key update, so no attempt to move the tuple, but "a = 'a'" violates the partition constraint enforced by the root)
+update part_b_10_b_20 set a = 'a';
+-- success; partition key update, no constraint violation
+update range_parted set d = d - 10 where d > 10;
+-- success; no partition key update, no constraint violation
+update range_parted set e = d;
+-- No row found:
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+:show_data;
+
+-- fail (row movement happens only within the partition subtree):
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+:show_data;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement, check option passes
+update upview set a = 'b', b = 15 where b = 4;
+
+:show_data;
+
+-- cleanup
+drop view upview;
+
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted), *;
+:show_data;
+
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+
+-- Enabling OLD TABLE capture for both DELETE and UPDATE statement triggers
+-- should not cause deleted rows to be captured twice. Similarly for
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+
+-- Install BR triggers on the child partitions, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d1_15 before update or insert on part_d_1_15
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d15_20 before update or insert on part_d_15_20
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_d1_15 ON part_d_1_15;
+drop trigger trig_d15_20 ON part_d_15_20;
+drop function func_parted_mod_b();
+
+-- RLS policies with update-row-movement
+-----------------------------------------
+
+ALTER TABLE range_parted ENABLE ROW LEVEL SECURITY;
+create user regress_range_parted_user;
+GRANT ALL ON range_parted, mintab TO regress_range_parted_user;
+create policy seeall ON range_parted as PERMISSIVE for SELECT using ( true);
+create policy policy_range_parted on range_parted for UPDATE USING (true) WITH CHECK (c % 2 = 0);
+
+:init_range_parted;
+-- This should fail with an RLS violation error while moving the row from
+-- part_a_10_a_20 to part_d_1_15, because we are setting 'c' to an odd number.
+set session authorization regress_range_parted_user;
+update range_parted set a = 'b', c = 151 where a = 'a' and c = 200;
+
+reset session authorization;
+-- Create a trigger on part_d_1_15
+create function func_d_1_15() returns trigger as $$
+begin
+   NEW.c = NEW.c + 1; -- Make even number odd, or vice versa
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_d_1_15 before insert on part_d_1_15
+   for each row execute procedure func_d_1_15();
+
+:init_range_parted;
+set session authorization regress_range_parted_user;
+
+-- Here, the RLS checks should succeed while moving the row from
+-- part_a_10_a_20 to part_d_1_15, because the trigger makes the 'c' value an
+-- even number.
+
+reset session authorization;
+:init_range_parted;
+set session authorization regress_range_parted_user;
+-- This should fail with an RLS violation error because the trigger makes the
+-- 'c' value an odd number.
+update range_parted set a = 'b', c = 150 where a = 'a' and c = 200;
+
+-- Cleanup
+reset session authorization;
+drop trigger trig_d_1_15 ON part_d_1_15;
+drop function func_d_1_15();
+
+-- Policy expression contains SubPlan
+reset session authorization;
+:init_range_parted;
+create policy policy_range_parted_subplan on range_parted AS RESTRICTIVE for UPDATE USING (true)
+    WITH CHECK ((SELECT range_parted.c <= c1 FROM mintab));
+set session authorization regress_range_parted_user;
+-- Should fail because mintab has a row with c1 = 120
+update range_parted set a = 'b', c = 122 where a = 'a' and c = 200;
+-- Should pass
+update range_parted set a = 'b', c = 120 where a = 'a' and c = 200;
+
+-- RLS policy expression contains whole row.
+
+reset session authorization;
+:init_range_parted;
+create policy policy_range_parted_wholerow on range_parted AS RESTRICTIVE for UPDATE USING (true)
+   WITH CHECK (range_parted = row('b', 10, 112, 1, NULL)::range_parted);
+set session authorization regress_range_parted_user;
+-- Should pass the RLS check
+update range_parted set a = 'b', c = 112 where a = 'a' and c = 200;
+reset session authorization;
+:init_range_parted;
+set session authorization regress_range_parted_user;
+-- The whole row RLS check should fail
+update range_parted set a = 'b', c = 116 where a = 'a' and c = 200;
+
+-- Cleanup
+reset session authorization;
+drop policy policy_range_parted ON range_parted;
+drop policy policy_range_parted_subplan ON range_parted;
+drop policy policy_range_parted_wholerow ON range_parted;
+revoke all ON range_parted, mintab FROM regress_range_parted_user;
+drop user regress_range_parted_user;
+
+
+-- statement triggers with update row movement
+---------------------------------------------------
+
+:init_range_parted;
+
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_d_1_15
+create trigger d1_delete_trig
+  after delete on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_update_trig
+  after update on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_insert_trig
+  after insert on part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+create trigger d15_delete_trig
+  after delete on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_update_trig
+  after update on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_insert_trig
+  after insert on part_d_15_20 for each statement execute procedure trigfunc();
+
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+:show_data;
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger d1_delete_trig ON part_d_1_15;
+drop trigger d1_update_trig ON part_d_1_15;
+drop trigger d1_insert_trig ON part_d_1_15;
+drop trigger d15_delete_trig ON part_d_15_20;
+drop trigger d15_update_trig ON part_d_15_20;
+drop trigger d15_insert_trig ON part_d_15_20;
+
+drop table mintab;
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
 
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
 insert into range_parted values ('c', 9);
@@ -134,6 +420,21 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 
+:show_data;
+
+-- Update row movement from non-default to default partition.
+-- Fail, default partition is not under part_a_10_a_20.
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+:show_data;
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+:show_data;
+
 create table list_parted (
 	a text,
 	b int
@@ -148,6 +449,84 @@ update list_default set a = 'a' where a = 'd';
 -- ok
 update list_default set a = 'x' where a = 'd';
 
+drop table list_parted;
+
+--------------
+-- UPDATE of partition key or non-partition columns,
+-- with different column orderings,
+-- and triggers.
+--------------
+
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+
+-- Test partition constraint violation when an intermediate ancestor is used
+-- and the constraint is inherited from the root.
+update sub_parted set a = 2 where c = 10;
+
+-- An UPDATE that does not modify the partition key of the partitions chosen for the update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+
+
+-----------
+-- Triggers can cause UPDATE row movement if they modify the partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This changes a partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger parted_mod_b ON sub_part1;
+
+-- If a BR DELETE trigger prevents the DELETE from happening, we should also
+-- skip the INSERT when that DELETE is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger trig_skip_delete ON sub_part1;
+
+-- UPDATE of partition key with a FROM clause. If the join produces multiple
+-- output rows for the same row to be modified, we should tuple-route the row
+-- only once; no extra rows should be inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+drop table non_parted;
+
+drop function func_parted_mod_b();
+
 -- create custom operator class and hash function, for the same reason
 -- explained in alter_table.sql
 create or replace function dummy_hashint4(a int4, seed int8) returns int8 as
@@ -169,6 +548,7 @@ insert into hpart4 values (3, 4);
 
 -- fail
 update hpart1 set a = 3, b=4 where a = 1;
+-- ok: row movement
 update hash_parted set b = b - 1 where b = 1;
 -- ok
 update hash_parted set b = b + 8 where b = 1;
Attachment: 0001-Encapsulate-partition-related-info-in-a-structure.patch (application/octet-stream)
From: Amit Khandekar <amit.khandekar@enterprisedb.com>
Date: Wed, 22 Nov 2017 15:59:15 +0530
Subject: Encapsulate partition-related info in a structure.

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 254be28..f1149ed 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -166,12 +166,9 @@ typedef struct CopyStateData
 	bool		volatile_defexprs;	/* is any of defexprs volatile? */
 	List	   *range_table;
 
-	PartitionDispatch *partition_dispatch_info;
-	int			num_dispatch;	/* Number of entries in the above array */
-	int			num_partitions; /* Number of members in the following arrays */
-	ResultRelInfo **partitions; /* Per partition result relation pointers */
-	TupleConversionMap **partition_tupconv_maps;
-	TupleTableSlot *partition_tuple_slot;
+	/* Tuple-routing support info */
+	PartitionTupleRouting *partition_tuple_routing;
+
 	TransitionCaptureState *transition_capture;
 	TupleConversionMap **transition_tupconv_maps;
 
@@ -2472,28 +2469,15 @@ CopyFrom(CopyState cstate)
 	 */
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo **partitions;
-		TupleConversionMap **partition_tupconv_maps;
-		TupleTableSlot *partition_tuple_slot;
-		int			num_parted,
-					num_partitions;
+		PartitionTupleRouting *ptr;
 
 		ExecSetupPartitionTupleRouting(NULL,
 									   cstate->rel,
 									   1,
 									   estate,
-									   &partition_dispatch_info,
-									   &partitions,
-									   &partition_tupconv_maps,
-									   &partition_tuple_slot,
-									   &num_parted, &num_partitions);
-		cstate->partition_dispatch_info = partition_dispatch_info;
-		cstate->num_dispatch = num_parted;
-		cstate->partitions = partitions;
-		cstate->num_partitions = num_partitions;
-		cstate->partition_tupconv_maps = partition_tupconv_maps;
-		cstate->partition_tuple_slot = partition_tuple_slot;
+									   &cstate->partition_tuple_routing);
+
+		ptr = cstate->partition_tuple_routing;
 
 		/*
 		 * If we are capturing transition tuples, they may need to be
@@ -2506,11 +2490,11 @@ CopyFrom(CopyState cstate)
 			int			i;
 
 			cstate->transition_tupconv_maps = (TupleConversionMap **)
-				palloc0(sizeof(TupleConversionMap *) * cstate->num_partitions);
-			for (i = 0; i < cstate->num_partitions; ++i)
+				palloc0(sizeof(TupleConversionMap *) * ptr->num_partitions);
+			for (i = 0; i < ptr->num_partitions; ++i)
 			{
 				cstate->transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(cstate->partitions[i]->ri_RelationDesc),
+					convert_tuples_by_name(RelationGetDescr(ptr->partitions[i]->ri_RelationDesc),
 										   RelationGetDescr(cstate->rel),
 										   gettext_noop("could not convert row type"));
 			}
@@ -2530,7 +2514,7 @@ CopyFrom(CopyState cstate)
 	if ((resultRelInfo->ri_TrigDesc != NULL &&
 		 (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
 		  resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
-		cstate->partition_dispatch_info != NULL ||
+		cstate->partition_tuple_routing != NULL ||
 		cstate->volatile_defexprs)
 	{
 		useHeapMultiInsert = false;
@@ -2605,10 +2589,11 @@ CopyFrom(CopyState cstate)
 		ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 
 		/* Determine the partition to heap_insert the tuple into */
-		if (cstate->partition_dispatch_info)
+		if (cstate->partition_tuple_routing)
 		{
 			int			leaf_part_index;
 			TupleConversionMap *map;
+			PartitionTupleRouting *ptr = cstate->partition_tuple_routing;
 
 			/*
 			 * Away we go ... If we end up not finding a partition after all,
@@ -2619,11 +2604,11 @@ CopyFrom(CopyState cstate)
 			 * partition, respectively.
 			 */
 			leaf_part_index = ExecFindPartition(resultRelInfo,
-												cstate->partition_dispatch_info,
+												ptr->partition_dispatch_info,
 												slot,
 												estate);
 			Assert(leaf_part_index >= 0 &&
-				   leaf_part_index < cstate->num_partitions);
+				   leaf_part_index < ptr->num_partitions);
 
 			/*
 			 * If this tuple is mapped to a partition that is not same as the
@@ -2641,7 +2626,7 @@ CopyFrom(CopyState cstate)
 			 * to the selected partition.
 			 */
 			saved_resultRelInfo = resultRelInfo;
-			resultRelInfo = cstate->partitions[leaf_part_index];
+			resultRelInfo = ptr->partitions[leaf_part_index];
 
 			/* We do not yet have a way to insert into a foreign partition */
 			if (resultRelInfo->ri_FdwRoutine)
@@ -2688,7 +2673,7 @@ CopyFrom(CopyState cstate)
 			 * We might need to convert from the parent rowtype to the
 			 * partition rowtype.
 			 */
-			map = cstate->partition_tupconv_maps[leaf_part_index];
+			map = ptr->partition_tupconv_maps[leaf_part_index];
 			if (map)
 			{
 				Relation	partrel = resultRelInfo->ri_RelationDesc;
@@ -2700,7 +2685,7 @@ CopyFrom(CopyState cstate)
 				 * point on.  Use a dedicated slot from this point on until
 				 * we're finished dealing with the partition.
 				 */
-				slot = cstate->partition_tuple_slot;
+				slot = ptr->partition_tuple_slot;
 				Assert(slot != NULL);
 				ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
 				ExecStoreTuple(tuple, slot, InvalidBuffer, true);
@@ -2852,8 +2837,9 @@ CopyFrom(CopyState cstate)
 	ExecCloseIndices(resultRelInfo);
 
 	/* Close all the partitioned tables, leaf partitions, and their indices */
-	if (cstate->partition_dispatch_info)
+	if (cstate->partition_tuple_routing)
 	{
+		PartitionTupleRouting *ptr = cstate->partition_tuple_routing;
 		int			i;
 
 		/*
@@ -2862,23 +2848,23 @@ CopyFrom(CopyState cstate)
 		 * the main target table of COPY that will be closed eventually by
 		 * DoCopy().  Also, tupslot is NULL for the root partitioned table.
 		 */
-		for (i = 1; i < cstate->num_dispatch; i++)
+		for (i = 1; i < ptr->num_dispatch; i++)
 		{
-			PartitionDispatch pd = cstate->partition_dispatch_info[i];
+			PartitionDispatch pd = ptr->partition_dispatch_info[i];
 
 			heap_close(pd->reldesc, NoLock);
 			ExecDropSingleTupleTableSlot(pd->tupslot);
 		}
-		for (i = 0; i < cstate->num_partitions; i++)
+		for (i = 0; i < ptr->num_partitions; i++)
 		{
-			ResultRelInfo *resultRelInfo = cstate->partitions[i];
+			ResultRelInfo *resultRelInfo = ptr->partitions[i];
 
 			ExecCloseIndices(resultRelInfo);
 			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 		}
 
 		/* Release the standalone partition tuple descriptor */
-		ExecDropSingleTupleTableSlot(cstate->partition_tuple_slot);
+		ExecDropSingleTupleTableSlot(ptr->partition_tuple_slot);
 	}
 
 	/* Close any trigger target relations */
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index d545af2..f8f52a1 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -42,22 +42,9 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
  * tuple routing for partitioned tables
  *
  * Output arguments:
- * 'pd' receives an array of PartitionDispatch objects with one entry for
- *		every partitioned table in the partition tree
- * 'partitions' receives an array of ResultRelInfo* objects with one entry for
- *		every leaf partition in the partition tree
- * 'tup_conv_maps' receives an array of TupleConversionMap objects with one
- *		entry for every leaf partition (required to convert input tuple based
- *		on the root table's rowtype to a leaf partition's rowtype after tuple
- *		routing is done)
- * 'partition_tuple_slot' receives a standalone TupleTableSlot to be used
- *		to manipulate any given leaf partition's rowtype after that partition
- *		is chosen by tuple-routing.
- * 'num_parted' receives the number of partitioned tables in the partition
- *		tree (= the number of entries in the 'pd' output array)
- * 'num_partitions' receives the number of leaf partitions in the partition
- *		tree (= the number of entries in the 'partitions' and 'tup_conv_maps'
- *		output arrays
+ *
+ * 'partition_tuple_routing' encapsulates all the partition-related information
+ *		required to do tuple routing.
  *
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
@@ -67,29 +54,30 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel,
 							   Index resultRTindex,
 							   EState *estate,
-							   PartitionDispatch **pd,
-							   ResultRelInfo ***partitions,
-							   TupleConversionMap ***tup_conv_maps,
-							   TupleTableSlot **partition_tuple_slot,
-							   int *num_parted, int *num_partitions)
+							   PartitionTupleRouting **partition_tuple_routing)
 {
 	TupleDesc	tupDesc = RelationGetDescr(rel);
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
 	ResultRelInfo *leaf_part_rri;
+	PartitionTupleRouting *ptr;
 
 	/*
 	 * Get the information about the partition tree after locking all the
 	 * partitions.
 	 */
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-	*pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
-	*num_partitions = list_length(leaf_parts);
-	*partitions = (ResultRelInfo **) palloc(*num_partitions *
+	ptr = *partition_tuple_routing =
+		(PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
+	ptr->partition_dispatch_info =
+		RelationGetPartitionDispatchInfo(rel, &ptr->num_dispatch, &leaf_parts);
+	ptr->num_partitions = list_length(leaf_parts);
+	ptr->partitions = (ResultRelInfo **) palloc(ptr->num_partitions *
 											sizeof(ResultRelInfo *));
-	*tup_conv_maps = (TupleConversionMap **) palloc0(*num_partitions *
-													 sizeof(TupleConversionMap *));
+	ptr->partition_tupconv_maps =
+		(TupleConversionMap **) palloc0(ptr->num_partitions *
+										sizeof(TupleConversionMap *));
 
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
@@ -97,9 +85,9 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 	 * (such as ModifyTableState) and released when the node finishes
 	 * processing.
 	 */
-	*partition_tuple_slot = MakeTupleTableSlot();
+	ptr->partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = (ResultRelInfo *) palloc0(*num_partitions *
+	leaf_part_rri = (ResultRelInfo *) palloc0(ptr->num_partitions *
 											  sizeof(ResultRelInfo));
 	i = 0;
 	foreach(cell, leaf_parts)
@@ -109,7 +97,7 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 
 		/*
 		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in *partitions are eventually
+		 * Note that each of the relations in ptr->partitions is eventually
 		 * closed by the caller.
 		 */
 		partrel = heap_open(lfirst_oid(cell), NoLock);
@@ -119,7 +107,7 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 		 * Save a tuple conversion map to convert a tuple routed to this
 		 * partition from the parent's type to the partition's.
 		 */
-		(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
+		ptr->partition_tupconv_maps[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
 													 gettext_noop("could not convert row type"));
 
 		InitResultRelInfo(leaf_part_rri,
@@ -149,7 +137,7 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		(*partitions)[i] = leaf_part_rri++;
+		ptr->partitions[i] = leaf_part_rri++;
 		i++;
 	}
 }
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index afb83ed..d5f2cfb 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -279,32 +279,33 @@ ExecInsert(ModifyTableState *mtstate,
 	resultRelInfo = estate->es_result_relation_info;
 
 	/* Determine the partition to heap_insert the tuple into */
-	if (mtstate->mt_partition_dispatch_info)
+	if (mtstate->mt_partition_tuple_routing)
 	{
 		int			leaf_part_index;
+		PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;
 		TupleConversionMap *map;
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
 		 * ExecFindPartition() does not return and errors out instead.
 		 * Otherwise, the returned value is to be used as an index into arrays
-		 * mt_partitions[] and mt_partition_tupconv_maps[] that will get us
+		 * ptr->partitions[] and ptr->partition_tupconv_maps[] that will get us
 		 * the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
 		leaf_part_index = ExecFindPartition(resultRelInfo,
-											mtstate->mt_partition_dispatch_info,
+											ptr->partition_dispatch_info,
 											slot,
 											estate);
 		Assert(leaf_part_index >= 0 &&
-			   leaf_part_index < mtstate->mt_num_partitions);
+			   leaf_part_index < ptr->num_partitions);
 
 		/*
 		 * Save the old ResultRelInfo and switch to the one corresponding to
 		 * the selected partition.
 		 */
 		saved_resultRelInfo = resultRelInfo;
-		resultRelInfo = mtstate->mt_partitions[leaf_part_index];
+		resultRelInfo = ptr->partitions[leaf_part_index];
 
 		/* We do not yet have a way to insert into a foreign partition */
 		if (resultRelInfo->ri_FdwRoutine)
@@ -352,7 +353,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = mtstate->mt_partition_tupconv_maps[leaf_part_index];
+		map = ptr->partition_tupconv_maps[leaf_part_index];
 		if (map)
 		{
 			Relation	partrel = resultRelInfo->ri_RelationDesc;
@@ -364,7 +365,7 @@ ExecInsert(ModifyTableState *mtstate,
 			 * on, until we're finished dealing with the partition. Use the
 			 * dedicated slot for that.
 			 */
-			slot = mtstate->mt_partition_tuple_slot;
+			slot = ptr->partition_tuple_slot;
 			Assert(slot != NULL);
 			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
 			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
@@ -1500,9 +1501,10 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 		mtstate->mt_oc_transition_capture != NULL)
 	{
 		int			numResultRelInfos;
+		PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;
 
-		numResultRelInfos = (mtstate->mt_partition_tuple_slot != NULL ?
-							 mtstate->mt_num_partitions :
+		numResultRelInfos = (ptr != NULL ?
+							 ptr->num_partitions :
 							 mtstate->mt_nplans);
 
 		/*
@@ -1515,13 +1517,13 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
 
 		/* Choose the right set of partitions */
-		if (mtstate->mt_partition_dispatch_info != NULL)
+		if (ptr != NULL)
 		{
 			/*
 			 * For tuple routing among partitions, we need TupleDescs based on
 			 * the partition routing table.
 			 */
-			ResultRelInfo **resultRelInfos = mtstate->mt_partitions;
+			ResultRelInfo **resultRelInfos = ptr->partitions;
 
 			for (i = 0; i < numResultRelInfos; ++i)
 			{
@@ -1833,6 +1835,8 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	PartitionTupleRouting *ptr = NULL;
+	int			num_partitions = 0;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -1946,28 +1950,14 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	if (operation == CMD_INSERT &&
 		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo **partitions;
-		TupleConversionMap **partition_tupconv_maps;
-		TupleTableSlot *partition_tuple_slot;
-		int			num_parted,
-					num_partitions;
-
 		ExecSetupPartitionTupleRouting(mtstate,
 									   rel,
 									   node->nominalRelation,
 									   estate,
-									   &partition_dispatch_info,
-									   &partitions,
-									   &partition_tupconv_maps,
-									   &partition_tuple_slot,
-									   &num_parted, &num_partitions);
-		mtstate->mt_partition_dispatch_info = partition_dispatch_info;
-		mtstate->mt_num_dispatch = num_parted;
-		mtstate->mt_partitions = partitions;
-		mtstate->mt_num_partitions = num_partitions;
-		mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
-		mtstate->mt_partition_tuple_slot = partition_tuple_slot;
+									   &mtstate->mt_partition_tuple_routing);
+
+		ptr = mtstate->mt_partition_tuple_routing;
+		num_partitions = ptr->num_partitions;
 	}
 
 	/*
@@ -2010,7 +2000,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
 	 * cases are handled above.
 	 */
-	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
+	if (node->withCheckOptionLists != NIL && num_partitions > 0)
 	{
 		List	   *wcoList;
 		PlanState  *plan;
@@ -2027,14 +2017,14 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			   mtstate->mt_nplans == 1);
 		wcoList = linitial(node->withCheckOptionLists);
 		plan = mtstate->mt_plans[0];
-		for (i = 0; i < mtstate->mt_num_partitions; i++)
+		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *mapped_wcoList;
 			List	   *wcoExprs = NIL;
 			ListCell   *ll;
 
-			resultRelInfo = mtstate->mt_partitions[i];
+			resultRelInfo = ptr->partitions[i];
 			partrel = resultRelInfo->ri_RelationDesc;
 
 			/* varno = node->nominalRelation */
@@ -2103,12 +2093,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * are handled above.
 		 */
 		returningList = linitial(node->returningLists);
-		for (i = 0; i < mtstate->mt_num_partitions; i++)
+		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *rlist;
 
-			resultRelInfo = mtstate->mt_partitions[i];
+			resultRelInfo = ptr->partitions[i];
 			partrel = resultRelInfo->ri_RelationDesc;
 
 			/* varno = node->nominalRelation */
@@ -2377,29 +2367,33 @@ ExecEndModifyTable(ModifyTableState *node)
 	/*
 	 * Close all the partitioned tables, leaf partitions, and their indices
 	 *
-	 * Remember node->mt_partition_dispatch_info[0] corresponds to the root
+	 * Remember ptr->partition_dispatch_info[0] corresponds to the root
 	 * partitioned table, which we must not try to close, because it is the
 	 * main target table of the query that will be closed by ExecEndPlan().
 	 * Also, tupslot is NULL for the root partitioned table.
 	 */
-	for (i = 1; i < node->mt_num_dispatch; i++)
+	if (node->mt_partition_tuple_routing)
 	{
-		PartitionDispatch pd = node->mt_partition_dispatch_info[i];
+		PartitionTupleRouting *ptr = node->mt_partition_tuple_routing;
 
-		heap_close(pd->reldesc, NoLock);
-		ExecDropSingleTupleTableSlot(pd->tupslot);
-	}
-	for (i = 0; i < node->mt_num_partitions; i++)
-	{
-		ResultRelInfo *resultRelInfo = node->mt_partitions[i];
+		for (i = 1; i < ptr->num_dispatch; i++)
+		{
+			PartitionDispatch pd = ptr->partition_dispatch_info[i];
 
-		ExecCloseIndices(resultRelInfo);
-		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
-	}
+			heap_close(pd->reldesc, NoLock);
+			ExecDropSingleTupleTableSlot(pd->tupslot);
+		}
+		for (i = 0; i < ptr->num_partitions; i++)
+		{
+			ResultRelInfo *resultRelInfo = ptr->partitions[i];
+			ExecCloseIndices(resultRelInfo);
+			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
+		}
 
-	/* Release the standalone partition tuple descriptor, if any */
-	if (node->mt_partition_tuple_slot)
-		ExecDropSingleTupleTableSlot(node->mt_partition_tuple_slot);
+		/* Release the standalone partition tuple descriptor, if any */
+		if (ptr->partition_tuple_slot)
+			ExecDropSingleTupleTableSlot(ptr->partition_tuple_slot);
+	}
 
 	/*
 	 * Free the exprcontext
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 86a199d..364d89f 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -49,15 +49,45 @@ typedef struct PartitionDispatchData
 
 typedef struct PartitionDispatchData *PartitionDispatch;
 
+/*-----------------------
+ * PartitionTupleRouting - Encapsulates all information required to execute
+ * tuple-routing between partitions.
+ *
+ * partition_dispatch_info		Array of PartitionDispatch objects with one
+ *								entry for every partitioned table in the
+ *								partition tree.
+ * num_dispatch					number of partitioned tables in the partition
+ *								tree (= length of partition_dispatch_info[])
+ * partitions					Array of ResultRelInfo* objects with one entry
+ *								for every leaf partition in the partition tree.
+ * num_partitions				Number of leaf partitions in the partition tree
+ *								(= 'partitions' array length)
+ * partition_tupconv_maps		Array of TupleConversionMap objects with one
+ *								entry for every leaf partition (required to
+ *								convert input tuple based on the root table's
+ *								rowtype to a leaf partition's rowtype after
+ *								tuple routing is done)
+ * partition_tuple_slot			TupleTableSlot to be used to manipulate any
+ *								given leaf partition's rowtype after that
+ *								partition is chosen for insertion by
+ *								tuple-routing.
+ *-----------------------
+ */
+typedef struct PartitionTupleRouting
+{
+	PartitionDispatch *partition_dispatch_info;
+	int			num_dispatch;
+	ResultRelInfo **partitions;
+	int			num_partitions;
+	TupleConversionMap **partition_tupconv_maps;
+	TupleTableSlot *partition_tuple_slot;
+} PartitionTupleRouting;
+
 extern void ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel,
 							   Index resultRTindex,
 							   EState *estate,
-							   PartitionDispatch **pd,
-							   ResultRelInfo ***partitions,
-							   TupleConversionMap ***tup_conv_maps,
-							   TupleTableSlot **partition_tuple_slot,
-							   int *num_parted, int *num_partitions);
+							   PartitionTupleRouting **partition_tuple_routing);
 extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 1a35c5c..613872a 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -977,15 +977,8 @@ typedef struct ModifyTableState
 	TupleTableSlot *mt_existing;	/* slot to store existing target tuple in */
 	List	   *mt_excludedtlist;	/* the excluded pseudo relation's tlist  */
 	TupleTableSlot *mt_conflproj;	/* CONFLICT ... SET ... projection target */
-	struct PartitionDispatchData **mt_partition_dispatch_info;
+	struct PartitionTupleRouting *mt_partition_tuple_routing;
 	/* Tuple-routing support info */
-	int			mt_num_dispatch;	/* Number of entries in the above array */
-	int			mt_num_partitions;	/* Number of members in the following
-									 * arrays */
-	ResultRelInfo **mt_partitions;	/* Per partition result relation pointers */
-	TupleConversionMap **mt_partition_tupconv_maps;
-	/* Per partition tuple conversion map */
-	TupleTableSlot *mt_partition_tuple_slot;
 	struct TransitionCaptureState *mt_transition_capture;
 	/* controls transition table population for specified operation */
 	struct TransitionCaptureState *mt_oc_transition_capture;
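The net effect of patch 0001 is purely mechanical: six parallel out-parameters (`pd`, `partitions`, `tup_conv_maps`, `partition_tuple_slot`, `num_parted`, `num_partitions`) collapse into a single `PartitionTupleRouting` object that one pointer argument can carry. A hedged Python mirror of that shape (names copied from the struct; the setup inputs are invented) illustrates the refactoring:

```python
from dataclasses import dataclass, field

# Hypothetical Python mirror of the PartitionTupleRouting struct the patch
# introduces; each field corresponds to one of the former out-parameters.
@dataclass
class PartitionTupleRouting:
    partition_dispatch_info: list = field(default_factory=list)
    num_dispatch: int = 0
    partitions: list = field(default_factory=list)
    num_partitions: int = 0
    partition_tupconv_maps: list = field(default_factory=list)
    partition_tuple_slot: object = None

def setup_partition_tuple_routing(root, leaf_parts):
    """One output object instead of the pd/partitions/maps/slot/num_* sextet."""
    ptr = PartitionTupleRouting()
    ptr.partition_dispatch_info = [root]      # one entry per partitioned table
    ptr.num_dispatch = len(ptr.partition_dispatch_info)
    ptr.partitions = list(leaf_parts)         # one entry per leaf partition
    ptr.num_partitions = len(ptr.partitions)
    # Conversion maps start empty; None means "no conversion needed".
    ptr.partition_tupconv_maps = [None] * ptr.num_partitions
    return ptr

ptr = setup_partition_tuple_routing("root_table", ["leaf1", "leaf2", "leaf3"])
print(ptr.num_partitions)  # 3
```

Callers such as `CopyFrom()` and `ExecInitModifyTable()` then keep one pointer (`cstate->partition_tuple_routing`, `mtstate->mt_partition_tuple_routing`) instead of six fields each.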
Attachment: 0002-Changed-is_partition_attr-to-has_partition_attrs.patch
From: Amit Khandekar <amit.khandekar@enterprisedb.com>
Date: Wed, 22 Nov 2017 15:59:15 +0530
Subject: Changed is_partition_attr() to has_partition_attrs()

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 5c4018e..8a3b0ed 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -2599,6 +2599,70 @@ get_partition_for_tuple(Relation relation, Datum *values, bool *isnull)
 }
 
 /*
+ * Checks if any of the 'attnums' is a partition key attribute for rel
+ *
+ * Sets *used_in_expr if any of the 'attnums' is found to be referenced in some
+ * partition key expression.  It's possible for a column to be both used
+ * directly and as part of an expression; if that happens, *used_in_expr may
+ * end up as either true or false.  That's OK for current uses of this
+ * function, because *used_in_expr is only used to tailor the error message
+ * text.
+ */
+bool
+has_partition_attrs(Relation rel, Bitmapset *attnums,
+					bool *used_in_expr)
+{
+	PartitionKey key;
+	int			partnatts;
+	List	   *partexprs;
+	ListCell   *partexprs_item;
+	int			i;
+
+	if (attnums == NULL || rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		return false;
+
+	key = RelationGetPartitionKey(rel);
+	partnatts = get_partition_natts(key);
+	partexprs = get_partition_exprs(key);
+
+	partexprs_item = list_head(partexprs);
+	for (i = 0; i < partnatts; i++)
+	{
+		AttrNumber	partattno = get_partition_col_attnum(key, i);
+
+		if (partattno != 0)
+		{
+			if (bms_is_member(partattno - FirstLowInvalidHeapAttributeNumber,
+							  attnums))
+			{
+				if (used_in_expr)
+					*used_in_expr = false;
+				return true;
+			}
+		}
+		else
+		{
+			/* Arbitrary expression */
+			Node	   *expr = (Node *) lfirst(partexprs_item);
+			Bitmapset  *expr_attrs = NULL;
+
+			/* Find all attributes referenced */
+			pull_varattnos(expr, 1, &expr_attrs);
+			partexprs_item = lnext(partexprs_item);
+
+			if (bms_overlap(attnums, expr_attrs))
+			{
+				if (used_in_expr)
+					*used_in_expr = true;
+				return true;
+			}
+		}
+	}
+
+	return false;
+}
+
+/*
  * qsort_partition_hbound_cmp
  *
  * We sort hash bounds by modulus, then by remainder.
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index d979ce2..64c2185 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -468,7 +468,6 @@ static void RangeVarCallbackForDropRelation(const RangeVar *rel, Oid relOid,
 								Oid oldRelOid, void *arg);
 static void RangeVarCallbackForAlterRelation(const RangeVar *rv, Oid relid,
 								 Oid oldrelid, void *arg);
-static bool is_partition_attr(Relation rel, AttrNumber attnum, bool *used_in_expr);
 static PartitionSpec *transformPartitionSpec(Relation rel, PartitionSpec *partspec, char *strategy);
 static void ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs,
 					  List **partexprs, Oid *partopclass, Oid *partcollation, char strategy);
@@ -6492,68 +6491,6 @@ ATPrepDropColumn(List **wqueue, Relation rel, bool recurse, bool recursing,
 }
 
 /*
- * Checks if attnum is a partition attribute for rel
- *
- * Sets *used_in_expr if attnum is found to be referenced in some partition
- * key expression.  It's possible for a column to be both used directly and
- * as part of an expression; if that happens, *used_in_expr may end up as
- * either true or false.  That's OK for current uses of this function, because
- * *used_in_expr is only used to tailor the error message text.
- */
-static bool
-is_partition_attr(Relation rel, AttrNumber attnum, bool *used_in_expr)
-{
-	PartitionKey key;
-	int			partnatts;
-	List	   *partexprs;
-	ListCell   *partexprs_item;
-	int			i;
-
-	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
-		return false;
-
-	key = RelationGetPartitionKey(rel);
-	partnatts = get_partition_natts(key);
-	partexprs = get_partition_exprs(key);
-
-	partexprs_item = list_head(partexprs);
-	for (i = 0; i < partnatts; i++)
-	{
-		AttrNumber	partattno = get_partition_col_attnum(key, i);
-
-		if (partattno != 0)
-		{
-			if (attnum == partattno)
-			{
-				if (used_in_expr)
-					*used_in_expr = false;
-				return true;
-			}
-		}
-		else
-		{
-			/* Arbitrary expression */
-			Node	   *expr = (Node *) lfirst(partexprs_item);
-			Bitmapset  *expr_attrs = NULL;
-
-			/* Find all attributes referenced */
-			pull_varattnos(expr, 1, &expr_attrs);
-			partexprs_item = lnext(partexprs_item);
-
-			if (bms_is_member(attnum - FirstLowInvalidHeapAttributeNumber,
-							  expr_attrs))
-			{
-				if (used_in_expr)
-					*used_in_expr = true;
-				return true;
-			}
-		}
-	}
-
-	return false;
-}
-
-/*
  * Return value is the address of the dropped column.
  */
 static ObjectAddress
@@ -6613,7 +6550,9 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
 						colName)));
 
 	/* Don't drop columns used in the partition key */
-	if (is_partition_attr(rel, attnum, &is_expr))
+	if (has_partition_attrs(rel,
+							bms_make_singleton(attnum - FirstLowInvalidHeapAttributeNumber),
+							&is_expr))
 	{
 		if (!is_expr)
 			ereport(ERROR,
@@ -8837,7 +8776,9 @@ ATPrepAlterColumnType(List **wqueue,
 						colName)));
 
 	/* Don't alter columns used in the partition key */
-	if (is_partition_attr(rel, attnum, &is_expr))
+	if (has_partition_attrs(rel,
+							bms_make_singleton(attnum - FirstLowInvalidHeapAttributeNumber),
+							&is_expr))
 	{
 		if (!is_expr)
 			ereport(ERROR,
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 2983cfa..f3b7849 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -59,6 +59,8 @@ extern List *map_partition_varattnos(List *expr, int target_varno,
 						bool *found_whole_row);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
+extern bool has_partition_attrs(Relation rel, Bitmapset *attnums,
+					bool *used_in_expr);
 
 extern Oid	get_default_oid_from_partdesc(PartitionDesc partdesc);
 extern Oid	get_default_partition_oid(Oid parentId);
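Patch 0002 generalizes the single-attnum check to a set: `has_partition_attrs()` returns true if any supplied attribute number is a partition key column, either directly or via a key expression, and `*used_in_expr` records which case matched (only to tailor the error text). A small Python sketch of that decision (the key-entry encoding here is invented for illustration):

```python
# Hedged sketch of the has_partition_attrs() logic: the partition key is a
# list of entries, each either a direct column attno or an expression that
# references a set of column attnos.  The check succeeds if the supplied
# attnum set touches any of them.

def has_partition_attrs(key_entries, attnums):
    """key_entries: list of ('col', attno) or ('expr', frozenset_of_attnos).

    Returns (matched, used_in_expr); used_in_expr is None when nothing matched.
    """
    for kind, ref in key_entries:
        if kind == 'col':
            if ref in attnums:
                return True, False    # matched a direct key column
        else:
            if attnums & ref:
                return True, True     # matched inside a key expression
    return False, None

# e.g. PARTITION BY RANGE (a, (b + c)): column 1 directly, 2 and 3 via expr.
key = [('col', 1), ('expr', frozenset({2, 3}))]
print(has_partition_attrs(key, {3}))  # (True, True)
print(has_partition_attrs(key, {1}))  # (True, False)
print(has_partition_attrs(key, {4}))  # (False, None)
```

In the patch, the old single-column callers (`ATExecDropColumn`, `ATPrepAlterColumnType`) simply wrap their attnum in a one-member bitmapset.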
Attachment: 0003-Renaming-parameters-of-map_partition_var_attnos.patch
From 1226fbfa5c1355d99867c4a4fbf456176a1fd090 Mon Sep 17 00:00:00 2001
From: Amit Khandekar <amit.khandekar@enterprisedb.com>
Date: Tue, 19 Dec 2017 13:05:30 +0530
Subject: [PATCH] Renaming parameters of map_partition_var_attnos()

---
 src/backend/catalog/partition.c | 17 +++++++++--------
 src/include/catalog/partition.h |  4 ++--
 2 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 8a3b0ed..1189dd9 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1446,7 +1446,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
 
 /*
  * map_partition_varattnos - maps varattno of any Vars in expr from the
- * parent attno to partition attno.
+ * attnos of 'from_rel' to the attnos of 'to_rel'.
+ * Either rel can be a leaf partition or a partitioned table.
  *
  * We must allow for cases where physical attnos of a partition can be
  * different from the parent's.
@@ -1459,8 +1460,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
  * are working on Lists, so it's less messy to do the casts internally.
  */
 List *
-map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row)
 {
 	bool		my_found_whole_row = false;
@@ -1469,14 +1470,14 @@ map_partition_varattnos(List *expr, int target_varno,
 	{
 		AttrNumber *part_attnos;
 
-		part_attnos = convert_tuples_by_name_map(RelationGetDescr(partrel),
-												 RelationGetDescr(parent),
+		part_attnos = convert_tuples_by_name_map(RelationGetDescr(to_rel),
+												 RelationGetDescr(from_rel),
 												 gettext_noop("could not convert row type"));
 		expr = (List *) map_variable_attnos((Node *) expr,
-											target_varno, 0,
+											fromrel_varno, 0,
 											part_attnos,
-											RelationGetDescr(parent)->natts,
-											RelationGetForm(partrel)->reltype,
+											RelationGetDescr(from_rel)->natts,
+											RelationGetForm(to_rel)->reltype,
 											&my_found_whole_row);
 	}
 
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index f3b7849..d50bc66 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -54,8 +54,8 @@ extern void check_new_partition_bound(char *relname, Relation parent,
 extern Oid	get_partition_parent(Oid relid);
 extern List *get_qual_from_partbound(Relation rel, Relation parent,
 						PartitionBoundSpec *spec);
-extern List *map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+extern List *map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
-- 
2.1.4
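The reason `map_partition_varattnos()` exists at all is that physical attribute numbers can differ between the two relations (for instance after a dropped column), so Vars must be remapped by matching column names between the two tuple descriptors. A hedged Python sketch of that mapping idea (column names and helper names are invented):

```python
# Sketch of the attno-mapping idea behind map_partition_varattnos():
# build a name-based correspondence between from_rel and to_rel columns,
# then translate each from_rel attno into the matching to_rel attno.

def build_attno_map(to_cols, from_cols):
    """For each to_rel column position, the matching from_rel attno (1-based)."""
    from_pos = {name: i + 1 for i, name in enumerate(from_cols)}
    return [from_pos[name] for name in to_cols]

def map_varattnos(expr_attnos, to_cols, from_cols):
    """Translate a list of from_rel attnos into to_rel attnos."""
    attno_map = build_attno_map(to_cols, from_cols)
    # Invert: for each from_rel attno, the to_rel position that maps to it.
    inverse = {from_attno: to_pos + 1
               for to_pos, from_attno in enumerate(attno_map)}
    return [inverse[a] for a in expr_attnos]

parent = ["a", "b", "c"]   # from_rel column order
part   = ["b", "c", "a"]   # to_rel has a different physical order
print(map_varattnos([1, 3], part, parent))  # parent's a,c -> partition [3, 2]
```

The rename in patch 0003 makes this direction-agnostic: since the UPDATE row-movement work needs to map in both directions (parent to partition and back), the parameters become `from_rel`/`to_rel` rather than `parent`/`partrel`.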

Attachment: 0004-Refactor-CheckConstraint-related-code.patch
From aca254148a23e03540207f6d95217d99d22fa670 Mon Sep 17 00:00:00 2001
From: Amit Khandekar <amit.khandekar@enterprisedb.com>
Date: Tue, 19 Dec 2017 19:12:22 +0530
Subject: [PATCH] Refactor CheckConstraint related code.

---
 src/backend/commands/copy.c            |   2 +-
 src/backend/executor/execMain.c        | 107 +++++++++++++++++++--------------
 src/backend/executor/execPartition.c   |   5 +-
 src/backend/executor/execReplication.c |   4 +-
 src/backend/executor/nodeModifyTable.c |   4 +-
 src/include/executor/executor.h        |   7 ++-
 6 files changed, 74 insertions(+), 55 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index f1149ed..40aa511 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2736,7 +2736,7 @@ CopyFrom(CopyState cstate)
 
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr || check_partition_constr)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index dbaa47f..5ec92d5 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1849,16 +1849,12 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  *
  * Exported in executor.h for outside use.
+ * Returns true if it meets the partition constraint, else returns false.
  */
-void
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
-	Relation	rel = resultRelInfo->ri_RelationDesc;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	Bitmapset  *modifiedCols;
-	Bitmapset  *insertedCols;
-	Bitmapset  *updatedCols;
 	ExprContext *econtext;
 
 	/*
@@ -1886,52 +1882,69 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
 	 */
-	if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+	return ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext);
+}
+
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
+	/*
+	 * Need to first convert the tuple to the root partitioned table's row
+	 * type. For details, check similar comments in ExecConstraints().
+	 */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
+									 gettext_noop("could not convert row type"));
+		if (map != NULL)
 		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
-										 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecSetSlotDescriptor(slot, tupdesc);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+			tuple = do_convert_tuple(tuple, map);
+			ExecSetSlotDescriptor(slot, tupdesc);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("new row for relation \"%s\" violates partition constraint",
-						RelationGetRelationName(orig_rel)),
-				 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+			 errmsg("new row for relation \"%s\" violates partition constraint",
+					RelationGetRelationName(orig_rel)),
+			 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if
+ * requested, checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1939,7 +1952,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2055,8 +2069,9 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 }
 
 
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index f8f52a1..58ec51c 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -170,8 +170,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 	/* start with the root partitioned table */
 	parent = pd[0];
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index bd786a1..995c54c 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -401,7 +401,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -466,7 +466,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index d5f2cfb..deb0810 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -487,7 +487,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 		/* Check the constraints of the tuple */
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -1049,7 +1049,7 @@ lreplace:;
 		 * tuple-routing is performed here, hence the slot remains unchanged.
 		 */
 		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/*
 		 * replace the heap tuple
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index dea9216..c2ef0ce 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,9 +187,12 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
-extern void ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
 				   TupleTableSlot *slot, EState *estate);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
-- 
2.1.4

#214 Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#211)
6 attachment(s)
Re: [HACKERS] UPDATE of partition key

On 15 December 2017 at 18:28, Robert Haas <robertmhaas@gmail.com> wrote:

Reviewing the preparatory patch:

+ PartitionTupleRouting *partition_tuple_routing;
+ /* Tuple-routing support info */

Something's wrong with the formatting here.

Moved the comment above the declaration.

-    PartitionDispatch **pd,
-    ResultRelInfo ***partitions,
-    TupleConversionMap ***tup_conv_maps,
-    TupleTableSlot **partition_tuple_slot,
-    int *num_parted, int *num_partitions)
+    PartitionTupleRouting **partition_tuple_routing)

Since we're consolidating all of ExecSetupPartitionTupleRouting's
output parameters into a single structure, I think it might make more
sense to have it just return that value. I think it's only done with
output parameter today because there are so many different things
being produced, and we can't return them all.

You mean ExecSetupPartitionTupleRouting() will return the structure
(not a pointer to the structure), and the caller will get a copy of
the structure, like this?:

mtstate->mt_partition_tuple_routing =
ExecSetupPartitionTupleRouting(mtstate, rel, node->nominalRelation, estate);

I am OK with that, but just wanted to confirm that this is what you
are suggesting. I don't recall seeing a structure return value in PG
code, so I am not sure whether it is conventional in PG to do that.
Hence, I am somewhat inclined to keep it as an output param. That also
avoids a structure copy.

Another way is for ExecSetupPartitionTupleRouting() to palloc this
structure and return its pointer, but then the caller would have to do
a structure copy anyway, so that's not convenient, and I don't think
you are suggesting this way either.

+ PartitionTupleRouting *ptr = mtstate->mt_partition_tuple_routing;

This is just nitpicking, but I don't find "ptr" to be the greatest
variable name; it looks too much like "pointer". Maybe we could use
"routing" or "proute" or something.

Done. Renamed it to "proute".

It seems to me that we could improve things here by adding a function
ExecCleanupTupleRouting(PartitionTupleRouting *) which would do the
various heap_close(), ExecDropSingleTupleTableSlot(), and
ExecCloseIndices() operations which are currently performed in
CopyFrom() and, by separate code, in ExecEndModifyTable().

Done. Changes are kept in a new preparatory patch
0005-Organize-cleanup-done-for-partition-tuple-routing.patch
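(The consolidation being discussed amounts to one teardown routine that
walks the dispatch and partition arrays, replacing the duplicated loops
in CopyFrom() and ExecEndModifyTable(). A rough stand-alone sketch, with
simplified stand-in types instead of the real executor structures and
integer flags standing in for heap_close()/slot-drop calls:)

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-ins for the executor structures; not the real definitions. */
typedef struct PartitionTupleRouting
{
	int			num_dispatch;		/* entry 0 is the root, owned by the caller */
	int			num_partitions;
	int		   *dispatch_closed;	/* stand-in for heap_close + slot drop */
	int		   *partitions_closed;	/* stand-in for ExecCloseIndices + heap_close */
} PartitionTupleRouting;

/*
 * ExecCleanupTupleRouting (sketch): single place for the teardown.
 * Skips element 0 of the dispatch array because that is the query's
 * main target table, which is closed elsewhere.
 */
static void
ExecCleanupTupleRouting(PartitionTupleRouting *proute)
{
	int			i;

	for (i = 1; i < proute->num_dispatch; i++)
		proute->dispatch_closed[i] = 1;

	for (i = 0; i < proute->num_partitions; i++)
		proute->partitions_closed[i] = 1;
}
```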

Yet to address your other review comments.

Attached is patch v31. (Preparatory patches to be applied in order of
patch numbers, followed by the main patch)

Thanks
-Amit

Attachments:

0001-Encapsulate-partition-related-info-in-a-structure_v2.patch (application/octet-stream)
From: Amit Khandekar <amit.khandekar@enterprisedb.com>
Date: Wed, 22 Nov 2017 15:59:15 +0530
Subject: Encapsulate partition-related info in a structure.

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 254be28..5c5496f 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -166,12 +166,9 @@ typedef struct CopyStateData
 	bool		volatile_defexprs;	/* is any of defexprs volatile? */
 	List	   *range_table;
 
-	PartitionDispatch *partition_dispatch_info;
-	int			num_dispatch;	/* Number of entries in the above array */
-	int			num_partitions; /* Number of members in the following arrays */
-	ResultRelInfo **partitions; /* Per partition result relation pointers */
-	TupleConversionMap **partition_tupconv_maps;
-	TupleTableSlot *partition_tuple_slot;
+	/* Tuple-routing support info */
+	PartitionTupleRouting *partition_tuple_routing;
+
 	TransitionCaptureState *transition_capture;
 	TupleConversionMap **transition_tupconv_maps;
 
@@ -2472,28 +2469,15 @@ CopyFrom(CopyState cstate)
 	 */
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo **partitions;
-		TupleConversionMap **partition_tupconv_maps;
-		TupleTableSlot *partition_tuple_slot;
-		int			num_parted,
-					num_partitions;
+		PartitionTupleRouting *proute;
 
 		ExecSetupPartitionTupleRouting(NULL,
 									   cstate->rel,
 									   1,
 									   estate,
-									   &partition_dispatch_info,
-									   &partitions,
-									   &partition_tupconv_maps,
-									   &partition_tuple_slot,
-									   &num_parted, &num_partitions);
-		cstate->partition_dispatch_info = partition_dispatch_info;
-		cstate->num_dispatch = num_parted;
-		cstate->partitions = partitions;
-		cstate->num_partitions = num_partitions;
-		cstate->partition_tupconv_maps = partition_tupconv_maps;
-		cstate->partition_tuple_slot = partition_tuple_slot;
+									   &cstate->partition_tuple_routing);
+
+		proute = cstate->partition_tuple_routing;
 
 		/*
 		 * If we are capturing transition tuples, they may need to be
@@ -2506,11 +2490,11 @@ CopyFrom(CopyState cstate)
 			int			i;
 
 			cstate->transition_tupconv_maps = (TupleConversionMap **)
-				palloc0(sizeof(TupleConversionMap *) * cstate->num_partitions);
-			for (i = 0; i < cstate->num_partitions; ++i)
+				palloc0(sizeof(TupleConversionMap *) * proute->num_partitions);
+			for (i = 0; i < proute->num_partitions; ++i)
 			{
 				cstate->transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(cstate->partitions[i]->ri_RelationDesc),
+					convert_tuples_by_name(RelationGetDescr(proute->partitions[i]->ri_RelationDesc),
 										   RelationGetDescr(cstate->rel),
 										   gettext_noop("could not convert row type"));
 			}
@@ -2530,7 +2514,7 @@ CopyFrom(CopyState cstate)
 	if ((resultRelInfo->ri_TrigDesc != NULL &&
 		 (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
 		  resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
-		cstate->partition_dispatch_info != NULL ||
+		cstate->partition_tuple_routing != NULL ||
 		cstate->volatile_defexprs)
 	{
 		useHeapMultiInsert = false;
@@ -2605,10 +2589,11 @@ CopyFrom(CopyState cstate)
 		ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 
 		/* Determine the partition to heap_insert the tuple into */
-		if (cstate->partition_dispatch_info)
+		if (cstate->partition_tuple_routing)
 		{
 			int			leaf_part_index;
 			TupleConversionMap *map;
+			PartitionTupleRouting *proute = cstate->partition_tuple_routing;
 
 			/*
 			 * Away we go ... If we end up not finding a partition after all,
@@ -2619,11 +2604,11 @@ CopyFrom(CopyState cstate)
 			 * partition, respectively.
 			 */
 			leaf_part_index = ExecFindPartition(resultRelInfo,
-												cstate->partition_dispatch_info,
+												proute->partition_dispatch_info,
 												slot,
 												estate);
 			Assert(leaf_part_index >= 0 &&
-				   leaf_part_index < cstate->num_partitions);
+				   leaf_part_index < proute->num_partitions);
 
 			/*
 			 * If this tuple is mapped to a partition that is not same as the
@@ -2641,7 +2626,7 @@ CopyFrom(CopyState cstate)
 			 * to the selected partition.
 			 */
 			saved_resultRelInfo = resultRelInfo;
-			resultRelInfo = cstate->partitions[leaf_part_index];
+			resultRelInfo = proute->partitions[leaf_part_index];
 
 			/* We do not yet have a way to insert into a foreign partition */
 			if (resultRelInfo->ri_FdwRoutine)
@@ -2688,7 +2673,7 @@ CopyFrom(CopyState cstate)
 			 * We might need to convert from the parent rowtype to the
 			 * partition rowtype.
 			 */
-			map = cstate->partition_tupconv_maps[leaf_part_index];
+			map = proute->partition_tupconv_maps[leaf_part_index];
 			if (map)
 			{
 				Relation	partrel = resultRelInfo->ri_RelationDesc;
@@ -2700,7 +2685,7 @@ CopyFrom(CopyState cstate)
 				 * point on.  Use a dedicated slot from this point on until
 				 * we're finished dealing with the partition.
 				 */
-				slot = cstate->partition_tuple_slot;
+				slot = proute->partition_tuple_slot;
 				Assert(slot != NULL);
 				ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
 				ExecStoreTuple(tuple, slot, InvalidBuffer, true);
@@ -2852,8 +2837,9 @@ CopyFrom(CopyState cstate)
 	ExecCloseIndices(resultRelInfo);
 
 	/* Close all the partitioned tables, leaf partitions, and their indices */
-	if (cstate->partition_dispatch_info)
+	if (cstate->partition_tuple_routing)
 	{
+		PartitionTupleRouting *proute = cstate->partition_tuple_routing;
 		int			i;
 
 		/*
@@ -2862,23 +2848,23 @@ CopyFrom(CopyState cstate)
 		 * the main target table of COPY that will be closed eventually by
 		 * DoCopy().  Also, tupslot is NULL for the root partitioned table.
 		 */
-		for (i = 1; i < cstate->num_dispatch; i++)
+		for (i = 1; i < proute->num_dispatch; i++)
 		{
-			PartitionDispatch pd = cstate->partition_dispatch_info[i];
+			PartitionDispatch pd = proute->partition_dispatch_info[i];
 
 			heap_close(pd->reldesc, NoLock);
 			ExecDropSingleTupleTableSlot(pd->tupslot);
 		}
-		for (i = 0; i < cstate->num_partitions; i++)
+		for (i = 0; i < proute->num_partitions; i++)
 		{
-			ResultRelInfo *resultRelInfo = cstate->partitions[i];
+			ResultRelInfo *resultRelInfo = proute->partitions[i];
 
 			ExecCloseIndices(resultRelInfo);
 			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 		}
 
 		/* Release the standalone partition tuple descriptor */
-		ExecDropSingleTupleTableSlot(cstate->partition_tuple_slot);
+		ExecDropSingleTupleTableSlot(proute->partition_tuple_slot);
 	}
 
 	/* Close any trigger target relations */
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index d545af2..c21ba55 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -42,22 +42,9 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
  * tuple routing for partitioned tables
  *
  * Output arguments:
- * 'pd' receives an array of PartitionDispatch objects with one entry for
- *		every partitioned table in the partition tree
- * 'partitions' receives an array of ResultRelInfo* objects with one entry for
- *		every leaf partition in the partition tree
- * 'tup_conv_maps' receives an array of TupleConversionMap objects with one
- *		entry for every leaf partition (required to convert input tuple based
- *		on the root table's rowtype to a leaf partition's rowtype after tuple
- *		routing is done)
- * 'partition_tuple_slot' receives a standalone TupleTableSlot to be used
- *		to manipulate any given leaf partition's rowtype after that partition
- *		is chosen by tuple-routing.
- * 'num_parted' receives the number of partitioned tables in the partition
- *		tree (= the number of entries in the 'pd' output array)
- * 'num_partitions' receives the number of leaf partitions in the partition
- *		tree (= the number of entries in the 'partitions' and 'tup_conv_maps'
- *		output arrays
+ *
+ * 'partition_tuple_routing' encapsulates all the partition related information
+ *		required to do tuple routing.
  *
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
@@ -67,29 +54,31 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel,
 							   Index resultRTindex,
 							   EState *estate,
-							   PartitionDispatch **pd,
-							   ResultRelInfo ***partitions,
-							   TupleConversionMap ***tup_conv_maps,
-							   TupleTableSlot **partition_tuple_slot,
-							   int *num_parted, int *num_partitions)
+							   PartitionTupleRouting **partition_tuple_routing)
 {
 	TupleDesc	tupDesc = RelationGetDescr(rel);
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
 	ResultRelInfo *leaf_part_rri;
+	PartitionTupleRouting *proute;
 
 	/*
 	 * Get the information about the partition tree after locking all the
 	 * partitions.
 	 */
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-	*pd = RelationGetPartitionDispatchInfo(rel, num_parted, &leaf_parts);
-	*num_partitions = list_length(leaf_parts);
-	*partitions = (ResultRelInfo **) palloc(*num_partitions *
-											sizeof(ResultRelInfo *));
-	*tup_conv_maps = (TupleConversionMap **) palloc0(*num_partitions *
-													 sizeof(TupleConversionMap *));
+	proute = *partition_tuple_routing =
+		(PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
+	proute->partition_dispatch_info =
+		RelationGetPartitionDispatchInfo(rel, &proute->num_dispatch,
+										 &leaf_parts);
+	proute->num_partitions = list_length(leaf_parts);
+	proute->partitions = (ResultRelInfo **) palloc(proute->num_partitions *
+												   sizeof(ResultRelInfo *));
+	proute->partition_tupconv_maps =
+		(TupleConversionMap **) palloc0(proute->num_partitions *
+										sizeof(TupleConversionMap *));
 
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
@@ -97,9 +86,9 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 	 * (such as ModifyTableState) and released when the node finishes
 	 * processing.
 	 */
-	*partition_tuple_slot = MakeTupleTableSlot();
+	proute->partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = (ResultRelInfo *) palloc0(*num_partitions *
+	leaf_part_rri = (ResultRelInfo *) palloc0(proute->num_partitions *
 											  sizeof(ResultRelInfo));
 	i = 0;
 	foreach(cell, leaf_parts)
@@ -109,7 +98,7 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 
 		/*
 		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in *partitions are eventually
+		 * Note that each of the relations in proute->partitions are eventually
 		 * closed by the caller.
 		 */
 		partrel = heap_open(lfirst_oid(cell), NoLock);
@@ -119,8 +108,9 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 		 * Save a tuple conversion map to convert a tuple routed to this
 		 * partition from the parent's type to the partition's.
 		 */
-		(*tup_conv_maps)[i] = convert_tuples_by_name(tupDesc, part_tupdesc,
-													 gettext_noop("could not convert row type"));
+		proute->partition_tupconv_maps[i] =
+			convert_tuples_by_name(tupDesc, part_tupdesc,
+								   gettext_noop("could not convert row type"));
 
 		InitResultRelInfo(leaf_part_rri,
 						  partrel,
@@ -149,7 +139,7 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		(*partitions)[i] = leaf_part_rri++;
+		proute->partitions[i] = leaf_part_rri++;
 		i++;
 	}
 }
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index afb83ed..20807d3 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -279,32 +279,33 @@ ExecInsert(ModifyTableState *mtstate,
 	resultRelInfo = estate->es_result_relation_info;
 
 	/* Determine the partition to heap_insert the tuple into */
-	if (mtstate->mt_partition_dispatch_info)
+	if (mtstate->mt_partition_tuple_routing)
 	{
 		int			leaf_part_index;
+		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
 		TupleConversionMap *map;
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
 		 * ExecFindPartition() does not return and errors out instead.
 		 * Otherwise, the returned value is to be used as an index into arrays
-		 * mt_partitions[] and mt_partition_tupconv_maps[] that will get us
-		 * the ResultRelInfo and TupleConversionMap for the partition,
+		 * proute->partitions[] and proute->partition_tupconv_maps[] that will
+		 * get us the ResultRelInfo and TupleConversionMap for the partition,
 		 * respectively.
 		 */
 		leaf_part_index = ExecFindPartition(resultRelInfo,
-											mtstate->mt_partition_dispatch_info,
+											proute->partition_dispatch_info,
 											slot,
 											estate);
 		Assert(leaf_part_index >= 0 &&
-			   leaf_part_index < mtstate->mt_num_partitions);
+			   leaf_part_index < proute->num_partitions);
 
 		/*
 		 * Save the old ResultRelInfo and switch to the one corresponding to
 		 * the selected partition.
 		 */
 		saved_resultRelInfo = resultRelInfo;
-		resultRelInfo = mtstate->mt_partitions[leaf_part_index];
+		resultRelInfo = proute->partitions[leaf_part_index];
 
 		/* We do not yet have a way to insert into a foreign partition */
 		if (resultRelInfo->ri_FdwRoutine)
@@ -352,7 +353,7 @@ ExecInsert(ModifyTableState *mtstate,
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = mtstate->mt_partition_tupconv_maps[leaf_part_index];
+		map = proute->partition_tupconv_maps[leaf_part_index];
 		if (map)
 		{
 			Relation	partrel = resultRelInfo->ri_RelationDesc;
@@ -364,7 +365,7 @@ ExecInsert(ModifyTableState *mtstate,
 			 * on, until we're finished dealing with the partition. Use the
 			 * dedicated slot for that.
 			 */
-			slot = mtstate->mt_partition_tuple_slot;
+			slot = proute->partition_tuple_slot;
 			Assert(slot != NULL);
 			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
 			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
@@ -1500,9 +1501,10 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 		mtstate->mt_oc_transition_capture != NULL)
 	{
 		int			numResultRelInfos;
+		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
 
-		numResultRelInfos = (mtstate->mt_partition_tuple_slot != NULL ?
-							 mtstate->mt_num_partitions :
+		numResultRelInfos = (proute != NULL ?
+							 proute->num_partitions :
 							 mtstate->mt_nplans);
 
 		/*
@@ -1515,13 +1517,13 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
 
 		/* Choose the right set of partitions */
-		if (mtstate->mt_partition_dispatch_info != NULL)
+		if (proute != NULL)
 		{
 			/*
 			 * For tuple routing among partitions, we need TupleDescs based on
 			 * the partition routing table.
 			 */
-			ResultRelInfo **resultRelInfos = mtstate->mt_partitions;
+			ResultRelInfo **resultRelInfos = proute->partitions;
 
 			for (i = 0; i < numResultRelInfos; ++i)
 			{
@@ -1833,6 +1835,8 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	PartitionTupleRouting *proute = NULL;
+	int			num_partitions = 0;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -1946,28 +1950,14 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	if (operation == CMD_INSERT &&
 		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		PartitionDispatch *partition_dispatch_info;
-		ResultRelInfo **partitions;
-		TupleConversionMap **partition_tupconv_maps;
-		TupleTableSlot *partition_tuple_slot;
-		int			num_parted,
-					num_partitions;
-
 		ExecSetupPartitionTupleRouting(mtstate,
 									   rel,
 									   node->nominalRelation,
 									   estate,
-									   &partition_dispatch_info,
-									   &partitions,
-									   &partition_tupconv_maps,
-									   &partition_tuple_slot,
-									   &num_parted, &num_partitions);
-		mtstate->mt_partition_dispatch_info = partition_dispatch_info;
-		mtstate->mt_num_dispatch = num_parted;
-		mtstate->mt_partitions = partitions;
-		mtstate->mt_num_partitions = num_partitions;
-		mtstate->mt_partition_tupconv_maps = partition_tupconv_maps;
-		mtstate->mt_partition_tuple_slot = partition_tuple_slot;
+									   &mtstate->mt_partition_tuple_routing);
+
+		proute = mtstate->mt_partition_tuple_routing;
+		num_partitions = proute->num_partitions;
 	}
 
 	/*
@@ -2010,7 +2000,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
 	 * cases are handled above.
 	 */
-	if (node->withCheckOptionLists != NIL && mtstate->mt_num_partitions > 0)
+	if (node->withCheckOptionLists != NIL && num_partitions > 0)
 	{
 		List	   *wcoList;
 		PlanState  *plan;
@@ -2027,14 +2017,14 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			   mtstate->mt_nplans == 1);
 		wcoList = linitial(node->withCheckOptionLists);
 		plan = mtstate->mt_plans[0];
-		for (i = 0; i < mtstate->mt_num_partitions; i++)
+		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *mapped_wcoList;
 			List	   *wcoExprs = NIL;
 			ListCell   *ll;
 
-			resultRelInfo = mtstate->mt_partitions[i];
+			resultRelInfo = proute->partitions[i];
 			partrel = resultRelInfo->ri_RelationDesc;
 
 			/* varno = node->nominalRelation */
@@ -2103,12 +2093,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * are handled above.
 		 */
 		returningList = linitial(node->returningLists);
-		for (i = 0; i < mtstate->mt_num_partitions; i++)
+		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *rlist;
 
-			resultRelInfo = mtstate->mt_partitions[i];
+			resultRelInfo = proute->partitions[i];
 			partrel = resultRelInfo->ri_RelationDesc;
 
 			/* varno = node->nominalRelation */
@@ -2377,29 +2367,33 @@ ExecEndModifyTable(ModifyTableState *node)
 	/*
 	 * Close all the partitioned tables, leaf partitions, and their indices
 	 *
-	 * Remember node->mt_partition_dispatch_info[0] corresponds to the root
+	 * Remember proute->partition_dispatch_info[0] corresponds to the root
 	 * partitioned table, which we must not try to close, because it is the
 	 * main target table of the query that will be closed by ExecEndPlan().
 	 * Also, tupslot is NULL for the root partitioned table.
 	 */
-	for (i = 1; i < node->mt_num_dispatch; i++)
+	if (node->mt_partition_tuple_routing)
 	{
-		PartitionDispatch pd = node->mt_partition_dispatch_info[i];
+		PartitionTupleRouting *proute = node->mt_partition_tuple_routing;
 
-		heap_close(pd->reldesc, NoLock);
-		ExecDropSingleTupleTableSlot(pd->tupslot);
-	}
-	for (i = 0; i < node->mt_num_partitions; i++)
-	{
-		ResultRelInfo *resultRelInfo = node->mt_partitions[i];
+		for (i = 1; i < proute->num_dispatch; i++)
+		{
+			PartitionDispatch pd = proute->partition_dispatch_info[i];
 
-		ExecCloseIndices(resultRelInfo);
-		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
-	}
+			heap_close(pd->reldesc, NoLock);
+			ExecDropSingleTupleTableSlot(pd->tupslot);
+		}
+		for (i = 0; i < proute->num_partitions; i++)
+		{
+			ResultRelInfo *resultRelInfo = proute->partitions[i];
+			ExecCloseIndices(resultRelInfo);
+			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
+		}
 
-	/* Release the standalone partition tuple descriptor, if any */
-	if (node->mt_partition_tuple_slot)
-		ExecDropSingleTupleTableSlot(node->mt_partition_tuple_slot);
+		/* Release the standalone partition tuple descriptor, if any */
+		if (proute->partition_tuple_slot)
+			ExecDropSingleTupleTableSlot(proute->partition_tuple_slot);
+	}
 
 	/*
 	 * Free the exprcontext
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 86a199d..364d89f 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -49,15 +49,45 @@ typedef struct PartitionDispatchData
 
 typedef struct PartitionDispatchData *PartitionDispatch;
 
+/*-----------------------
+ * PartitionTupleRouting - Encapsulates all information required to execute
+ * tuple-routing between partitions.
+ *
+ * partition_dispatch_info		Array of PartitionDispatch objects with one
+ *								entry for every partitioned table in the
+ *								partition tree.
+ * num_dispatch					number of partitioned tables in the partition
+ *								tree (= length of partition_dispatch_info[])
+ * partitions					Array of ResultRelInfo* objects with one entry
+ *								for every leaf partition in the partition tree.
+ * num_partitions				Number of leaf partitions in the partition tree
+ *								(= 'partitions' array length)
+ * partition_tupconv_maps		Array of TupleConversionMap objects with one
+ *								entry for every leaf partition (required to
+ *								convert input tuple based on the root table's
+ *								rowtype to a leaf partition's rowtype after
+ *								tuple routing is done)
+ * partition_tuple_slot			TupleTableSlot to be used to manipulate any
+ *								given leaf partition's rowtype after that
+ *								partition is chosen for insertion by
+ *								tuple-routing.
+ *-----------------------
+ */
+typedef struct PartitionTupleRouting
+{
+	PartitionDispatch *partition_dispatch_info;
+	int			num_dispatch;
+	ResultRelInfo **partitions;
+	int			num_partitions;
+	TupleConversionMap **partition_tupconv_maps;
+	TupleTableSlot *partition_tuple_slot;
+} PartitionTupleRouting;
+
 extern void ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel,
 							   Index resultRTindex,
 							   EState *estate,
-							   PartitionDispatch **pd,
-							   ResultRelInfo ***partitions,
-							   TupleConversionMap ***tup_conv_maps,
-							   TupleTableSlot **partition_tuple_slot,
-							   int *num_parted, int *num_partitions);
+							   PartitionTupleRouting **partition_tuple_routing);
 extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index c9a5279..486b415 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -983,15 +983,8 @@ typedef struct ModifyTableState
 	TupleTableSlot *mt_existing;	/* slot to store existing target tuple in */
 	List	   *mt_excludedtlist;	/* the excluded pseudo relation's tlist  */
 	TupleTableSlot *mt_conflproj;	/* CONFLICT ... SET ... projection target */
-	struct PartitionDispatchData **mt_partition_dispatch_info;
+	struct PartitionTupleRouting *mt_partition_tuple_routing;
 	/* Tuple-routing support info */
-	int			mt_num_dispatch;	/* Number of entries in the above array */
-	int			mt_num_partitions;	/* Number of members in the following
-									 * arrays */
-	ResultRelInfo **mt_partitions;	/* Per partition result relation pointers */
-	TupleConversionMap **mt_partition_tupconv_maps;
-	/* Per partition tuple conversion map */
-	TupleTableSlot *mt_partition_tuple_slot;
 	struct TransitionCaptureState *mt_transition_capture;
 	/* controls transition table population for specified operation */
 	struct TransitionCaptureState *mt_oc_transition_capture;
0002-Changed-is_partition_attr-to-has_partition_attrs.patch
From: Amit Khandekar <amit.khandekar@enterprisedb.com>
Date: Wed, 22 Nov 2017 15:59:15 +0530
Subject: Changed is_partition_attr() to has_partition_attrs()

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 5c4018e..8a3b0ed 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -2599,6 +2599,70 @@ get_partition_for_tuple(Relation relation, Datum *values, bool *isnull)
 }
 
 /*
+ * Checks if any of the attributes in 'attnums' is a partition key attribute of rel
+ *
+ * Sets *used_in_expr if any of the 'attnums' is found to be referenced in some
+ * partition key expression.  It's possible for a column to be both used
+ * directly and as part of an expression; if that happens, *used_in_expr may
+ * end up as either true or false.  That's OK for current uses of this
+ * function, because *used_in_expr is only used to tailor the error message
+ * text.
+ */
+bool
+has_partition_attrs(Relation rel, Bitmapset *attnums,
+					bool *used_in_expr)
+{
+	PartitionKey key;
+	int			partnatts;
+	List	   *partexprs;
+	ListCell   *partexprs_item;
+	int			i;
+
+	if (attnums == NULL || rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		return false;
+
+	key = RelationGetPartitionKey(rel);
+	partnatts = get_partition_natts(key);
+	partexprs = get_partition_exprs(key);
+
+	partexprs_item = list_head(partexprs);
+	for (i = 0; i < partnatts; i++)
+	{
+		AttrNumber	partattno = get_partition_col_attnum(key, i);
+
+		if (partattno != 0)
+		{
+			if (bms_is_member(partattno - FirstLowInvalidHeapAttributeNumber,
+							  attnums))
+			{
+				if (used_in_expr)
+					*used_in_expr = false;
+				return true;
+			}
+		}
+		else
+		{
+			/* Arbitrary expression */
+			Node	   *expr = (Node *) lfirst(partexprs_item);
+			Bitmapset  *expr_attrs = NULL;
+
+			/* Find all attributes referenced */
+			pull_varattnos(expr, 1, &expr_attrs);
+			partexprs_item = lnext(partexprs_item);
+
+			if (bms_overlap(attnums, expr_attrs))
+			{
+				if (used_in_expr)
+					*used_in_expr = true;
+				return true;
+			}
+		}
+	}
+
+	return false;
+}
+
+/*
  * qsort_partition_hbound_cmp
  *
  * We sort hash bounds by modulus, then by remainder.
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index d979ce2..64c2185 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -468,7 +468,6 @@ static void RangeVarCallbackForDropRelation(const RangeVar *rel, Oid relOid,
 								Oid oldRelOid, void *arg);
 static void RangeVarCallbackForAlterRelation(const RangeVar *rv, Oid relid,
 								 Oid oldrelid, void *arg);
-static bool is_partition_attr(Relation rel, AttrNumber attnum, bool *used_in_expr);
 static PartitionSpec *transformPartitionSpec(Relation rel, PartitionSpec *partspec, char *strategy);
 static void ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs,
 					  List **partexprs, Oid *partopclass, Oid *partcollation, char strategy);
@@ -6492,68 +6491,6 @@ ATPrepDropColumn(List **wqueue, Relation rel, bool recurse, bool recursing,
 }
 
 /*
- * Checks if attnum is a partition attribute for rel
- *
- * Sets *used_in_expr if attnum is found to be referenced in some partition
- * key expression.  It's possible for a column to be both used directly and
- * as part of an expression; if that happens, *used_in_expr may end up as
- * either true or false.  That's OK for current uses of this function, because
- * *used_in_expr is only used to tailor the error message text.
- */
-static bool
-is_partition_attr(Relation rel, AttrNumber attnum, bool *used_in_expr)
-{
-	PartitionKey key;
-	int			partnatts;
-	List	   *partexprs;
-	ListCell   *partexprs_item;
-	int			i;
-
-	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
-		return false;
-
-	key = RelationGetPartitionKey(rel);
-	partnatts = get_partition_natts(key);
-	partexprs = get_partition_exprs(key);
-
-	partexprs_item = list_head(partexprs);
-	for (i = 0; i < partnatts; i++)
-	{
-		AttrNumber	partattno = get_partition_col_attnum(key, i);
-
-		if (partattno != 0)
-		{
-			if (attnum == partattno)
-			{
-				if (used_in_expr)
-					*used_in_expr = false;
-				return true;
-			}
-		}
-		else
-		{
-			/* Arbitrary expression */
-			Node	   *expr = (Node *) lfirst(partexprs_item);
-			Bitmapset  *expr_attrs = NULL;
-
-			/* Find all attributes referenced */
-			pull_varattnos(expr, 1, &expr_attrs);
-			partexprs_item = lnext(partexprs_item);
-
-			if (bms_is_member(attnum - FirstLowInvalidHeapAttributeNumber,
-							  expr_attrs))
-			{
-				if (used_in_expr)
-					*used_in_expr = true;
-				return true;
-			}
-		}
-	}
-
-	return false;
-}
-
-/*
  * Return value is the address of the dropped column.
  */
 static ObjectAddress
@@ -6613,7 +6550,9 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
 						colName)));
 
 	/* Don't drop columns used in the partition key */
-	if (is_partition_attr(rel, attnum, &is_expr))
+	if (has_partition_attrs(rel,
+							bms_make_singleton(attnum - FirstLowInvalidHeapAttributeNumber),
+							&is_expr))
 	{
 		if (!is_expr)
 			ereport(ERROR,
@@ -8837,7 +8776,9 @@ ATPrepAlterColumnType(List **wqueue,
 						colName)));
 
 	/* Don't alter columns used in the partition key */
-	if (is_partition_attr(rel, attnum, &is_expr))
+	if (has_partition_attrs(rel,
+							bms_make_singleton(attnum - FirstLowInvalidHeapAttributeNumber),
+							&is_expr))
 	{
 		if (!is_expr)
 			ereport(ERROR,
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 2983cfa..f3b7849 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -59,6 +59,8 @@ extern List *map_partition_varattnos(List *expr, int target_varno,
 						bool *found_whole_row);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
+extern bool has_partition_attrs(Relation rel, Bitmapset *attnums,
+					bool *used_in_expr);
 
 extern Oid	get_default_oid_from_partdesc(PartitionDesc partdesc);
 extern Oid	get_default_partition_oid(Oid parentId);
0003-Renaming-parameters-of-map_partition_var_attnos.patch
From 1226fbfa5c1355d99867c4a4fbf456176a1fd090 Mon Sep 17 00:00:00 2001
From: Amit Khandekar <amit.khandekar@enterprisedb.com>
Date: Tue, 19 Dec 2017 13:05:30 +0530
Subject: [PATCH] Renaming parameters of map_partition_var_attnos()

---
 src/backend/catalog/partition.c | 17 +++++++++--------
 src/include/catalog/partition.h |  4 ++--
 2 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 8a3b0ed..1189dd9 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1446,7 +1446,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
 
 /*
  * map_partition_varattnos - maps varattno of any Vars in expr from the
- * parent attno to partition attno.
+ * attnos of 'from_rel' to the attnos of 'to_rel'.  Each of the rels can be
+ * either a leaf partition or a partitioned table.
  *
  * We must allow for cases where physical attnos of a partition can be
  * different from the parent's.
@@ -1459,8 +1460,8 @@ get_qual_from_partbound(Relation rel, Relation parent,
  * are working on Lists, so it's less messy to do the casts internally.
  */
 List *
-map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row)
 {
 	bool		my_found_whole_row = false;
@@ -1469,14 +1470,14 @@ map_partition_varattnos(List *expr, int target_varno,
 	{
 		AttrNumber *part_attnos;
 
-		part_attnos = convert_tuples_by_name_map(RelationGetDescr(partrel),
-												 RelationGetDescr(parent),
+		part_attnos = convert_tuples_by_name_map(RelationGetDescr(to_rel),
+												 RelationGetDescr(from_rel),
 												 gettext_noop("could not convert row type"));
 		expr = (List *) map_variable_attnos((Node *) expr,
-											target_varno, 0,
+											fromrel_varno, 0,
 											part_attnos,
-											RelationGetDescr(parent)->natts,
-											RelationGetForm(partrel)->reltype,
+											RelationGetDescr(from_rel)->natts,
+											RelationGetForm(to_rel)->reltype,
 											&my_found_whole_row);
 	}
 
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index f3b7849..d50bc66 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -54,8 +54,8 @@ extern void check_new_partition_bound(char *relname, Relation parent,
 extern Oid	get_partition_parent(Oid relid);
 extern List *get_qual_from_partbound(Relation rel, Relation parent,
 						PartitionBoundSpec *spec);
-extern List *map_partition_varattnos(List *expr, int target_varno,
-						Relation partrel, Relation parent,
+extern List *map_partition_varattnos(List *expr, int fromrel_varno,
+						Relation to_rel, Relation from_rel,
 						bool *found_whole_row);
 extern List *RelationGetPartitionQual(Relation rel);
 extern Expr *get_partition_qual_relid(Oid relid);
-- 
2.1.4

0004-Refactor-CheckConstraint-related-code.patch
From aca254148a23e03540207f6d95217d99d22fa670 Mon Sep 17 00:00:00 2001
From: Amit Khandekar <amit.khandekar@enterprisedb.com>
Date: Tue, 19 Dec 2017 19:12:22 +0530
Subject: [PATCH] Refactor CheckConstraint related code.

---
 src/backend/commands/copy.c            |   2 +-
 src/backend/executor/execMain.c        | 107 +++++++++++++++++++--------------
 src/backend/executor/execPartition.c   |   5 +-
 src/backend/executor/execReplication.c |   4 +-
 src/backend/executor/nodeModifyTable.c |   4 +-
 src/include/executor/executor.h        |   7 ++-
 6 files changed, 74 insertions(+), 55 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index f1149ed..40aa511 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2736,7 +2736,7 @@ CopyFrom(CopyState cstate)
 
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr || check_partition_constr)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index dbaa47f..5ec92d5 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1849,16 +1849,12 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  *
  * Exported in executor.h for outside use.
+ * Returns true if the tuple meets the partition constraint, else false.
  */
-void
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
-	Relation	rel = resultRelInfo->ri_RelationDesc;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	Bitmapset  *modifiedCols;
-	Bitmapset  *insertedCols;
-	Bitmapset  *updatedCols;
 	ExprContext *econtext;
 
 	/*
@@ -1886,52 +1882,69 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
 	 */
-	if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+	return ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext);
+}
+
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
+	/*
+	 * Need to first convert the tuple to the root partitioned table's row
+	 * type. For details, check similar comments in ExecConstraints().
+	 */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
+									 gettext_noop("could not convert row type"));
+		if (map != NULL)
 		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
-										 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecSetSlotDescriptor(slot, tupdesc);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+			tuple = do_convert_tuple(tuple, map);
+			ExecSetSlotDescriptor(slot, tupdesc);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("new row for relation \"%s\" violates partition constraint",
-						RelationGetRelationName(orig_rel)),
-				 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+			 errmsg("new row for relation \"%s\" violates partition constraint",
+					RelationGetRelationName(orig_rel)),
+			 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if
+ * requested, checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1939,7 +1952,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2055,8 +2069,9 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 }
 
 
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index f8f52a1..58ec51c 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -170,8 +170,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 	/* start with the root partitioned table */
 	parent = pd[0];
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index bd786a1..995c54c 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -401,7 +401,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -466,7 +466,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index d5f2cfb..deb0810 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -487,7 +487,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 		/* Check the constraints of the tuple */
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -1049,7 +1049,7 @@ lreplace:;
 		 * tuple-routing is performed here, hence the slot remains unchanged.
 		 */
 		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/*
 		 * replace the heap tuple
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index dea9216..c2ef0ce 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,9 +187,12 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
-extern void ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
 				   TupleTableSlot *slot, EState *estate);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
-- 
2.1.4

0005-Organize-cleanup-done-for-partition-tuple-routing.patch
commit db7139f21cca1d1d6020ee1edbd026b2aae635f6
Author: Amit Khandekar <amit.khandekar@enterprisedb.com>
Date:   Fri Dec 22 16:23:06 2017 +0530

    Organize cleanup done for partition-tuple-routing.
    
    The same code that closes all the partitioned tables, leaf partitions,
    and their indices exists in two places, namely CopyFrom() and
    ExecEndModifyTable(). Move this code into a common function
    ExecCleanupTupleRouting().

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index d7638a1..ca73a22 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2838,34 +2838,7 @@ CopyFrom(CopyState cstate)
 
 	/* Close all the partitioned tables, leaf partitions, and their indices */
 	if (cstate->partition_tuple_routing)
-	{
-		PartitionTupleRouting *proute = cstate->partition_tuple_routing;
-		int			i;
-
-		/*
-		 * Remember cstate->partition_dispatch_info[0] corresponds to the root
-		 * partitioned table, which we must not try to close, because it is
-		 * the main target table of COPY that will be closed eventually by
-		 * DoCopy().  Also, tupslot is NULL for the root partitioned table.
-		 */
-		for (i = 1; i < proute->num_dispatch; i++)
-		{
-			PartitionDispatch pd = proute->partition_dispatch_info[i];
-
-			heap_close(pd->reldesc, NoLock);
-			ExecDropSingleTupleTableSlot(pd->tupslot);
-		}
-		for (i = 0; i < proute->num_partitions; i++)
-		{
-			ResultRelInfo *resultRelInfo = proute->partitions[i];
-
-			ExecCloseIndices(resultRelInfo);
-			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
-		}
-
-		/* Release the standalone partition tuple descriptor */
-		ExecDropSingleTupleTableSlot(proute->partition_tuple_slot);
-	}
+		ExecCleanupTupleRouting(cstate->partition_tuple_routing);
 
 	/* Close any trigger target relations */
 	ExecCleanUpTriggerState(estate);
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 8eaf7db..7d926f5 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -264,6 +264,45 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 }
 
 /*
+ * ExecCleanupTupleRouting -- Clean up objects allocated for partition tuple
+ * routing.
+ *
+ * Close all the partitioned tables, leaf partitions, and their indices.
+ */
+void
+ExecCleanupTupleRouting(PartitionTupleRouting *proute)
+{
+	int		i;
+
+	/*
+	 * Remember, proute->partition_dispatch_info[0] corresponds to the root
+	 * partitioned table, which we must not try to close, because it is the
+	 * main target table of the query that will be closed by callers such as
+	 * ExecEndPlan() or DoCopy().
+	 * Also, tupslot is NULL for the root partitioned table.
+	 */
+	for (i = 1; i < proute->num_dispatch; i++)
+	{
+		PartitionDispatch pd = proute->partition_dispatch_info[i];
+
+		heap_close(pd->reldesc, NoLock);
+		ExecDropSingleTupleTableSlot(pd->tupslot);
+	}
+
+	for (i = 0; i < proute->num_partitions; i++)
+	{
+		ResultRelInfo *resultRelInfo = proute->partitions[i];
+
+		ExecCloseIndices(resultRelInfo);
+		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
+	}
+
+	/* Release the standalone partition tuple descriptor, if any */
+	if (proute->partition_tuple_slot)
+		ExecDropSingleTupleTableSlot(proute->partition_tuple_slot);
+}
+
+/*
  * RelationGetPartitionDispatchInfo
  *		Returns information necessary to route tuples down a partition tree
  *
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index ef0f680..ed12952 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -2364,36 +2364,9 @@ ExecEndModifyTable(ModifyTableState *node)
 														   resultRelInfo);
 	}
 
-	/*
-	 * Close all the partitioned tables, leaf partitions, and their indices
-	 *
-	 * Remember proute->partition_dispatch_info[0] corresponds to the root
-	 * partitioned table, which we must not try to close, because it is the
-	 * main target table of the query that will be closed by ExecEndPlan().
-	 * Also, tupslot is NULL for the root partitioned table.
-	 */
+	/* Close all the partitioned tables, leaf partitions, and their indices */
 	if (node->mt_partition_tuple_routing)
-	{
-		PartitionTupleRouting *proute = node->mt_partition_tuple_routing;
-
-		for (i = 1; i < proute->num_dispatch; i++)
-		{
-			PartitionDispatch pd = proute->partition_dispatch_info[i];
-
-			heap_close(pd->reldesc, NoLock);
-			ExecDropSingleTupleTableSlot(pd->tupslot);
-		}
-		for (i = 0; i < proute->num_partitions; i++)
-		{
-			ResultRelInfo *resultRelInfo = proute->partitions[i];
-			ExecCloseIndices(resultRelInfo);
-			heap_close(resultRelInfo->ri_RelationDesc, NoLock);
-		}
-
-		/* Release the standalone partition tuple descriptor, if any */
-		if (proute->partition_tuple_slot)
-			ExecDropSingleTupleTableSlot(proute->partition_tuple_slot);
-	}
+		ExecCleanupTupleRouting(node->mt_partition_tuple_routing);
 
 	/*
 	 * Free the exprcontext
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 364d89f..1591b53 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -92,5 +92,6 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern void ExecCleanupTupleRouting(PartitionTupleRouting *proute);
 
 #endif							/* EXECPARTITION_H */
update-partition-key_v31.patch
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index b1167a4..9d21f9a 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3005,6 +3005,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved to a
+    different partition in which the new row satisfies its partition constraint.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3302,9 +3307,22 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</command> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</command> causes a row to move from one
+       partition to another, there is a chance that another concurrent
+       <command>UPDATE</command> or <command>DELETE</command> misses this row.
+       Suppose session 1 is performing an <command>UPDATE</command> on a
+       partition key, and meanwhile a concurrent session 2, for which this row
+       is visible, performs an <command>UPDATE</command> or
+       <command>DELETE</command> operation on this row. Session 2 can silently
+       miss the row if the row is deleted from the partition due to session
+       1's activity.  In such a case, session 2's
+       <command>UPDATE</command>/<command>DELETE</command>, being unaware of
+       the row movement, concludes that the row has just been deleted, so there
+       is nothing to be done for this row. In the usual case where the
+       table is not partitioned, or where there is no row movement, session 2
+       would have identified the newly updated row and carried out the
+       <command>UPDATE</command>/<command>DELETE</command> on this new row
+       version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index c0d0f71..3c665f0 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,17 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree whose partition
+   constraint this row satisfies, then the row is moved to that partition.
+   If there is no such partition, an error will occur. The error will
+   also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</command> and
+   <command>INSERT</command> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</command> or <command>DELETE</command> on the
+   same row may miss this row. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index bf5d3f9..aaffc4d 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -154,6 +154,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to move
+    to another partition, it will be performed as a <command>DELETE</command>
+    from the original partition followed by <command>INSERT</command> into the
+    new partition. In this case, all row-level <literal>BEFORE</literal>
+    <command>UPDATE</command> triggers and all row-level
+    <literal>BEFORE</literal> <command>DELETE</command> triggers are fired on
+    the original partition. Then all row-level <literal>BEFORE</literal>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    Surprising outcomes are possible if any of these triggers affect the row
+    being moved. As far as <literal>AFTER ROW</literal>
+    triggers are concerned, <literal>AFTER</literal> <command>DELETE</command>
+    and <literal>AFTER</literal> <command>INSERT</command> triggers are
+    applied; but <literal>AFTER</literal> <command>UPDATE</command> triggers
+    are not applied because the <command>UPDATE</command> has been converted to
+    a <command>DELETE</command> and <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index ca73a22..49f1bf3 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2838,7 +2838,7 @@ CopyFrom(CopyState cstate)
 
 	/* Close all the partitioned tables, leaf partitions, and their indices */
 	if (cstate->partition_tuple_routing)
-		ExecCleanupTupleRouting(cstate->partition_tuple_routing);
+		ExecCleanupTupleRouting(NULL, cstate->partition_tuple_routing);
 
 	/* Close any trigger target relations */
 	ExecCleanUpTriggerState(estate);
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 92ae382..73ec872 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2854,8 +2854,13 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 	{
 		HeapTuple	trigtuple;
 
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
+		/*
+		 * Note: if the UPDATE is converted into a DELETE+INSERT as part of
+		 * update-partition-key operation, then this function is also called
+		 * separately for DELETE and INSERT to capture transition table rows.
+		 * In such a case, either the old tuple or the new tuple can be NULL.
+		 */
+		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
 			trigtuple = GetTupleForTrigger(estate,
 										   NULL,
 										   relinfo,
@@ -5414,7 +5419,12 @@ AfterTriggerPendingOnRel(Oid relid)
  *	triggers actually need to be queued.  It is also called after each row,
  *	even if there are no triggers for that event, if there are any AFTER
  *	STATEMENT triggers for the statement which use transition tables, so that
- *	the transition tuplestores can be built.
+ *	the transition tuplestores can be built.  Furthermore, if the transition
+ *	capture is happening for UPDATEd rows being moved to another partition
+ *	due to a partition-key change, then this function is called once when the row is
+ *	deleted (to capture OLD row), and once when the row is inserted to another
+ *	partition (to capture NEW row).  This is done separately because DELETE and
+ *	INSERT happen on different tables.
  *
  *	Transition tuplestores are built now, rather than when events are pulled
  *	off of the queue because AFTER ROW triggers are allowed to select from the
@@ -5463,12 +5473,25 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		bool		update_new_table = transition_capture->tcs_update_new_table;
 		bool		insert_new_table = transition_capture->tcs_insert_new_table;
 
-		if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_old_table))
+		/*
+		 * For INSERT events newtup should be non-NULL, for DELETE events
+		 * oldtup should be non-NULL, whereas for UPDATE events normally both
+		 * oldtup and newtup are non-NULL.  But for an UPDATE event fired to
+		 * capture transition tuples during UPDATE partition-key row movement,
+		 * oldtup is NULL when the event is for the row being inserted, and
+		 * newtup is NULL when the event is for the row being deleted.
+		 */
+		Assert(!(event == TRIGGER_EVENT_DELETE && delete_old_table &&
+				 oldtup == NULL));
+		Assert(!(event == TRIGGER_EVENT_INSERT && insert_new_table &&
+				 newtup == NULL));
+
+		if (oldtup != NULL &&
+			((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_old_table)))
 		{
 			Tuplestorestate *old_tuplestore;
 
-			Assert(oldtup != NULL);
 			old_tuplestore = transition_capture->tcs_private->old_tuplestore;
 
 			if (map != NULL)
@@ -5481,12 +5504,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			else
 				tuplestore_puttuple(old_tuplestore, oldtup);
 		}
-		if ((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_new_table))
+		if (newtup != NULL &&
+			((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_new_table)))
 		{
 			Tuplestorestate *new_tuplestore;
 
-			Assert(newtup != NULL);
 			new_tuplestore = transition_capture->tcs_private->new_tuplestore;
 
 			if (original_insert_tuple != NULL)
@@ -5502,11 +5525,17 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				tuplestore_puttuple(new_tuplestore, newtup);
 		}
 
-		/* If transition tables are the only reason we're here, return. */
+		/*
+		 * If transition tables are the only reason we're here, return. As
+		 * mentioned above, we can also be here during update tuple routing in
+		 * the presence of transition tables; this function is then called
+		 * separately for oldtup and newtup, so either can be NULL, but not both.
+		 */
 		if (trigdesc == NULL ||
 			(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
 			(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+			(event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
 			return;
 	}
 
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 7d926f5..b485912 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -60,9 +60,23 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
+	ResultRelInfo *leaf_part_arr = NULL,
+				  *update_rri = NULL;
+	int			num_update_rri = 0,
+				update_rri_index = 0;
+	bool		is_update = false;
 	PartitionTupleRouting *proute;
 
+	/* Initialization specific to update */
+	if (mtstate && mtstate->operation == CMD_UPDATE)
+	{
+		ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+
+		is_update = true;
+		update_rri = mtstate->resultRelInfo;
+		num_update_rri = list_length(node->plans);
+	}
+
 	/*
 	 * Get the information about the partition tree after locking all the
 	 * partitions.
@@ -80,6 +94,45 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 		(TupleConversionMap **) palloc0(proute->num_partitions *
 										sizeof(TupleConversionMap *));
 
+	if (is_update)
+	{
+		/*
+		 * For updates, if a leaf partition is already present in the
+		 * per-subplan result rels, we re-use it rather than initialize a
+		 * new result rel. The per-subplan resultrels and the resultrels of
+		 * the leaf partitions are both in the same canonical order, so while
+		 * scanning the leaf partition oids we only need to track the next
+		 * per-subplan result rel to look for. Hence, initialize
+		 * update_rri_index to the first per-subplan result rel, and advance
+		 * it each time we find a match while scanning the leaf partition
+		 * oids.
+		 */
+		update_rri_index = 0;
+
+		/*
+		 * Prepare for generating the mapping from subplan result rels to leaf
+		 * partition position.
+		 */
+		proute->subplan_partition_offsets =
+			palloc(num_update_rri * sizeof(int));
+
+		/*
+		 * For UPDATEs, we require an additional tuple slot for storing
+		 * transient tuples that are converted to the root table descriptor.
+		 */
+		proute->root_tuple_slot = MakeTupleTableSlot();
+	}
+	else
+	{
+		/*
+		 * For inserts, we need to create all new result rels, so avoid
+		 * repeated pallocs by allocating memory for all the result rels in
+		 * bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(proute->num_partitions *
+												  sizeof(ResultRelInfo));
+	}
+
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
 	 * given partition's rowtype.  It is attached to the caller-specified node
@@ -88,20 +141,67 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 	 */
 	proute->partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = (ResultRelInfo *) palloc0(proute->num_partitions *
-											  sizeof(ResultRelInfo));
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (is_update)
+		{
+			/* Is this leaf partition present in the update resultrel? */
+			if (update_rri_index < num_update_rri &&
+				RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
+			{
+				leaf_part_rri = &update_rri[update_rri_index];
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required when we convert the partition's tuple to be
+				 * compatible with the root partitioned table's tuple
+				 * descriptor.  When generating the per-subplan UPDATE result
+				 * rels, this was not set.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+
+				/*
+				 * Save the position of this update rel in the leaf partitions
+				 * array
+				 */
+				proute->subplan_partition_offsets[update_rri_index] = i;
+
+				update_rri_index++;
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = leaf_part_arr + i;
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in proute->partitions are eventually
-		 * closed by the caller.
+		 * If we didn't open the partition rel, it means we haven't initialized
+		 * the result rel either.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above including the leaf
+			 * partitions. Note that each of the newly opened relations in
+			 * proute->partitions are eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri,
+							  partrel,
+							  resultRTindex,
+							  rel,
+							  estate->es_instrument);
+		}
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
@@ -112,14 +212,10 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 			convert_tuples_by_name(tupDesc, part_tupdesc,
 								   gettext_noop("could not convert row type"));
 
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  resultRTindex,
-						  rel,
-						  estate->es_instrument);
-
 		/*
-		 * Verify result relation is a valid target for INSERT.
+		 * Verify result relation is a valid target for an INSERT. Even for
+		 * UPDATEs, we are doing this for tuple routing, so the relation must
+		 * still be valid as an INSERT target.
 		 */
 		CheckValidResultRel(leaf_part_rri, CMD_INSERT);
 
@@ -139,9 +235,15 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		proute->partitions[i] = leaf_part_rri++;
+		proute->partitions[i] = leaf_part_rri;
 		i++;
 	}
+
+	/*
+	 * For UPDATE, we should have found all the per-subplan resultrels in the
+	 * leaf partitions.
+	 */
+	Assert(!is_update || update_rri_index == num_update_rri);
 }
 
 /*
@@ -268,11 +370,18 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
  * routing.
  *
  * Close all the partitioned tables, leaf partitions, and their indices.
+ *
+ * 'mtstate' can be NULL if it is not available to the caller; e.g. for COPY.
+ * It is used only in case of updates, for accessing per-subplan result rels.
  */
 void
-ExecCleanupTupleRouting(PartitionTupleRouting *proute)
+ExecCleanupTupleRouting(ModifyTableState *mtstate,
+						PartitionTupleRouting *proute)
 {
 	int		i;
+	bool	is_update = (mtstate && mtstate->operation == CMD_UPDATE);
+	ResultRelInfo *first_resultRelInfo = NULL;
+	ResultRelInfo *last_resultRelInfo = NULL;
 
 	/*
 	 * Remember, proute->partition_dispatch_info[0] corresponds to the root
@@ -289,15 +398,34 @@ ExecCleanupTupleRouting(PartitionTupleRouting *proute)
 		ExecDropSingleTupleTableSlot(pd->tupslot);
 	}
 
+	/* Save the positions of first and last UPDATE subplan result rels */
+	if (is_update)
+	{
+		first_resultRelInfo = mtstate->resultRelInfo;
+		last_resultRelInfo = mtstate->resultRelInfo + mtstate->mt_nplans;
+	}
+
 	for (i = 0; i < proute->num_partitions; i++)
 	{
 		ResultRelInfo *resultRelInfo = proute->partitions[i];
 
+		/*
+		 * If this result rel is one of the UPDATE subplan result rels, let
+		 * ExecEndPlan() close it. For INSERT or COPY, this does not apply
+		 * because leaf partition result rels are always newly allocated.
+		 */
+		if (is_update &&
+			resultRelInfo >= first_resultRelInfo &&
+			resultRelInfo < last_resultRelInfo)
+			continue;
+
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
 
-	/* Release the standalone partition tuple descriptor, if any */
+	/* Release the standalone partition tuple descriptors, if any */
+	if (proute->root_tuple_slot)
+		ExecDropSingleTupleTableSlot(proute->root_tuple_slot);
 	if (proute->partition_tuple_slot)
 		ExecDropSingleTupleTableSlot(proute->partition_tuple_slot);
 }
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index ed12952..0c43bae 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -46,6 +46,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -63,7 +64,16 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 EState *estate,
 					 bool canSetTag,
 					 TupleTableSlot **returning);
-
+static void ExecSetupChildParentMap(ModifyTableState *mtstate,
+						ResultRelInfo *rootRelInfo,
+						int numResultRelInfos, bool perleaf);
+static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
+													int whichplan);
+static HeapTuple ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+										   TupleConversionMap *map,
+										   HeapTuple tuple,
+										   TupleTableSlot *new_slot,
+										   TupleTableSlot **p_my_slot);
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
  * target relation's rowtype
@@ -241,6 +251,38 @@ ExecCheckTIDVisible(EState *estate,
 	ReleaseBuffer(buffer);
 }
 
+/*
+ * ConvertPartitionTupleSlot -- convenience function for tuple conversion using
+ * 'map'. The tuple, if converted, is stored in 'new_slot', and '*p_my_slot'
+ * is set to 'new_slot'. 'new_slot' should typically be one of the dedicated
+ * partition tuple slots. If 'map' is NULL, '*p_my_slot' is left unchanged.
+ *
+ * Returns the converted tuple, unless 'map' is NULL, in which case the
+ * original tuple is returned unmodified.
+ */
+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+						  TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot,
+						  TupleTableSlot **p_my_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor, as per converted tuple.
+	 */
+	*p_my_slot = new_slot;
+	Assert(new_slot != NULL);
+	ExecSetSlotDescriptor(new_slot, map->outdesc);
+	ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
 /* ----------------------------------------------------------------
  *		ExecInsert
  *
@@ -266,6 +308,9 @@ ExecInsert(ModifyTableState *mtstate,
 	Oid			newId;
 	List	   *recheckIndexes = NIL;
 	TupleTableSlot *result = NULL;
+	TransitionCaptureState *transition_capture;
+
+	transition_capture = mtstate->mt_transition_capture;
 
 	/*
 	 * get the heap tuple out of the tuple table slot, making sure we have a
@@ -283,7 +328,6 @@ ExecInsert(ModifyTableState *mtstate,
 	{
 		int			leaf_part_index;
 		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-		TupleConversionMap *map;
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
@@ -332,8 +376,10 @@ ExecInsert(ModifyTableState *mtstate,
 				 * back to tuplestore format.
 				 */
 				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+
+				Assert(mtstate->mt_is_tupconv_perpart);
 				mtstate->mt_transition_capture->tcs_map =
-					mtstate->mt_transition_tupconv_maps[leaf_part_index];
+					mtstate->mt_childparent_tupconv_maps[leaf_part_index];
 			}
 			else
 			{
@@ -346,30 +392,21 @@ ExecInsert(ModifyTableState *mtstate,
 			}
 		}
 		if (mtstate->mt_oc_transition_capture != NULL)
+		{
+			Assert(mtstate->mt_is_tupconv_perpart);
 			mtstate->mt_oc_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[leaf_part_index];
+				mtstate->mt_childparent_tupconv_maps[leaf_part_index];
+		}
 
 		/*
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = proute->partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = proute->partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(mtstate,
+										  proute->partition_tupconv_maps[leaf_part_index],
+										  tuple,
+										  proute->partition_tuple_slot,
+										  &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -450,6 +487,7 @@ ExecInsert(ModifyTableState *mtstate,
 	}
 	else
 	{
+		WCOKind		wco_kind;
 		/*
 		 * We always check the partition constraint, including when the tuple
 		 * got here via tuple-routing.  However we don't need to in the latter
@@ -467,14 +505,21 @@ ExecInsert(ModifyTableState *mtstate,
 		tuple->t_tableOid = RelationGetRelid(resultRelationDesc);
 
 		/*
-		 * Check any RLS INSERT WITH CHECK policies
+		 * Check any RLS WITH CHECK policies.
 		 *
+		 * Normally we should check INSERT policies. But if the insert is part
+		 * of an update's row movement, we should instead check UPDATE policies,
+		 * because we are executing policies defined on the target table, and
+		 * not those defined on the child partitions.
+		 */
+		wco_kind = (mtstate->operation == CMD_UPDATE) ?
+					WCO_RLS_UPDATE_CHECK : WCO_RLS_INSERT_CHECK;
+		/*
 		 * ExecWithCheckOptions() will skip any WCOs which are not of the kind
 		 * we are looking for at this point.
 		 */
 		if (resultRelInfo->ri_WithCheckOptions != NIL)
-			ExecWithCheckOptions(WCO_RLS_INSERT_CHECK,
-								 resultRelInfo, slot, estate);
+			ExecWithCheckOptions(wco_kind, resultRelInfo, slot, estate);
 
 		/*
 		 * No need though if the tuple has been routed, and a BR trigger
@@ -623,9 +668,32 @@ ExecInsert(ModifyTableState *mtstate,
 		setLastTid(&(tuple->t_self));
 	}
 
+	/*
+	 * If this INSERT is part of a partition-key-UPDATE and we are capturing
+	 * transition tables, put this row into the transition NEW TABLE.
+	 * (Similarly, we need to add the deleted row to OLD TABLE.)  We need to do
+	 * this separately for DELETE and INSERT because they happen on different
+	 * tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_new_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo, NULL,
+							 NULL,
+							 tuple,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * Now that we have already captured the NEW TABLE row, any AR INSERT
+		 * trigger below must not capture it again, so clear transition_capture.
+		 */
+		transition_capture = NULL;
+	}
+
 	/* AFTER ROW INSERT Triggers */
 	ExecARInsertTriggers(estate, resultRelInfo, tuple, recheckIndexes,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
 	list_free(recheckIndexes);
 
@@ -679,6 +747,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool *tuple_deleted,
+		   bool process_returning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -686,6 +756,12 @@ ExecDelete(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
+	TransitionCaptureState *transition_capture;
+
+	transition_capture = mtstate->mt_transition_capture;
+
+	if (tuple_deleted)
+		*tuple_deleted = false;
 
 	/*
 	 * get information on the (current) result relation
@@ -850,12 +926,39 @@ ldelete:;
 	if (canSetTag)
 		(estate->es_processed)++;
 
+	/* The delete has actually happened, so inform that to the caller */
+	if (tuple_deleted)
+		*tuple_deleted = true;
+
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition OLD TABLE if we are capturing transition tables. We need to
+	 * do this separately for DELETE and INSERT because they happen on
+	 * different tables.
+	 */
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_old_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo,
+							 tupleid,
+							 oldtuple,
+							 NULL,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * Now that we have already captured the OLD TABLE row, any AR DELETE
+		 * trigger below must not capture it again, so clear transition_capture.
+		 */
+		transition_capture = NULL;
+	}
+
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
-						 mtstate->mt_transition_capture);
+						 transition_capture);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (process_returning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -948,6 +1051,7 @@ ExecUpdate(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	TupleConversionMap *saved_tcs_map = NULL;
 
 	/*
 	 * abort the operation if not running transactions
@@ -1019,6 +1123,7 @@ ExecUpdate(ModifyTableState *mtstate,
 	else
 	{
 		LockTupleMode lockmode;
+		bool		partition_constraint_failed;
 
 		/*
 		 * Constraints might reference the tableoid column, so initialize
@@ -1034,22 +1139,142 @@ ExecUpdate(ModifyTableState *mtstate,
 		 * (We don't need to redo triggers, however.  If there are any BEFORE
 		 * triggers then trigger.c will have done heap_lock_tuple to lock the
 		 * correct tuple, so there's no need to do them again.)
-		 *
-		 * ExecWithCheckOptions() will skip any WCOs which are not of the kind
-		 * we are looking for at this point.
 		 */
 lreplace:;
-		if (resultRelInfo->ri_WithCheckOptions != NIL)
+
+		/*
+		 * If the partition constraint fails, this row might get moved to
+		 * another partition, in which case we should check the RLS CHECK
+		 * policy just before inserting into the new partition rather than
+		 * here, because a trigger on that partition might change the row
+		 * again.  So skip the WCO checks if the partition constraint fails.
+		 */
+		partition_constraint_failed =
+			resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate);
+
+		if (!partition_constraint_failed &&
+			resultRelInfo->ri_WithCheckOptions != NIL)
+		{
+			/*
+			 * ExecWithCheckOptions() will skip any WCOs which are not of the
+			 * kind we are looking for at this point.
+			 */
 			ExecWithCheckOptions(WCO_RLS_UPDATE_CHECK,
 								 resultRelInfo, slot, estate);
+		}
+
+		/*
+		 * If a partition check failed, try to move the row into the right
+		 * partition.
+		 */
+		if (partition_constraint_failed)
+		{
+			bool		tuple_deleted;
+			TupleTableSlot *ret_slot;
+			PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+			int			map_index;
+			TupleConversionMap *tupconv_map;
+
+			/*
+			 * When an UPDATE is run directly on a leaf partition, we would
+			 * not have partition tuple routing set up. In that case, fail
+			 * with a partition constraint violation error.
+			 */
+			if (proute == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for DELETE. We want to return rows
+			 * from INSERT.
+			 */
+			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
+					   &tuple_deleted, false, false);
+
+			/*
+			 * If the DELETE didn't happen for some reason (e.g. a trigger
+			 * prevented it, it was already deleted by this command, or it
+			 * was concurrently deleted by another transaction), we should
+			 * skip the INSERT as well, or one extra row would be inserted.
+			 *
+			 * For a normal UPDATE, the case where the tuple has been the
+			 * subject of a concurrent UPDATE or DELETE would be handled by
+			 * the EvalPlanQual machinery, but for an UPDATE that we've
+			 * translated into a DELETE from this partition and an INSERT into
+			 * some other partition, that's not available, because CTID chains
+			 * can't span relation boundaries.  We mimic the semantics to a
+			 * limited extent by skipping the INSERT if the DELETE fails to
+			 * find a tuple. This ensures that two concurrent attempts to
+			 * UPDATE the same tuple at the same time can't turn one tuple
+			 * into two, and that an UPDATE of a just-deleted tuple can't
+			 * resurrect it.
+			 */
+			if (!tuple_deleted)
+				return NULL;
+
+			/*
+			 * UPDATEs set the transition capture map only when a new subplan
+			 * is chosen, whereas INSERTs set it for each row. So after the
+			 * INSERT, we need to revert to the map created for UPDATE;
+			 * otherwise the next UPDATE will incorrectly use the one created
+			 * for INSERT.  So first save the one created for UPDATE.
+			 */
+			if (mtstate->mt_transition_capture)
+				saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
+
+			/*
+			 * resultRelInfo is one of the per-subplan resultRelInfos.  So we
+			 * should convert the tuple into root's tuple descriptor, since
+			 * ExecInsert() starts the search from root.  The tuple conversion
+			 * map list is in the order of mtstate->resultRelInfo[], so to
+			 * retrieve the one for this resultRel, we need to know the
+			 * position of the resultRel in mtstate->resultRelInfo[].
+			 */
+			map_index = resultRelInfo - mtstate->resultRelInfo;
+			Assert(map_index >= 0 && map_index < mtstate->mt_nplans);
+			tupconv_map = tupconv_map_for_subplan(mtstate, map_index);
+			tuple = ConvertPartitionTupleSlot(mtstate,
+											  tupconv_map,
+											  tuple,
+											  proute->root_tuple_slot,
+											  &slot);
+
+
+			/*
+			 * For ExecInsert(), make it look like we are inserting into the
+			 * root.
+			 */
+			Assert(mtstate->rootResultRelInfo != NULL);
+			estate->es_result_relation_info = mtstate->rootResultRelInfo;
+
+			ret_slot = ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+
+			/*
+			 * Restore the active result relation and the transition capture
+			 * map that we changed above.
+			 */
+			estate->es_result_relation_info = resultRelInfo;
+			if (mtstate->mt_transition_capture)
+			{
+				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+				mtstate->mt_transition_capture->tcs_map = saved_tcs_map;
+			}
+			return ret_slot;
+		}
 
 		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We've already checked the partition constraint above; however, we
+		 * must still ensure the tuple passes all other constraints, so we will
+		 * call ExecConstraints() and have it validate all remaining checks.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate, true);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1477,7 +1702,6 @@ static void
 ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 {
 	ResultRelInfo *targetRelInfo = getASTriggerResultRelInfo(mtstate);
-	int			i;
 
 	/* Check for transition tables on the directly targeted relation. */
 	mtstate->mt_transition_capture =
@@ -1507,55 +1731,142 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 							 proute->num_partitions :
 							 mtstate->mt_nplans);
 
+		ExecSetupChildParentMap(mtstate, targetRelInfo, numResultRelInfos,
+								(proute != NULL));
+
 		/*
-		 * Build array of conversion maps from each child's TupleDesc to the
-		 * one used in the tuplestore.  The map pointers may be NULL when no
-		 * conversion is necessary, which is hopefully a common case for
-		 * partitions.
+		 * Install the conversion map for the first plan for UPDATE and DELETE
+		 * operations.  It will be advanced each time we switch to the next
+		 * plan.  (INSERT operations set it every time, so we need not update
+		 * mtstate->mt_oc_transition_capture here.)
 		 */
-		mtstate->mt_transition_tupconv_maps = (TupleConversionMap **)
-			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+		if (mtstate->mt_transition_capture && mtstate->operation != CMD_INSERT)
+			mtstate->mt_transition_capture->tcs_map =
+				tupconv_map_for_subplan(mtstate, 0);
+	}
+}
 
-		/* Choose the right set of partitions */
-		if (proute != NULL)
-		{
-			/*
-			 * For tuple routing among partitions, we need TupleDescs based on
-			 * the partition routing table.
-			 */
-			ResultRelInfo **resultRelInfos = proute->partitions;
+/*
+ * Initialize the child-to-root tuple conversion map array.
+ *
+ * This map array is required for two purposes:
+ * 1. For update tuple routing, we need to convert the tuple from the subplan
+ * result rel to the root partitioned table descriptor.
+ * 2. For capturing transition tables that are partitions: for UPDATEs, we
+ * need to convert the tuple from the subplan result rel to the target table
+ * descriptor, and for INSERTs, we need to convert the inserted tuple from the
+ * leaf partition to the target table descriptor.
+ *
+ * The caller can request either a per-subplan map or a per-leaf-partition map.
+ */
+static void
+ExecSetupChildParentMap(ModifyTableState *mtstate,
+						ResultRelInfo *rootRelInfo,
+						int numResultRelInfos, bool perleaf)
+{
+	TupleDesc	outdesc;
+	int			i;
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
-		}
-		else
-		{
-			/* Otherwise we need the ResultRelInfo for each subplan. */
-			ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+	/* First check if there is already one */
+	if (mtstate->mt_childparent_tupconv_maps)
+	{
+		/*
+		 * If a per-leaf map is requested, any map that already exists must be
+		 * per-leaf: a per-subplan map cannot be accessed leaf-partition-wise.
+		 * A per-leaf map, on the other hand, can also be accessed
+		 * subplan-wise through the subplan_partition_offsets array (see
+		 * tupconv_map_for_subplan()).  So callers that may need to access the
+		 * map both leaf-partition-wise and subplan-wise must ensure that the
+		 * first call to this function passes perleaf=true, so that the map
+		 * created is per-leaf rather than per-subplan.
+		 */
+		Assert(!(perleaf && !mtstate->mt_is_tupconv_perpart));
+		return;
+	}
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
+	/* Get tuple descriptor of the root partitioned table. */
+	outdesc = RelationGetDescr(rootRelInfo->ri_RelationDesc);
+
+	/*
+	 * Build an array of conversion maps from each child's TupleDesc to the
+	 * root table's TupleDesc.  The map pointers may be NULL when no
+	 * conversion is necessary, which is hopefully a common case for
+	 * partitions.
+	 */
+	mtstate->mt_childparent_tupconv_maps = (TupleConversionMap **)
+		palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+
+	/* Choose the right set of partitions */
+	if (perleaf)
+	{
+		/*
+		 * For tuple routing among partitions, we need TupleDescs based on the
+		 * partition routing table.
+		 */
+		ResultRelInfo **resultRelInfos;
+
+		Assert(mtstate->mt_partition_tuple_routing != NULL);
+		resultRelInfos = mtstate->mt_partition_tuple_routing->partitions;
+
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_childparent_tupconv_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
+									   outdesc,
+									   gettext_noop("could not convert row type"));
 		}
 
 		/*
-		 * Install the conversion map for the first plan for UPDATE and DELETE
-		 * operations.  It will be advanced each time we switch to the next
-		 * plan.  (INSERT operations set it every time, so we need not update
-		 * mtstate->mt_oc_transition_capture here.)
+		 * Remember that the tuple conversion map is per-leaf, not
+		 * per-subplan.
 		 */
-		if (mtstate->mt_transition_capture)
-			mtstate->mt_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[0];
+		mtstate->mt_is_tupconv_perpart = true;
+	}
+	else
+	{
+		/* Otherwise we need the ResultRelInfo for each subplan. */
+		ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_childparent_tupconv_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
+									   outdesc,
+									   gettext_noop("could not convert row type"));
+		}
+	}
+}
+
+/*
+ * For a given subplan index, get the tuple conversion map.
+ */
+static TupleConversionMap *
+tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
+{
+	Assert(mtstate->mt_childparent_tupconv_maps != NULL);
+
+	/*
+	 * If the tuple conversion map array is per-partition, we first need to
+	 * get the index into the partition array.
+	 */
+	if (mtstate->mt_is_tupconv_perpart)
+	{
+		int			leaf_index;
+		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+
+		Assert(proute && proute->subplan_partition_offsets != NULL);
+		leaf_index = proute->subplan_partition_offsets[whichplan];
+
+		Assert(leaf_index >= 0 && leaf_index < proute->num_partitions);
+		return mtstate->mt_childparent_tupconv_maps[leaf_index];
+	}
+	else
+	{
+		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+		return mtstate->mt_childparent_tupconv_maps[whichplan];
 	}
 }
 
@@ -1662,15 +1973,13 @@ ExecModifyTable(PlanState *pstate)
 				/* Prepare to convert transition tuples from this child. */
 				if (node->mt_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
 					node->mt_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						tupconv_map_for_subplan(node, node->mt_whichplan);
 				}
 				if (node->mt_oc_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
 					node->mt_oc_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						tupconv_map_for_subplan(node, node->mt_whichplan);
 				}
 				continue;
 			}
@@ -1787,7 +2096,8 @@ ExecModifyTable(PlanState *pstate)
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1832,9 +2142,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ResultRelInfo *resultRelInfo;
 	TupleDesc	tupDesc;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	bool		update_tuple_routing_needed = node->partKeyUpdated;
 	PartitionTupleRouting *proute = NULL;
 	int			num_partitions = 0;
 
@@ -1909,6 +2222,16 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(resultRelInfo, mtstate->mt_onconflict != ONCONFLICT_NONE);
 
+		/*
+		 * If this is an UPDATE and a BEFORE UPDATE trigger is present, the
+		 * trigger itself might modify the partition-key values, so arrange
+		 * for tuple routing.
+		 */
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+			operation == CMD_UPDATE)
+			update_tuple_routing_needed = true;
+
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
 		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
@@ -1946,9 +2269,19 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/*
+	 * If it's not a partitioned table after all, UPDATE tuple routing should
+	 * not be attempted.
+	 */
+	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		update_tuple_routing_needed = false;
+
+	/*
+	 * Build state for tuple routing if it's an INSERT or if it's an UPDATE of
+	 * partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || update_tuple_routing_needed))
 	{
 		ExecSetupPartitionTupleRouting(mtstate,
 									   rel,
@@ -1958,6 +2291,13 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 
 		proute = mtstate->mt_partition_tuple_routing;
 		num_partitions = proute->num_partitions;
+
+		/*
+		 * These are needed as reference objects for mapping partition attnos
+		 * in expressions such as WithCheckOptions and RETURNING.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	}
 
 	/*
@@ -1968,6 +2308,18 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		ExecSetupTransitionCaptureState(mtstate, estate);
 
 	/*
+	 * Construct a mapping from each of the per-subplan partition attnos to
+	 * the root attno.  This is required when, during update row movement, the
+	 * tuple descriptor of a source partition does not match the root
+	 * partitioned table descriptor.  In such a case we need to convert tuples
+	 * to the root tuple descriptor, because the search for the destination
+	 * partition starts from the root.  Skip this setup if it's not a
+	 * partition key update.
+	 */
+	if (update_tuple_routing_needed)
+		ExecSetupChildParentMap(mtstate, getASTriggerResultRelInfo(mtstate),
+								mtstate->mt_nplans, false);
+
+	/*
 	 * Initialize any WITH CHECK OPTION constraints if needed.
 	 */
 	resultRelInfo = mtstate->resultRelInfo;
@@ -1997,26 +2349,29 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE row
+	 * movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *first_wcoList;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition.  For UPDATE, however, there are as many WCO lists as
+		 * there are plans.  In either case, use the WCO expression of the
+		 * first resultRelInfo as a reference to compute attnos for the WCO
+		 * expression of each partition.  We make a copy of the WCO qual for
+		 * each partition; note that, if there are SubPlans in there, they all
+		 * end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
-			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
+		Assert(update_tuple_routing_needed ||
+			   (operation == CMD_INSERT &&
+				list_length(node->withCheckOptionLists) == 1 &&
+				mtstate->mt_nplans == 1));
+
+		first_wcoList = linitial(node->withCheckOptionLists);
 		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
@@ -2025,17 +2380,26 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			ListCell   *ll;
 
 			resultRelInfo = proute->partitions[i];
+
+			/*
+			 * If this resultRelInfo is one of the UPDATE subplan result rels,
+			 * it would already have its WithCheckOptions initialized.
+			 */
+			if (resultRelInfo->ri_WithCheckOptions)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel, NULL);
+			mapped_wcoList = map_partition_varattnos(first_wcoList,
+													firstVarno,
+													partrel, firstResultRel,
+													NULL);
 			foreach(ll, mapped_wcoList)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
@@ -2052,7 +2416,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -2089,22 +2453,35 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *rlist;
 
 			resultRelInfo = proute->partitions[i];
+
+			/*
+			 * If this resultRelInfo is one of the UPDATE subplan result rels,
+			 * it would already have its RETURNING projection built.
+			 */
+			if (resultRelInfo->ri_projectReturning)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel, NULL);
+			/*
+			 * Use the RETURNING expression of the first resultRelInfo as a
+			 * reference to compute attnos for the RETURNING expression of
+			 * each partition.
+			 */
+			rlist = map_partition_varattnos(firstReturningList,
+											firstVarno,
+											partrel, firstResultRel, NULL);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
@@ -2366,7 +2743,7 @@ ExecEndModifyTable(ModifyTableState *node)
 
 	/* Close all the partitioned tables, leaf partitions, and their indices */
 	if (node->mt_partition_tuple_routing)
-		ExecCleanupTupleRouting(node->mt_partition_tuple_routing);
+		ExecCleanupTupleRouting(node, node->mt_partition_tuple_routing);
 
 	/*
 	 * Free the exprcontext
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 84d7171..41e28bc 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -204,6 +204,7 @@ _copyModifyTable(const ModifyTable *from)
 	COPY_SCALAR_FIELD(canSetTag);
 	COPY_SCALAR_FIELD(nominalRelation);
 	COPY_NODE_FIELD(partitioned_rels);
+	COPY_SCALAR_FIELD(partKeyUpdated);
 	COPY_NODE_FIELD(resultRelations);
 	COPY_SCALAR_FIELD(resultRelIndex);
 	COPY_SCALAR_FIELD(rootResultRelIndex);
@@ -2263,6 +2264,7 @@ _copyPartitionedChildRelInfo(const PartitionedChildRelInfo *from)
 
 	COPY_SCALAR_FIELD(parent_relid);
 	COPY_NODE_FIELD(child_rels);
+	COPY_SCALAR_FIELD(is_partition_key_update);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 2e869a9..b4b7639 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -908,6 +908,7 @@ _equalPartitionedChildRelInfo(const PartitionedChildRelInfo *a, const Partitione
 {
 	COMPARE_SCALAR_FIELD(parent_relid);
 	COMPARE_NODE_FIELD(child_rels);
+	COMPARE_SCALAR_FIELD(is_partition_key_update);
 
 	return true;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e468d7c..2bdc058 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -372,6 +372,7 @@ _outModifyTable(StringInfo str, const ModifyTable *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(partKeyUpdated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_INT_FIELD(resultRelIndex);
 	WRITE_INT_FIELD(rootResultRelIndex);
@@ -2105,6 +2106,7 @@ _outModifyTablePath(StringInfo str, const ModifyTablePath *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(partKeyUpdated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_NODE_FIELD(subroots);
@@ -2527,6 +2529,7 @@ _outPartitionedChildRelInfo(StringInfo str, const PartitionedChildRelInfo *node)
 
 	WRITE_UINT_FIELD(parent_relid);
 	WRITE_NODE_FIELD(child_rels);
+	WRITE_BOOL_FIELD(is_partition_key_update);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 1133c70..de55a3a 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1568,6 +1568,7 @@ _readModifyTable(void)
 	READ_BOOL_FIELD(canSetTag);
 	READ_UINT_FIELD(nominalRelation);
 	READ_NODE_FIELD(partitioned_rels);
+	READ_BOOL_FIELD(partKeyUpdated);
 	READ_NODE_FIELD(resultRelations);
 	READ_INT_FIELD(resultRelIndex);
 	READ_INT_FIELD(rootResultRelIndex);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 0e8463e..be0d162 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1364,7 +1364,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 			case RTE_RELATION:
 				if (rte->relkind == RELKIND_PARTITIONED_TABLE)
 					partitioned_rels =
-						get_partitioned_child_rels(root, rel->relid);
+						get_partitioned_child_rels(root, rel->relid, NULL);
 				break;
 			case RTE_SUBQUERY:
 				build_partitioned_rels = true;
@@ -1403,7 +1403,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		{
 			List	   *cprels;
 
-			cprels = get_partitioned_child_rels(root, childrel->relid);
+			cprels = get_partitioned_child_rels(root, childrel->relid, NULL);
 			partitioned_rels = list_concat(partitioned_rels,
 										   list_copy(cprels));
 		}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 1a0d3a8..fe34862 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -279,6 +279,7 @@ static ProjectSet *make_project_set(List *tlist, Plan *subplan);
 static ModifyTable *make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool partKeyUpdated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
@@ -2373,6 +2374,7 @@ create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 							best_path->canSetTag,
 							best_path->nominalRelation,
 							best_path->partitioned_rels,
+							best_path->partKeyUpdated,
 							best_path->resultRelations,
 							subplans,
 							best_path->withCheckOptionLists,
@@ -6443,6 +6445,7 @@ static ModifyTable *
 make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool partKeyUpdated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam)
@@ -6469,6 +6472,7 @@ make_modifytable(PlannerInfo *root,
 	node->canSetTag = canSetTag;
 	node->nominalRelation = nominalRelation;
 	node->partitioned_rels = partitioned_rels;
+	node->partKeyUpdated = partKeyUpdated;
 	node->resultRelations = resultRelations;
 	node->resultRelIndex = -1;	/* will be set correctly in setrefs.c */
 	node->rootResultRelIndex = -1;	/* will be set correctly in setrefs.c */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 382791f..8b37609 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1101,6 +1101,7 @@ inheritance_planner(PlannerInfo *root)
 	Query	   *parent_parse;
 	Bitmapset  *parent_relids = bms_make_singleton(top_parentRTindex);
 	PlannerInfo **parent_roots = NULL;
+	bool		partColsUpdated = false;
 
 	Assert(parse->commandType != CMD_INSERT);
 
@@ -1172,7 +1173,8 @@ inheritance_planner(PlannerInfo *root)
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		nominalRelation = top_parentRTindex;
-		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex);
+		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex,
+													  &partColsUpdated);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
 	}
@@ -1512,6 +1514,7 @@ inheritance_planner(PlannerInfo *root)
 									 parse->canSetTag,
 									 nominalRelation,
 									 partitioned_rels,
+									 partColsUpdated,
 									 resultRelations,
 									 subpaths,
 									 subroots,
@@ -2123,6 +2126,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 										parse->canSetTag,
 										parse->resultRelation,
 										NIL,
+										false,
 										list_make1_int(parse->resultRelation),
 										list_make1(path),
 										list_make1(root),
@@ -6155,17 +6159,22 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 /*
  * get_partitioned_child_rels
  *		Returns a list of the RT indexes of the partitioned child relations
- *		with rti as the root parent RT index.
+ *		with rti as the root parent RT index.  If is_partition_key_update is
+ *		non-NULL, it is set to true if any of the root rte's updated columns
+ *		is used in a partition key.
  *
  * Note: This function might get called even for range table entries that
  * are not partitioned tables; in such a case, it will simply return NIL.
  */
 List *
-get_partitioned_child_rels(PlannerInfo *root, Index rti)
+get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   bool *is_partition_key_update)
 {
 	List	   *result = NIL;
 	ListCell   *l;
 
+	if (is_partition_key_update)
+		*is_partition_key_update = false;
+
 	foreach(l, root->pcinfo_list)
 	{
 		PartitionedChildRelInfo *pc = lfirst_node(PartitionedChildRelInfo, l);
@@ -6173,6 +6182,8 @@ get_partitioned_child_rels(PlannerInfo *root, Index rti)
 		if (pc->parent_relid == rti)
 		{
 			result = pc->child_rels;
+			if (is_partition_key_update)
+				*is_partition_key_update = pc->is_partition_key_update;
 			break;
 		}
 	}
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index a24e8ac..c6e1b9e 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -105,7 +105,8 @@ static void expand_partitioned_rtentry(PlannerInfo *root,
 						   RangeTblEntry *parentrte,
 						   Index parentRTindex, Relation parentrel,
 						   PlanRowMark *top_parentrc, LOCKMODE lockmode,
-						   List **appinfos, List **partitioned_child_rels);
+						   List **appinfos, List **partitioned_child_rels,
+						   bool *is_partition_key_update);
 static void expand_single_inheritance_child(PlannerInfo *root,
 								RangeTblEntry *parentrte,
 								Index parentRTindex, Relation parentrel,
@@ -1467,16 +1468,19 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	if (RelationGetPartitionDesc(oldrelation) != NULL)
 	{
 		List	   *partitioned_child_rels = NIL;
+		bool		is_partition_key_update = false;
 
 		Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
 
 		/*
 		 * If this table has partitions, recursively expand them in the order
-		 * in which they appear in the PartitionDesc.
+		 * in which they appear in the PartitionDesc.  While at it, also check
+		 * whether any partition key columns of the partitioned tables are
+		 * being updated.
 		 */
 		expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
 								   lockmode, &root->append_rel_list,
-								   &partitioned_child_rels);
+								   &partitioned_child_rels,
+								   &is_partition_key_update);
 
 		/*
 		 * We keep a list of objects in root, each of which maps a root
@@ -1493,6 +1497,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 			pcinfo = makeNode(PartitionedChildRelInfo);
 			pcinfo->parent_relid = rti;
 			pcinfo->child_rels = partitioned_child_rels;
+			pcinfo->is_partition_key_update = is_partition_key_update;
 			root->pcinfo_list = lappend(root->pcinfo_list, pcinfo);
 		}
 	}
@@ -1569,7 +1574,8 @@ static void
 expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 						   Index parentRTindex, Relation parentrel,
 						   PlanRowMark *top_parentrc, LOCKMODE lockmode,
-						   List **appinfos, List **partitioned_child_rels)
+						   List **appinfos, List **partitioned_child_rels,
+						   bool *is_partition_key_update)
 {
 	int			i;
 	RangeTblEntry *childrte;
@@ -1584,6 +1590,17 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 
 	Assert(parentrte->inh);
 
+	/*
+	 * Note whether any partition key columns are being updated.  Although it
+	 * is the root partitioned table's updatedCols we are interested in, we
+	 * use parentrte to get them; this is convenient because parentrte already
+	 * has the root partrel's updatedCols translated to match the attribute
+	 * ordering of parentrel.
+	 */
+	if (!*is_partition_key_update)
+		*is_partition_key_update =
+			has_partition_attrs(parentrel, parentrte->updatedCols, NULL);
+
 	/* First expand the partitioned table itself. */
 	expand_single_inheritance_child(root, parentrte, parentRTindex, parentrel,
 									top_parentrc, parentrel,
@@ -1623,7 +1640,8 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 		if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 			expand_partitioned_rtentry(root, childrte, childRTindex,
 									   childrel, top_parentrc, lockmode,
-									   appinfos, partitioned_child_rels);
+									   appinfos, partitioned_child_rels,
+									   is_partition_key_update);
 
 		/* Close child relation, but keep locks */
 		heap_close(childrel, NoLock);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 2aee156..eb288f7 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3268,6 +3268,8 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partitioned_rels' is an integer list of RT indexes of non-leaf tables in
  *		the partition tree, if this is an UPDATE/DELETE to a partitioned table.
  *		Otherwise NIL.
+ * 'partKeyUpdated' is true if any partitioning columns are being updated,
+ *		either from the named relation or a descendant partitioned table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
  * 'subpaths' is a list of Path(s) producing source data (one per rel)
  * 'subroots' is a list of PlannerInfo structs (one per rel)
@@ -3281,6 +3283,7 @@ ModifyTablePath *
 create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool partKeyUpdated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
@@ -3348,6 +3351,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->canSetTag = canSetTag;
 	pathnode->nominalRelation = nominalRelation;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
+	pathnode->partKeyUpdated = partKeyUpdated;
 	pathnode->resultRelations = resultRelations;
 	pathnode->subpaths = subpaths;
 	pathnode->subroots = subroots;
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 1591b53..fb0cbd0 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -67,6 +67,9 @@ typedef struct PartitionDispatchData *PartitionDispatch;
  *								convert input tuple based on the root table's
  *								rowtype to a leaf partition's rowtype after
  *								tuple routing is done)
+ * subplan_partition_offsets	Array of partition indexes, ordered by UPDATE
+ *								subplan.  Each element contains the index of
+ *								the corresponding partition in the
+ *								'partitions' array.
  * partition_tuple_slot			TupleTableSlot to be used to manipulate any
  *								given leaf partition's rowtype after that
  *								partition is chosen for insertion by
@@ -80,7 +83,9 @@ typedef struct PartitionTupleRouting
 	ResultRelInfo **partitions;
 	int			num_partitions;
 	TupleConversionMap **partition_tupconv_maps;
+	int		   *subplan_partition_offsets;
 	TupleTableSlot *partition_tuple_slot;
+	TupleTableSlot *root_tuple_slot;
 } PartitionTupleRouting;
 
 extern void ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
@@ -92,6 +97,7 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
-extern void ExecCleanupTupleRouting(PartitionTupleRouting *proute);
+extern void ExecCleanupTupleRouting(ModifyTableState *mtstate,
+						PartitionTupleRouting *proute);
 
 #endif							/* EXECPARTITION_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 486b415..cf0b8bb 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -989,8 +989,9 @@ typedef struct ModifyTableState
 	/* controls transition table population for specified operation */
 	struct TransitionCaptureState *mt_oc_transition_capture;
 	/* controls transition table population for INSERT...ON CONFLICT UPDATE */
-	TupleConversionMap **mt_transition_tupconv_maps;
-	/* Per plan/partition tuple conversion */
+	TupleConversionMap **mt_childparent_tupconv_maps;
+	/* Per plan/partition map for tuple conversion from child to root */
+	bool		mt_is_tupconv_perpart;	/* Is the above map per-partition? */
 } ModifyTableState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index d763da6..e858598 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -219,6 +219,7 @@ typedef struct ModifyTable
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		partKeyUpdated;	/* some part key in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	int			resultRelIndex; /* index of first resultRel in plan's list */
 	int			rootResultRelIndex; /* index of the partitioned table root */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 3b9d303..51bf47b 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1674,6 +1674,7 @@ typedef struct ModifyTablePath
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		partKeyUpdated;	/* some part key in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	List	   *subpaths;		/* Path(s) producing source data */
 	List	   *subroots;		/* per-target-table PlannerInfos */
@@ -2124,6 +2125,9 @@ typedef struct PartitionedChildRelInfo
 
 	Index		parent_relid;
 	List	   *child_rels;
+	bool		is_partition_key_update;	/* is the partition key of any of
+											 * the partitioned tables
+											 * updated? */
 } PartitionedChildRelInfo;
 
 /*
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 3ef12b3..d91962d 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -242,6 +242,7 @@ extern ModifyTablePath *create_modifytable_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool partColsUpdated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index 2801bfd..9f0533c 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -57,7 +57,8 @@ extern Expr *preprocess_phv_expression(PlannerInfo *root, Expr *expr);
 
 extern bool plan_cluster_use_sort(Oid tableOid, Oid indexOid);
 
-extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti);
+extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   bool *is_partition_key_update);
 extern List *get_partitioned_child_rels_for_join(PlannerInfo *root,
 									Relids join_relids);
 
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index b69ceaa..0dfd3a6 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,36 +198,441 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+---------------------------
+-- UPDATE with row movement
+---------------------------
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+create table part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+alter table range_parted attach partition part_b_20_b_30 for values from ('b', 20) to ('b', 30);
+create table part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+alter table range_parted attach partition part_b_10_b_20 for values from ('b', 10) to ('b', 20);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) partition by range (abs(d));
+alter table part_c_100_200 drop column e, drop column c, drop column a;
+alter table part_c_100_200 add column c numeric, add column e varchar, add column a text;
+alter table part_c_100_200 drop column b;
+alter table part_c_100_200 add column b bigint;
+create table part_d_1_15 partition of part_c_100_200 for values from (1) to (15);
+create table part_d_15_20 partition of part_c_100_200 for values from (15) to (20);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+(6 rows)
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+             QUERY PLAN              
+-------------------------------------
+ Update on range_parted
+   Update on part_a_1_a_10
+   Update on part_a_10_a_20
+   Update on part_b_1_b_10
+   Update on part_c_1_100
+   Update on part_d_1_15
+   Update on part_d_15_20
+   Update on part_b_20_b_30
+   ->  Seq Scan on part_a_1_a_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_a_10_a_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_1_b_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_1_100
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_1_15
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_15_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_20_b_30
+         Filter: (c > '97'::numeric)
+(22 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_100_200 set c = c - 20, d = c where c = 105;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (105, 85, null, b, 15).
+-- fail (no partition key update, so no attempt to move tuple, but "a = 'a'" violates partition constraint enforced by root partition)
+update part_b_10_b_20 set a = 'a';
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (null, 1, 96, 12, a).
+-- success; partition key update, no constraint violation
+update range_parted set d = d - 10 where d > 10;
+-- success; no partition key update, no constraint violation
+update range_parted set e = d;
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d | e 
+----------------+---+----+-----+---+---
+ part_a_10_a_20 | a | 10 | 200 | 1 | 1
+ part_a_1_a_10  | a |  1 |   1 | 1 | 1
+ part_d_1_15    | b | 12 | 116 | 1 | 1
+ part_d_1_15    | b | 13 | 117 | 2 | 2
+ part_d_1_15    | b | 15 | 125 | 6 | 6
+ part_d_1_15    | b | 17 | 125 | 9 | 9
+(6 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_d_1_15" violates partition constraint
+DETAIL:  Failing row contains (2, 117, 2, b, 7).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_a_1_a_10 | a |  4 | 200 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+(6 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120, 1, 1).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120, 1, 1).
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+ part_d_1_15   | b | 15 | 199 | 1 | 1
+(6 rows)
+
+-- cleanup
+drop view upview;
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+ range_parted  | a | b  | c  | d  | e 
+---------------+---+----+----+----+---
+ (b,15,95,16,) | b | 15 | 95 | 16 | 
+ (b,17,95,19,) | b | 17 | 95 | 19 | 
+(2 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  95 | 16 | 
+ part_c_1_100   | b | 17 |  95 | 19 | 
+(6 rows)
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,110,1,), (b,13,98,2,), (b,15,106,16,), (b,17,106,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 13 |  98 |  2 | 
+ part_d_15_20   | b | 15 | 106 | 16 | 
+ part_d_15_20   | b | 17 | 106 | 19 | 
+ part_d_1_15    | b | 12 | 110 |  1 | 
+(6 rows)
+
+:init_range_parted;
+-- Enabling OLD TABLE capture for both DELETE as well as UPDATE stmt triggers
+-- should not cause DELETEd rows to be captured twice. Similar thing for
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,146,1,), (b,13,147,2,), (b,15,155,16,), (b,17,155,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_15_20   | b | 15 | 155 | 16 | 
+ part_d_15_20   | b | 17 | 155 | 19 | 
+ part_d_1_15    | b | 12 | 146 |  1 | 
+ part_d_1_15    | b | 13 | 147 |  2 | 
+(6 rows)
+
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d1_15 before update or insert on part_d_1_15
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d15_20 before update or insert on part_d_15_20
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 15 |  98 |  2 | 
+ part_d_15_20   | b | 17 | 106 | 16 | 
+ part_d_15_20   | b | 19 | 106 | 19 | 
+ part_d_1_15    | b | 15 | 110 |  1 | 
+(6 rows)
+
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_15_20   | b | 17 | 155 | 16 | 
+ part_d_15_20   | b | 19 | 155 | 19 | 
+ part_d_1_15    | b | 15 | 146 |  1 | 
+ part_d_1_15    | b | 16 | 147 |  2 | 
+(6 rows)
+
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_d1_15 ON part_d_1_15;
+drop trigger trig_d15_20 ON part_d_15_20;
+drop function func_parted_mod_b();
+-- RLS policies with update-row-movement
+-----------------------------------------
+ALTER TABLE range_parted ENABLE ROW LEVEL SECURITY;
+create user regress_range_parted_user;
+GRANT ALL ON range_parted, mintab TO regress_range_parted_user;
+create policy seeall ON range_parted as PERMISSIVE for SELECT using ( true);
+create policy policy_range_parted on range_parted for UPDATE USING (true) WITH CHECK (c % 2 = 0);
+:init_range_parted;
+-- This should fail with RLS violation error while moving row from
+-- part_a_10_a_20 to part_d_1_15, because we are setting 'c' to an odd number.
+set session authorization regress_range_parted_user;
+update range_parted set a = 'b', c = 151 where a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy for table "range_parted"
+reset session authorization;
+-- Create a trigger on part_d_1_15
+create function func_d_1_15() returns trigger as $$
+begin
+   NEW.c = NEW.c + 1; -- Make even number odd, or vice versa
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_d_1_15 before insert on part_d_1_15
+   for each row execute procedure func_d_1_15();
+:init_range_parted;
+set session authorization regress_range_parted_user;
+-- Here, RLS checks should succeed while moving row from part_a_10_a_20 to
+-- part_d_1_15, because trigger makes 'c' value an even number.
+update range_parted set a = 'b', c = 151 where a = 'a' and c = 200;
+reset session authorization;
+:init_range_parted;
+set session authorization regress_range_parted_user;
+-- This should fail with RLS violation error because trigger makes 'c' value
+-- an odd number.
+update range_parted set a = 'b', c = 150 where a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy for table "range_parted"
+-- Cleanup
+reset session authorization;
+drop trigger trig_d_1_15 ON part_d_1_15;
+drop function func_d_1_15();
+-- Policy expression contains SubPlan
+reset session authorization;
+:init_range_parted;
+create policy policy_range_parted_subplan on range_parted AS RESTRICTIVE for UPDATE USING (true)
+    WITH CHECK ((SELECT range_parted.c <= c1 FROM mintab ));
+set session authorization regress_range_parted_user;
+-- Should fail because mintab has row with c1 = 120
+update range_parted set a = 'b', c = 122 where a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy "policy_range_parted_subplan" for table "range_parted"
+-- Should pass
+update range_parted set a = 'b', c = 120 where a = 'a' and c = 200;
+-- RLS policy expression contains whole row.
+reset session authorization;
+:init_range_parted;
+create policy policy_range_parted_wholerow on range_parted AS RESTRICTIVE for UPDATE USING (true)
+   WITH CHECK (range_parted = row('b', 10, 112, 1, NULL)::range_parted);
+set session authorization regress_range_parted_user;
+-- Should succeed the RLS check
+update range_parted set a = 'b', c = 112 where a = 'a' and c = 200;
+reset session authorization;
+:init_range_parted;
+set session authorization regress_range_parted_user;
+-- The whole row RLS check should fail
+update range_parted set a = 'b', c = 116 where a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy "policy_range_parted_wholerow" for table "range_parted"
+-- Cleanup
+reset session authorization;
+drop policy policy_range_parted ON range_parted;
+drop policy policy_range_parted_subplan ON range_parted;
+drop policy policy_range_parted_wholerow ON range_parted;
+revoke all ON range_parted, mintab FROM regress_range_parted_user ;
+drop user regress_range_parted_user;
+-- statement triggers with update row movement
+---------------------------------------------------
+:init_range_parted;
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_1_15
+create trigger d1_delete_trig
+  after delete on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_update_trig
+  after update on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_insert_trig
+  after insert on part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+create trigger d15_delete_trig
+  after delete on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_update_trig
+  after update on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_insert_trig
+  after insert on part_d_15_20 for each statement execute procedure trigfunc();
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+NOTICE:  trigger = parent_update_trig fired on table range_parted during UPDATE
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 150 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  55 | 16 | 
+ part_c_1_100   | b | 17 |  55 | 19 | 
+(6 rows)
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger d1_delete_trig ON part_d_1_15;
+drop trigger d1_update_trig ON part_d_1_15;
+drop trigger d1_insert_trig ON part_d_1_15;
+drop trigger d15_delete_trig ON part_d_15_20;
+drop trigger d15_update_trig ON part_d_15_20;
+drop trigger d15_insert_trig ON part_d_15_20;
+drop table mintab;
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
-                                  Table "public.part_def"
- Column |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description 
---------+---------+-----------+----------+---------+----------+--------------+-------------
- a      | text    |           |          |         | extended |              | 
- b      | integer |           |          |         | plain    |              | 
+                                       Table "public.part_def"
+ Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description 
+--------+-------------------+-----------+----------+---------+----------+--------------+-------------
+ a      | text              |           |          |         | extended |              | 
+ b      | bigint            |           |          |         | plain    |              | 
+ c      | numeric           |           |          |         | main     |              | 
+ d      | integer           |           |          |         | plain    |              | 
+ e      | character varying |           |          |         | extended |              | 
 Partition of: range_parted DEFAULT
-Partition constraint: (NOT ((a IS NOT NULL) AND (b IS NOT NULL) AND (((a = 'a'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'a'::text) AND (b >= 10) AND (b < 20)) OR ((a = 'b'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'b'::text) AND (b >= 10) AND (b < 20)))))
+Partition constraint: (NOT ((a IS NOT NULL) AND (b IS NOT NULL) AND (((a = 'a'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'a'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'b'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '20'::bigint) AND (b < '30'::bigint)))))
 
 insert into range_parted values ('c', 9);
 -- ok
@@ -235,7 +640,55 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 ERROR:  new row for relation "part_def" violates partition constraint
-DETAIL:  Failing row contains (a, 9).
+DETAIL:  Failing row contains (a, 9, null, null, null).
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from non-default to default partition.
+-- Fail, default partition is not under part_a_10_a_20;
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+ERROR:  new row for relation "part_a_10_a_20" violates partition constraint
+DETAIL:  Failing row contains (ad, 10, 200, 1, null).
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+:show_data;
+ partname | a  | b  |  c  | d  | e 
+----------+----+----+-----+----+---
+ part_def | ad |  1 |   1 |  1 | 
+ part_def | ad | 10 | 200 |  1 | 
+ part_def | bd | 12 |  96 |  1 | 
+ part_def | bd | 13 |  97 |  2 | 
+ part_def | bd | 15 | 105 | 16 | 
+ part_def | bd | 17 | 105 | 19 | 
+ part_def | d  |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
 create table list_parted (
 	a text,
 	b int
@@ -250,6 +703,111 @@ ERROR:  new row for relation "list_default" violates partition constraint
 DETAIL:  Failing row contains (a, 10).
 -- ok
 update list_default set a = 'x' where a = 'd';
+drop table list_parted;
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+ERROR:  new row for relation "sub_part2" violates partition constraint
+DETAIL:  Failing row contains (2, 10, 2).
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Triggers can cause UPDATE row movement if it modified partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+drop trigger parted_mod_b ON sub_part1 ;
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+drop trigger trig_skip_delete ON sub_part1 ;
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 |  1 | 70
+ list_part1 | 2 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop table non_parted;
+drop function func_parted_mod_b();
 -- create custom operator class and hash function, for the same reason
 -- explained in alter_table.sql
 create or replace function dummy_hashint4(a int4, seed int8) returns int8 as
@@ -271,9 +829,8 @@ insert into hpart4 values (3, 4);
 update hpart1 set a = 3, b=4 where a = 1;
 ERROR:  new row for relation "hpart1" violates partition constraint
 DETAIL:  Failing row contains (3, 4).
+-- ok : row movement
 update hash_parted set b = b - 1 where b = 1;
-ERROR:  new row for relation "hpart1" violates partition constraint
-DETAIL:  Failing row contains (1, 0).
 -- ok
 update hash_parted set b = b + 8 where b = 1;
 -- cleanup
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 0c70d64..53c6441 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,25 +107,311 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+
+---------------------------
+-- UPDATE with row movement
+---------------------------
+
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+create table part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+alter table range_parted attach partition part_b_20_b_30 for values from ('b', 20) to ('b', 30);
+create table part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+alter table range_parted attach partition part_b_10_b_20 for values from ('b', 10) to ('b', 20);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) partition by range (abs(d));
+alter table part_c_100_200 drop column e, drop column c, drop column a;
+alter table part_c_100_200 add column c numeric, add column e varchar, add column a text;
+alter table part_c_100_200 drop column b;
+alter table part_c_100_200 add column b bigint;
+create table part_d_1_15 partition of part_c_100_200 for values from (1) to (15);
+create table part_d_15_20 partition of part_c_100_200 for values from (15) to (20);
+
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+
+create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_100_200 set c = c - 20, d = c where c = 105;
+-- fail (no partition key update, so no attempt to move tuple, but "a = 'a'" violates partition constraint enforced by root partition)
+update part_b_10_b_20 set a = 'a';
+-- success; partition key update, no constraint violation
+update range_parted set d = d - 10 where d > 10;
+-- success; no partition key update, no constraint violation
+update range_parted set e = d;
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+:show_data;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+:show_data;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement , check option passes
+update upview set a = 'b', b = 15 where b = 4;
+
+:show_data;
+
+-- cleanup
+drop view upview;
+
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+:show_data;
+
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+
+-- Enabling OLD TABLE capture for both DELETE as well as UPDATE stmt triggers
+-- should not cause DELETEd rows to be captured twice. The same applies to
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+
+-- Install BR triggers on child partition, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d1_15 before update or insert on part_d_1_15
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d15_20 before update or insert on part_d_15_20
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_d1_15 ON part_d_1_15;
+drop trigger trig_d15_20 ON part_d_15_20;
+drop function func_parted_mod_b();
+
+-- RLS policies with update-row-movement
+-----------------------------------------
+
+ALTER TABLE range_parted ENABLE ROW LEVEL SECURITY;
+create user regress_range_parted_user;
+GRANT ALL ON range_parted, mintab TO regress_range_parted_user;
+create policy seeall ON range_parted as PERMISSIVE for SELECT using ( true);
+create policy policy_range_parted on range_parted for UPDATE USING (true) WITH CHECK (c % 2 = 0);
+
+:init_range_parted;
+-- This should fail with RLS violation error while moving row from
+-- part_a_10_a_20 to part_d_1_15, because we are setting 'c' to an odd number.
+set session authorization regress_range_parted_user;
+update range_parted set a = 'b', c = 151 where a = 'a' and c = 200;
+
+reset session authorization;
+-- Create a trigger on part_d_1_15
+create function func_d_1_15() returns trigger as $$
+begin
+   NEW.c = NEW.c + 1; -- Make even number odd, or vice versa
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_d_1_15 before insert on part_d_1_15
+   for each row execute procedure func_d_1_15();
+
+:init_range_parted;
+set session authorization regress_range_parted_user;
+
+-- Here, RLS checks should succeed while moving row from part_a_10_a_20 to
+-- part_d_1_15, because trigger makes 'c' value an even number.
+update range_parted set a = 'b', c = 151 where a = 'a' and c = 200;
+
+reset session authorization;
+:init_range_parted;
+set session authorization regress_range_parted_user;
+-- This should fail with RLS violation error because trigger makes 'c' value
+-- an odd number.
+update range_parted set a = 'b', c = 150 where a = 'a' and c = 200;
+
+-- Cleanup
+reset session authorization;
+drop trigger trig_d_1_15 ON part_d_1_15;
+drop function func_d_1_15();
+
+-- Policy expression contains SubPlan
+reset session authorization;
+:init_range_parted;
+create policy policy_range_parted_subplan on range_parted AS RESTRICTIVE for UPDATE USING (true)
+    WITH CHECK ((SELECT range_parted.c <= c1 FROM mintab ));
+set session authorization regress_range_parted_user;
+-- Should fail because mintab has row with c1 = 120
+update range_parted set a = 'b', c = 122 where a = 'a' and c = 200;
+-- Should pass
+update range_parted set a = 'b', c = 120 where a = 'a' and c = 200;
+
+-- RLS policy expression contains whole row.
+
+reset session authorization;
+:init_range_parted;
+create policy policy_range_parted_wholerow on range_parted AS RESTRICTIVE for UPDATE USING (true)
+   WITH CHECK (range_parted = row('b', 10, 112, 1, NULL)::range_parted);
+set session authorization regress_range_parted_user;
+-- Should succeed the RLS check
+update range_parted set a = 'b', c = 112 where a = 'a' and c = 200;
+reset session authorization;
+:init_range_parted;
+set session authorization regress_range_parted_user;
+-- The whole row RLS check should fail
+update range_parted set a = 'b', c = 116 where a = 'a' and c = 200;
+
+-- Cleanup
+reset session authorization;
+drop policy policy_range_parted ON range_parted;
+drop policy policy_range_parted_subplan ON range_parted;
+drop policy policy_range_parted_wholerow ON range_parted;
+revoke all ON range_parted, mintab FROM regress_range_parted_user ;
+drop user regress_range_parted_user;
+
+
+-- statement triggers with update row movement
+---------------------------------------------------
+
+:init_range_parted;
+
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_d_1_15
+create trigger d1_delete_trig
+  after delete on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_update_trig
+  after update on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_insert_trig
+  after insert on part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+create trigger d15_delete_trig
+  after delete on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_update_trig
+  after update on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_insert_trig
+  after insert on part_d_15_20 for each statement execute procedure trigfunc();
+
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+:show_data;
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger d1_delete_trig ON part_d_1_15;
+drop trigger d1_update_trig ON part_d_1_15;
+drop trigger d1_insert_trig ON part_d_1_15;
+drop trigger d15_delete_trig ON part_d_15_20;
+drop trigger d15_update_trig ON part_d_15_20;
+drop trigger d15_insert_trig ON part_d_15_20;
+
+drop table mintab;
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
 
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
 insert into range_parted values ('c', 9);
@@ -134,6 +420,21 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 
+:show_data;
+
+-- Update row movement from non-default to default partition.
+-- Fail, default partition is not under part_a_10_a_20;
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+:show_data;
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+:show_data;
+
 create table list_parted (
 	a text,
 	b int
@@ -148,6 +449,84 @@ update list_default set a = 'a' where a = 'd';
 -- ok
 update list_default set a = 'x' where a = 'd';
 
+drop table list_parted;
+
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+
+
+-----------
+-- Triggers can cause UPDATE row movement if they modify the partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger parted_mod_b ON sub_part1 ;
+
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger trig_skip_delete ON sub_part1 ;
+
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any duplicate rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+drop table non_parted;
+
+drop function func_parted_mod_b();
+
 -- create custom operator class and hash function, for the same reason
 -- explained in alter_table.sql
 create or replace function dummy_hashint4(a int4, seed int8) returns int8 as
@@ -169,6 +548,7 @@ insert into hpart4 values (3, 4);
 
 -- fail
 update hpart1 set a = 3, b=4 where a = 1;
+-- ok : row movement
 update hash_parted set b = b - 1 where b = 1;
 -- ok
 update hash_parted set b = b + 8 where b = 1;
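
The semantics these tests exercise can be summarized in a toy model: an UPDATE that changes the partition key is carried out as a DELETE from the source partition plus an INSERT into the destination, and routing searches only the partition subtree named in the UPDATE. The sketch below is plain Python with invented names (Partition, route, update_key); it illustrates the behavior only and is not PostgreSQL code:

```python
class Partition:
    """Toy range partition holding rows keyed by a (text, int) tuple."""
    def __init__(self, name, lower, upper):
        self.name = name
        self.lower = lower   # inclusive lower bound, e.g. ("a", 1)
        self.upper = upper   # exclusive upper bound, e.g. ("a", 10)
        self.rows = []

    def accepts(self, key):
        return self.lower <= key < self.upper

def route(partitions, key):
    """Find the partition in the given subtree that accepts key."""
    for p in partitions:
        if p.accepts(key):
            return p
    raise ValueError("new row violates partition constraint: %r" % (key,))

def update_key(partitions, old_key, new_key):
    """UPDATE of a partition key = DELETE from old + INSERT into new."""
    src = route(partitions, old_key)
    dst = route(partitions, new_key)   # raises before any change if no fit
    src.rows.remove(old_key)           # the DELETE half
    dst.rows.append(new_key)           # the INSERT half
    return src.name, dst.name

subtree = [Partition("part_a_1_a_10", ("a", 1), ("a", 10)),
           Partition("part_a_10_a_20", ("a", 10), ("a", 20))]
route(subtree, ("a", 1)).rows.append(("a", 1))
print(update_key(subtree, ("a", 1), ("a", 15)))
# -> ('part_a_1_a_10', 'part_a_10_a_20')
```

Routing here searches only the list it is given, mirroring the "row movement happens only within the partition subtree" failure cases above: a key that fits no partition in the subtree raises an error instead of escaping to a sibling subtree.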
#215Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#212)
Re: [HACKERS] UPDATE of partition key

On 16 December 2017 at 03:09, Robert Haas <robertmhaas@gmail.com> wrote:

started another review pass over the main patch, so here are
some comments about that.

I am yet to address all the comments, but meanwhile, below are some
specific points ...

+ if (!partrel)
+ {
+ /*
+ * We locked all the partitions above including the leaf
+ * partitions. Note that each of the newly opened relations in
+ * *partitions are eventually closed by the caller.
+ */
+ partrel = heap_open(leaf_oid, NoLock);
+ InitResultRelInfo(leaf_part_rri,
+   partrel,
+   resultRTindex,
+   rel,
+   estate->es_instrument);
+ }

Hmm, isn't there a problem here? Before, we opened all the relations
here and the caller closed them all. But now, we're only opening some
of them. If the caller closes them all, then they will be closing
some that we opened and some that we didn't. That seems quite bad,
because the reference counts that are incremented and decremented by
opening and closing should all end up at 0. Maybe I'm confused
because it seems like this would break in any scenario where even 1
relation was already opened and surely you must have tested that
case... but if there's some reason this works, I don't know what it
is, and the comment doesn't tell me.

In ExecCleanupTupleRouting(), we are closing only those newly opened
partitions. We skip those which are actually part of the update result
rels.

+ /*
+ * UPDATEs set the transition capture map only when a new subplan
+ * is chosen.  But for INSERTs, it is set for each row. So after
+ * INSERT, we need to revert back to the map created for UPDATE;
+ * otherwise the next UPDATE will incorrectly use the one created
+ * for INESRT.  So first save the one created for UPDATE.
+ */
+ if (mtstate->mt_transition_capture)
+ saved_tcs_map = mtstate->mt_transition_capture->tcs_map;

I wonder if there is some more elegant way to handle this problem.
Basically, the issue is that ExecInsert() is stomping on
mtstate->mt_transition_capture, and your solution is to save and
restore the value you want to have there. But maybe we could instead
find a way to get ExecInsert() not to stomp on that state in the first
place. It seems like the ON CONFLICT stuff handled that by adding a
second TransitionCaptureState pointer to ModifyTable, thus
mt_transition_capture and mt_oc_transition_capture. By that
precedent, we could add mt_utr_transition_capture or similar, and
maybe that's the way to go. It seems a bit unsatisfying, but so does
what you have now.

In case of ON CONFLICT, if there are both INSERT and UPDATE statement
triggers referencing transition tables, both of the triggers need to
independently populate their own transition tables, and hence the need
for two separate transition states : mt_transition_capture and
mt_oc_transition_capture. But in case of update-tuple-routing, the
INSERT statement trigger won't come into the picture. So the same
mt_transition_capture can serve the purpose of populating the
transition table with OLD and NEW rows. So I think it would be too
redundant, if not incorrect, to have a whole new transition state for
update tuple routing.

I will see if it turns out better to have two tcs_maps in
TransitionCaptureState, one for update and one for insert. But this,
on first look, does not look good.
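
The problem under discussion reduces to a generic pattern: the inner operation (the INSERT half of row movement) overwrites a field that the outer operation (the UPDATE) still needs, so the caller snapshots and restores it. A minimal sketch in plain Python — the names echo tcs_map, but everything here is invented for illustration and is not executor code:

```python
class TransitionCapture:
    def __init__(self):
        self.tcs_map = None   # conversion map for the current subplan/row

def exec_insert(capture, per_row_map):
    # The insert path sets the map for each routed row, clobbering
    # whatever the UPDATE had put there.
    capture.tcs_map = per_row_map

def exec_update_row_movement(capture, update_map, insert_map):
    capture.tcs_map = update_map        # set once per UPDATE subplan
    saved = capture.tcs_map             # snapshot before the inner INSERT
    exec_insert(capture, insert_map)    # stomps on capture.tcs_map
    capture.tcs_map = saved             # restore for the next UPDATE row
    return capture.tcs_map

cap = TransitionCapture()
print(exec_update_row_movement(cap, "map_for_update", "map_for_insert"))
# -> map_for_update
```

The alternative being weighed in the thread amounts to giving the inner step its own field (as ON CONFLICT did with a second TransitionCaptureState pointer) so no snapshot is needed at all.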

+ * If per-leaf map is required and the map is already created, that map
+ * has to be per-leaf. If that map is per-subplan, we won't be able to
+ * access the maps leaf-partition-wise. But if the map is per-leaf, we
+ * will be able to access the maps subplan-wise using the
+ * subplan_partition_offsets map using function
+ * tupconv_map_for_subplan().  So if the callers might need to access
+ * the map both leaf-partition-wise and subplan-wise, they should make
+ * sure that the first time this function is called, it should be
+ * called with perleaf=true so that the map created is per-leaf, not
+ * per-subplan.

This sounds complicated and fragile. It ends up meaning that
mt_childparent_tupconv_maps is sometimes indexed by subplan number and
sometimes by partition leaf index, which is extremely confusing and
likely to lead to coding errors, either in this patch or in future
ones.

Even if we always index the map by leaf partition, while accessing the
map the code still needs to be aware of whether the index number with
which we are accessing the map is the subplan number or leaf partition
number:

If the access is by subplan number, use subplan_partition_offsets to
convert to the leaf partition index. So the function
tupconv_map_for_subplan() is anyways necessary for accessing using
subplan index. The only thing that will change is:
tupconv_map_for_subplan() will not have to check whether the map is
indexed by leaf partition or not. But that complexity is hidden in
this function alone; the outside code need not worry about that.

If the access is by leaf partition number, I think you are worried
here that the map might have been incorrectly indexed by subplan, and
the code might access it partition-wise. Currently we access the map
by leaf-partition-index only when setting up
mtstate->mt_*transition_capture->tcs_map during inserts. At that
place, there is an Assert(mtstate->mt_is_tupconv_perpart == true).
Maybe we can have another function tupconv_map_for_partition() rather
than directly accessing mt_childparent_tupconv_maps[], and have this
Assert() in that function. What do you say?

I am more inclined towards avoiding an always-leaf-partition-indexed
map for additional reasons mentioned below ...

Would it be reasonable to just always do this by partition leaf
index, even if we don't strictly need that set of mappings?

If there are no transition tables in the picture, we don't require
per-leaf child-parent conversion. So, this would mean that the tuple
conversion maps will be set up for all (say, 100) leaf partitions even
if there are only, say, a couple of update plans. I feel this would
unnecessarily increase the startup cost of update-partition-key
operation.
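
The two indexing schemes being weighed can be sketched abstractly: with a per-leaf array plus an offsets array, a subplan index is translated to a leaf-partition index before lookup, so one array serves both access paths. A toy illustration — the names echo subplan_partition_offsets from the patch, but the data and functions here are invented:

```python
# One conversion map per leaf partition (per-leaf indexing).
leaf_maps = ["map0", "map1", "map2", "map3", "map4"]
# subplan_partition_offsets[i] = leaf-partition index of subplan i.
subplan_partition_offsets = [1, 3]

def tupconv_map_for_subplan(i):
    # Translate the subplan index to a leaf index, then look up the
    # per-leaf array -- subplan-wise access without a second array.
    return leaf_maps[subplan_partition_offsets[i]]

def tupconv_map_for_partition(leaf_index):
    return leaf_maps[leaf_index]

print(tupconv_map_for_subplan(0), tupconv_map_for_partition(3))
# -> map1 map3
```

The startup-cost objection above is visible here too: the per-leaf array has one slot per leaf partition (five) even though only two subplans ever consult it.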

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

#216David Rowley
david.rowley@2ndquadrant.com
In reply to: Amit Khandekar (#214)
Re: [HACKERS] UPDATE of partition key

On 23 December 2017 at 04:00, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 15 December 2017 at 18:28, Robert Haas <robertmhaas@gmail.com> wrote:

-    PartitionDispatch **pd,
-    ResultRelInfo ***partitions,
-    TupleConversionMap ***tup_conv_maps,
-    TupleTableSlot **partition_tuple_slot,
-    int *num_parted, int *num_partitions)
+    PartitionTupleRouting **partition_tuple_routing)

Since we're consolidating all of ExecSetupPartitionTupleRouting's
output parameters into a single structure, I think it might make more
sense to have it just return that value. I think it's only done with
output parameter today because there are so many different things
being produced, and we can't return them all.

You mean ExecSetupPartitionTupleRouting() will return the structure
(not pointer to structure), and the caller will get the copy of the
structure like this ? :

mtstate->mt_partition_tuple_routing =
ExecSetupPartitionTupleRouting(mtstate, rel, node->nominalRelation, estate);

I am ok with that, but just wanted to confirm if that is what you are
saying. I don't recall seeing a structure return value in PG code, so
not sure if it is conventional in PG to do that. Hence, I am somewhat
inclined to keep it as output param. It also avoids a structure copy.

Another way is for ExecSetupPartitionTupleRouting() to palloc this
structure, and return its pointer, but then caller would have to
anyway do a structure copy, so that's not convenient, and I don't
think you are suggesting this way either.

I'm pretty sure Robert is suggesting that
ExecSetupPartitionTupleRouting pallocs the memory for the structure,
sets it up then returns a pointer to the new struct. That's not very
unusual. It seems unusual for a function to return void and modify a
single parameter pointer to get the value to the caller rather than
just to return that value.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#217Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#212)
1 attachment(s)
Re: [HACKERS] UPDATE of partition key

On 16 December 2017 at 03:09, Robert Haas <robertmhaas@gmail.com> wrote:

- map = ptr->partition_tupconv_maps[leaf_part_index];
+ map = ptr->parentchild_tupconv_maps[leaf_part_index];

I don't think there's any reason to rename this. In previous patch
versions, you had multiple arrays of tuple conversion maps in this
structure, but the refactoring eliminated that.

Done in an earlier version of the patch.

Likewise, I'm not sure I get the point of mt_transition_tupconv_maps
-> mt_childparent_tupconv_maps. That seems like it could similarly be
left alone.

We need to change its name because now this map is not only used for
transition capture, but also for update-tuple-routing. Does it look ok
to you if, for readability, we keep the childparent tag? Or else, we
can just make it "mt_tupconv_maps", but "mt_childparent_tupconv_maps"
looks more informative.

+ /*
+ * If transition tables are the only reason we're here, return. As
+ * mentioned above, we can also be here during update tuple routing in
+ * presence of transition tables, in which case this function is called
+ * separately for oldtup and newtup, so either can be NULL, not both.
+ */
if (trigdesc == NULL ||
(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
- (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+ (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+ (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))

I guess this is correct, but it seems awfully fragile. Can't we have
some more explicit signaling about whether we're only here for
transition tables, rather than deducing it based on exactly one of
oldtup and newtup being NULL?

I had given this some thought earlier. I felt that even the pre-existing
conditions like "!trigdesc->trig_update_after_row" are all indirect
ways to determine that this function is called only to capture
transition tables, and thought that it may have been better to have a
separate parameter transition_table_only.

But then decided that I can continue on similar lines and add another
such condition to indicate that we are only capturing update-routed
tuples.

Instead of adding another parameter to AfterTriggerSaveEvent(), I had
also considered another approach: Put the transition-tuples-capture
logic part of AfterTriggerSaveEvent() into a helper function
CaptureTransitionTables(). In ExecInsert() and ExecDelete(), instead
of calling ExecARUpdateTriggers(), call this function
CaptureTransitionTables(). I then dropped this idea and thought rather
to call ExecARUpdateTriggers() which neatly does the required checks
and other things like locking the old tuple via GetTupleForTrigger().
So if we go by CaptureTransitionTables(), we would need to do what
ExecARUpdateTriggers() does before calling CaptureTransitionTables().
This is doable. If you think this is worth doing so as to get rid of
the "(oldtup == NULL) ^ (newtup == NULL)" condition, we can do that.

+ /* Initialization specific to update */
+ if (mtstate && mtstate->operation == CMD_UPDATE)
+ {
+ ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+
+ is_update = true;
+ update_rri = mtstate->resultRelInfo;
+ num_update_rri = list_length(node->plans);
+ }

I guess I don't see why we need a separate "if" block for this.
Neither is_update nor update_rri nor num_update_rri are used until we
get to the block that begins with "if (is_update)". Why not just
change that block to test if (mtstate && mtstate->operation ==
CMD_UPDATE)" and put the rest of these initializations inside that
block?

Done.

+ int num_update_rri = 0,
+ update_rri_index = 0;
...
+ update_rri_index = 0;

It's already 0.

Done. Retained the comment that mentions why we need to set it to 0,
and added a note at the end that we have already done this during
initialization.

+ leaf_part_rri = &update_rri[update_rri_index];
...
+ leaf_part_rri = leaf_part_arr + i;

These are doing the same kind of thing, but using different styles. I
prefer the former style, so I'd change the second one to
&leaf_part_arr[i]. Alternatively, you could change the first one to
update_rri + update_rri_indx. But it's strange to see the same
variable initialized in two different ways just a few lines apart.

Done. Used the first style.

+static HeapTuple
+ConvertPartitionTupleSlot(ModifyTableState *mtstate,
+   TupleConversionMap *map,
+   HeapTuple tuple,
+   TupleTableSlot *new_slot,
+   TupleTableSlot **p_my_slot)

This function doesn't use the mtstate argument at all.

Removed mtstate.

+ * (Similarly we need to add the deleted row in OLD TABLE). We need to do

The period should be before, not after, the closing parenthesis.

Done.

+ * Now that we have already captured NEW TABLE row, any AR INSERT
+ * trigger should not again capture it below. Arrange for the same.

A more American style would be something like "We've already captured
the NEW TABLE row, so make sure any AR INSERT trigger fired below
doesn't capture it again." (Similarly for the other case.)

Done.

+ /* The delete has actually happened, so inform that to the caller */
+ if (tuple_deleted)
+ *tuple_deleted = true;

In the US, we inform the caller, not inform that to the caller. In
other words, here the direct object of "inform" is the person or thing
getting the information (in this case, "the caller"), not the
information being conveyed (in this case, "that"). I realize your
usage is probably typical for your country...

Changed it to "inform the caller about the same"

+ Assert(mtstate->mt_is_tupconv_perpart == true);

We usually just Assert(thing_that_should_be_true), not
Assert(thing_that_should_be_true == true).

Ok. Changed it to Assert(mtstate->mt_is_tupconv_perpart)

+ * In case this is part of update tuple routing, put this row into the
+ * transition OLD TABLE if we are capturing transition tables. We need to
+ * do this separately for DELETE and INSERT because they happen on
+ * different tables.

Maybe "...OLD table, but only if we are..."

Should it be capturing transition tables or capturing transition
tuples? I'm not sure.

Changed it to "capturing transition tuples". In trigger.c, I see this
short form notation as well as a long-form notation like "capturing
tuples in transition tables". But not seen anywhere "capturing
transition tables", and it does seem odd.

+ * partition, in which case, we should check the RLS CHECK policy just

In the US, the second comma in this sentence is incorrect and should be removed.

Done.

+ * When an UPDATE is run with a leaf partition, we would not have
+ * partition tuple routing setup. In that case, fail with

run with -> run on
would not -> will not
setup -> set up

Done.

+ * deleted by another transaction), then we should skip INSERT as
+ * well, otherwise, there will be effectively one new row inserted.

skip INSERT -> skip the insert
well, otherwise -> well; otherwise

I would also change "there will be effectively one new row inserted"
to "an UPDATE could cause an increase in the total number of rows
across all partitions, which is clearly wrong".

Done both.

+ /*
+ * UPDATEs set the transition capture map only when a new subplan
+ * is chosen.  But for INSERTs, it is set for each row. So after
+ * INSERT, we need to revert back to the map created for UPDATE;
+ * otherwise the next UPDATE will incorrectly use the one created
+ * for INESRT.  So first save the one created for UPDATE.
+ */
+ if (mtstate->mt_transition_capture)
+ saved_tcs_map = mtstate->mt_transition_capture->tcs_map;

UPDATEs -> Updates

Done. I believe you want to do this only when it's plural? In the
same paragraph, I also changed "INSERTs" to "inserts".

INESRT -> INSERT

Done.

+ * 2. For capturing transition tables that are partitions. For UPDATEs, we need

This isn't worded well. A transition table is never a partition;
transition tables and partitions are two different kinds of things.

Yeah. Changed it to :
"For capturing transition tuples when the target table is a partitioned table."

Attached v32 patch.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

update-partition-key_v32.tar.gz (application/x-gzip)
#218Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: David Rowley (#216)
Re: [HACKERS] UPDATE of partition key

On 2 January 2018 at 10:56, David Rowley <david.rowley@2ndquadrant.com> wrote:

On 23 December 2017 at 04:00, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 15 December 2017 at 18:28, Robert Haas <robertmhaas@gmail.com> wrote:

-    PartitionDispatch **pd,
-    ResultRelInfo ***partitions,
-    TupleConversionMap ***tup_conv_maps,
-    TupleTableSlot **partition_tuple_slot,
-    int *num_parted, int *num_partitions)
+    PartitionTupleRouting **partition_tuple_routing)

Since we're consolidating all of ExecSetupPartitionTupleRouting's
output parameters into a single structure, I think it might make more
sense to have it just return that value. I think it's only done with
output parameter today because there are so many different things
being produced, and we can't return them all.

You mean ExecSetupPartitionTupleRouting() will return the structure
(not pointer to structure), and the caller will get the copy of the
structure like this ? :

mtstate->mt_partition_tuple_routing =
ExecSetupPartitionTupleRouting(mtstate, rel, node->nominalRelation, estate);

I am ok with that, but just wanted to confirm if that is what you are
saying. I don't recall seeing a structure return value in PG code, so
not sure if it is conventional in PG to do that. Hence, I am somewhat
inclined to keep it as output param. It also avoids a structure copy.

Another way is for ExecSetupPartitionTupleRouting() to palloc this
structure, and return its pointer, but then caller would have to
anyway do a structure copy, so that's not convenient, and I don't
think you are suggesting this way either.

I'm pretty sure Robert is suggesting that
ExecSetupPartitionTupleRouting pallocs the memory for the structure,
sets it up then returns a pointer to the new struct. That's not very
unusual. It seems unusual for a function to return void and modify a
single parameter pointer to get the value to the caller rather than
just to return that value.

Sorry, my mistake. Earlier I was somehow under the impression that the
callers of ExecSetupPartitionTupleRouting() already have this
structure palloc'ed, and that they pass the address of this structure.
I now can see that both CopyStateData->partition_tuple_routing and
ModifyTableState->mt_partition_tuple_routing are pointers, not
structures. So it makes perfect sense for
ExecSetupPartitionTupleRouting() to palloc and return a pointer. Sorry
for the noise. Will share the change in an upcoming patch version.
Thanks!

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

#219Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#215)
Re: [HACKERS] UPDATE of partition key

On 1 January 2018 at 21:43, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 16 December 2017 at 03:09, Robert Haas <robertmhaas@gmail.com> wrote:

+ /*
+ * UPDATEs set the transition capture map only when a new subplan
+ * is chosen.  But for INSERTs, it is set for each row. So after
+ * INSERT, we need to revert back to the map created for UPDATE;
+ * otherwise the next UPDATE will incorrectly use the one created
+ * for INESRT.  So first save the one created for UPDATE.
+ */
+ if (mtstate->mt_transition_capture)
+ saved_tcs_map = mtstate->mt_transition_capture->tcs_map;

I wonder if there is some more elegant way to handle this problem.
Basically, the issue is that ExecInsert() is stomping on
mtstate->mt_transition_capture, and your solution is to save and
restore the value you want to have there. But maybe we could instead
find a way to get ExecInsert() not to stomp on that state in the first
place. It seems like the ON CONFLICT stuff handled that by adding a
second TransitionCaptureState pointer to ModifyTable, thus
mt_transition_capture and mt_oc_transition_capture. By that
precedent, we could add mt_utr_transition_capture or similar, and
maybe that's the way to go. It seems a bit unsatisfying, but so does
what you have now.

In case of ON CONFLICT, if there are both INSERT and UPDATE statement
triggers referencing transition tables, both of the triggers need to
independently populate their own transition tables, and hence the need
for two separate transition states : mt_transition_capture and
mt_oc_transition_capture. But in case of update-tuple-routing, the
INSERT statement trigger won't come into picture. So the same
mt_transition_capture can serve the purpose of populating the
transition table with OLD and NEW rows. So I think it would be too
redundant, if not incorrect, to have a whole new transition state for
update tuple routing.

I will see if it turns out better to have two tcs_maps in
TransitionCaptureState, one for update and one for insert. But this,
on first look, does not look good.

Suppose TransitionCaptureState has separate maps, upd_del_tcs_maps and
insert_tcs_maps for UPDATE/DELETE and INSERT events respectively. So
upd_del_tcs_maps will be updated only after we start with the next
UPDATE subplan, whereas insert_tcs_maps will keep on getting updated
for each row. So in AfterTriggerSaveEvent(), upd_del_tcs_maps would be
used when the event is TRIGGER_EVENT_[UPDATE/DELETE], and
insert_tcs_maps will be used when event == TRIGGER_EVENT_INSERT. But
the issue is: even if the event is TRIGGER_EVENT_UPDATE, we don't
know whether this is caused by a normal update or as part of an insert
into a new partition during a partition-key update. So blindly using
upd_del_tcs_maps is incorrect. If the event is caused by the latter, we
should use insert_tcs_maps rather than upd_del_tcs_maps. But we do not
have the information in trigger.c as to what caused this event.

So, overall, it would not work, and even if we make it work by passing
or storing some more information somewhere, the
AfterTriggerSaveEvent() logic will become too complicated.

So I can't think of anything else but to keep it the way I did, i.e.
reverting back the tcs_map once the insert finishes. We do a similar
thing for reverting back the estate->es_result_relation_info.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

#220Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Langote (#210)
1 attachment(s)
Re: [HACKERS] UPDATE of partition key

On 20 December 2017 at 11:52, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 14 December 2017 at 08:11, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

Regarding ExecSetupChildParentMap(), it seems to me that it could simply
be declared as

static void ExecSetupChildParentMap(ModifyTableState *mtstate);

Looking at the places from where it's called, it seems that you're just
extracting information from mtstate and passing the same for the rest of
its arguments.

Agreed. But the last parameter per_leaf might be necessary. I will
defer this until I address Robert's concern about the complexity of
the related code.

Removed those parameters, but kept perleaf. The map required for
update tuple routing is a per-subplan one despite the presence of
partition tuple routing. And we cannot deduce from mtstate whether
update tuple routing is in use. So for this case, the caller has to
explicitly specify that a per-subplan map has to be created.

tupconv_map_for_subplan() looks like it could be done as a macro.

Or may be inline function. I will again defer this for similar reason
as the above deferred item about ExecSetupChildParentMap parameters.

Made it inline.

Did the above changes in attached update-partition-key_v33.patch

On 3 January 2018 at 11:42, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 2 January 2018 at 10:56, David Rowley <david.rowley@2ndquadrant.com> wrote:

I'm pretty sure Robert is suggesting that
ExecSetupPartitionTupleRouting pallocs the memory for the structure,
sets it up then returns a pointer to the new struct. That's not very
unusual. It seems unusual for a function to return void and modify a
single parameter pointer to get the value to the caller rather than
just to return that value.

Sorry, my mistake. Earlier I was somehow under the impression that the
callers of ExecSetupPartitionTupleRouting() already have this
structure palloc'ed, and that they pass the address of this structure.
I now can see that both CopyStateData->partition_tuple_routing and
ModifyTableState->mt_partition_tuple_routing are pointers, not
structures. So it makes perfect sense for
ExecSetupPartitionTupleRouting() to palloc and return a pointer. Sorry
for the noise. Will share the change in an upcoming patch version.
Thanks!

ExecSetupPartitionTupleRouting() now returns PartitionTupleRouting *.

Did this change in v3 version of
0001-Encapsulate-partition-related-info-in-a-structure.patch

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

update-partition-key_v33.patch.tar.gz (application/x-gzip)
�;I��;��!�FTf(��L��p`7[��th���4�s��xl�[���(J���e8�=��U�_�/�<R^�O��
h'������IM�Flu{���^&�f���;�o��f���t�L�9f�*nX�&�/n�J���B��"d���T"[����p�D��t���BD�����
H������A �c�X�9/�Z�|��BI7�%`\;��zm&�9������|Z�zP?EN| �c��0i:
��e@@C������J���@��5z�PU��x�`�B�cz�k&mt��(��z�1���~/N~�i�mJ��6�w��jd�����s��#��(��~:z���{���z������g(D%!^_��������
o����<�7c���p�u�G������ll��+��Yp�h�5b�R�7(�8�J{W�0�qQjv�m@�������c���,�O��&�;��a��U���Bp��2��p����N]��|��� �ALv�����W�K`X��B<��Q�7a�
�`F#�`-�8v��QD1`d�#�&]��D��Z
hq|q�A��,����aL��C�Do^�?�y�,0X�X�8~�x����������/�O�.YC��W�U�o�	���	����}��5��(�w#������E�
�{k�~:=�Y,6,�w��{����0�.��{������S�(��������u)�a����E�1{�S��cW%��=Z���O%)�7�wMPUM���/?D�+��|6L.gf�.������
�����i2���<EE�X��?EKv�����h���k6�e�5�m\���6����M�k���%>��

�f�~��{7'��fr{?J���-���j����{5f�v���{/=qD�"=|����X\���|x1��H�S��������b�G�^������[��1�-�!*�e1�!�Z]���C�26����A��f�����`��������Vbc�|��U-JE	�l���S��"1���@X�O��%KC��J�������9�^��zvM���@�"����D�:��
������`�J?y�H��!��^������u�Ac!>q��'^��8�?����������x#�[z��c���Dzzmvmju���+�j���D�������K������&�d���s�!�N��#���H�4WP�-����G��)���a\���9�W�5������3�?�u�	�|��+W�7
`e
���l	��s�k������]:�z��}������"4��+������s6�/^�u4������?>�X����8eW����C��\���`������S_�&O��5�����?���X�_�"������O��u�
�O��?E��	C}6� M�sS�c������E��l'i�����Z�b��m��v}��e���S��� ��^V��n7�v���e�M����a���Ca��v�����h'�|2){i2����y�m�Z����~P�{�����:aSt{�L�p��c���K���\#H�(���k��I��U�\��>�U�X{�y��_��#����&��B�����A������~�w~@G��<�s.��1������1��@G�V�����x%�x�
�Tq�	]7�4l����+{/l�Q��m5�����Vy�+�j�Du���Tr�<����5T���[�s0�;�6F�S�����1"b�PGOY%����aT��T�����Yp�`�:[4IM>��q��\E�8(S�j67_��B%�4F�1�U2�� Z���������$�1�~6�I�3(N�xxj:�:6xp@��?��'s%��t��#� #O
�<�Ly��2�C����:~lu��m|���RGo�M��rOA���f��j���-���d�kc;�����g�������v �bb�N9j�n=0�����Re�-4A�o!��C�;�J�[���
`��y+PE�Z���Y��8��/6��(������h��c��?�FG|�������|�e��m6T�~�}��M�f��]��D�^?y���9{��H��"��b;��I�F08�r���8����c��Cr��``X;�j�V-�V�)�v�N�������:F-:zJ>��a>*�M���d�c�8n��o��*�^����ZC68�%�y7gy~������2��=���|��H����E��U��KA+B���������������f>E����zE<�|�etMg�pk8
Q��?��h�RX����	������$�L��M&�'H_P��[m��es���Z���@���Vlmim/���*SV�]��LY�v}�.���$
�;Z�`�X����Fi������Z��p�Tf��<���[E/k���/��_��/���$1�����"��G������x����R����c\����\��p�tI�DXs����0�+����UO��}m���\9Q6+�W��H��Y�w�,������`���?n�w�5���
��` �a�Y�#�r�������gUj����t�����z��Dc,a��g�'{����A���)�_���s9(�.wv~yz|"�zMI��l����z�f��S�:�W[��`{
b�U���y�M��~������zNG=��_�c'<p9a�Q��r��~Os�L��x�c&�4�FOfb�`G��}�a]���$?���>��L������J����+���o���9(
�X�A}wX<�}xN}�]�q��R�>������Po��0�������[?q��"��E�5�/����(�����t�������S��u��d��-�'�v���NL�K6�u��d=�������p���zOk��WB{`.3�O+��B30���7�g�H3+�M�\��Z7�GA=.&��h�-�&��b3�D��������#�oq�UJ����!�_��?�!&_&q���jDg�0Lb% �L'��S,d�83Ix��g��X�m��v�d$��V������2�	d����������!e��AA�������S�+]��c%M
?���Oo��G3Cq2�$?I��-��)
����Y6�:�*�����O�Qh[2���
,{����'I�'g���������N�z�O��8�����Brf+Yd�kK��/Gg�������Z��#����������	�8v������������t���S&��PBH�r�7�e��b5t�Z����Z������������^#���`V{��������^�\���!��XKi}�J2v���'r�J�P�$��L�G���4���1D@1N#�(A������2�B�Z�12e�GWD�iG�3{	$�1<
>S��h�Wr�(a	�������U�P���Js��&t��D9�.��F,��a|���� D=��A3�_���2
<��
j��Xdy��$_7�z��1�2�"��H�?{��)=N���]J�Sr��5;����_��fS�n��8�����������v2�U������3�*��M��4��hT:[[;:����j�����2WR;m��{'�//N�/�I�p���g�@%>���?��%�b� ���;�5Q�7�Q�w�w@��n�O�(��Z�&����4��%�W]x������
]��:�ODp���P�IpV/a�2q�Z���/^X��P�<<���<Z��T3��>/����71�~���}��#�(gm��3$�^����j{�o]�/A�P*�e���'���,�_D����qG�j��tZ*G5 �1�+��=!�������?PE�����������v�vL�^��
y�
�S��<�L��Q�?��g���<3����3���S�+�3�Q3o
,����C</g�w������
��������M=K���q������L���+6��C�
�[t��2o��s�-:��{`�����`y��8�1�@�tz�4����{gP��&�d�A%��:��m:=�1��?�����OM����g���(���!9���EhY�BB��9ae��a���;�#3�eaxZ;#�j=NI�Kd�<���I�Hk�M��pI��E����jn�g�y�
wL|(���U�9YFO��Q}�N�Ow����n���d���M��b=�R]��T�zL8b� r%l��#������������{��/A_�?�U$k�@�h�)5`��A�_���
Vf#���~E��K������N8c�-^����@��$�<�:��>��M�XjH���j�*)�������+qr�X��^9B����5��9k{��o������oIo�u ��2�Gyl�as�A�wza���'o�~|{�W�mt���$D$�T�������D����W&|-��Ea��������8��w~�e5o�[v��f�'6w�?{�O=�M�%�����fC�7/Vc��VJI/��0����0Z�}��>`]����;������nc!�^�Q2����F�pi�[�t�}�m��7��Mc��H��8&��J\����k����_i���GS�d������<�����~���9�f>�I���])��=�`k\�
	�n=�c�����WHS�8�X$^����xl�hi���Y��c�^ �c�0���������9�����:=�3H�h��k��p���&i{�h�F�F�������
�7�_QQ��D0����q*x��_�O�
7����3!�=���c��U���v��DT�l��}2��q//���e]J!Rx0�I��1��W�z����n#�S��%�"�-�l.��h������%�f�Xz9mA(���8���&�##@�_�(�)�18�h=�]Y&��#`�5��v�,Y4J	�uhR���Q����I{pk�n�n��&���,���y�=Z�{v����=I��U�V��"-��^�ki����6�6lU5,0�hH��F������
�!����A�/ED�>��� �������R��[/�$�|��>c|�;��;�-WSd�H"�X��L�&K������q,PyL���F�R���:���
����+[T)���#}*#����*��p��K�!pPu��i����b�oc���3���:����bo���������)������2�W$Pwz�iv]c�/����FHqv�^�A:���6����l~����h$����}	B���K���~�R�u��.a.j��r��Ti�.��+���U@T=����t��-��L(C�-o��
��K���j��	��/�`�UI��w�u���q~f��������DP7-aA���p���M��!r���8�?�.fHP�p�����
1�����
W�U{OHL���)6������f�!�2����D�����d�NMC|�7�%��f;��l'��`.L%��gsT�0�TL�`3u x�q��-�ht+�����e�5�*���ZO�`�	ZC���.@nX����D�^�g#.ku�+k`w�>J�n�e �2���P��1�d���|A����i�$S��Q-�J���(E�U�-V�z�24�o�^iVS_��b�[Qz�f��+��
b/�h-��[.m)h'����nT?U�@?���F�G�QC1���L3Ilv���������}~U!st���d��	���t�oV^��5���&y��{v�����2����JjW�/;�h����P5����;T\
VL���c��rKp�3I���U�+��=B��/�.�(�E�O�h��*�v.����^�^��G=���]���#����
K96��V����v���7.��}�R�����c9��r�q���A9�����Vb�Q���f��|,%�XJ�s������r��?CA�/X������j=���E�����H��)J��MJ�|��7��"�'���Y5N���8o�E*^<����5/���_9��k<�<�m���~o`��p��p���p����+��D�����;��Cf~D�9�N�N�����c��A�a���O���)��#��#�����Q��z-� ��z��#�u!�#�u�3���$��#�#��%�������A�A�7YN�Y��&�k�b� oADY-����j���s_%v��;A������@����V��[&�|-��_9�+���x��9@����,����l;�4/6����UC��;��}i���@���P�j��(��#8���)��#��FV� V�N V��B=CY���0T�R����l�����Z[� �R�zPwi'XH���������������������������������������������������l���
#221David Rowley
david.rowley@2ndquadrant.com
In reply to: Amit Khandekar (#220)
Re: [HACKERS] UPDATE of partition key

On 3 January 2018 at 11:42, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

[...] So it makes perfect sense for
ExecSetupPartitionTupleRouting() to palloc and return a pointer. Sorry
for the noise. Will share the change in an upcoming patch version.
Thanks!

ExecSetupPartitionTupleRouting() now returns PartitionTupleRouting *.

Thanks for changing. I've just done almost a complete review of v32.
(v33 came along a bit sooner than I thought).

I've not finished looking at the regression tests yet, but here are a
few things, some may have been changed in v33, I've not looked yet.
Also apologies in advance if anything seems nitpicky.

1. "by INSERT" -> "by an INSERT" in:

from the original partition followed by <command>INSERT</command> into the

2. "and INSERT" -> "and an INSERT" in:

a <command>DELETE</command> and <command>INSERT</command>. As far as

3. "due partition-key change" -> "due to the partition-key being changed" in:

* capture is happening for UPDATEd rows being moved to another partition due
* partition-key change, then this function is called once when the row is

4. "inserted to another" -> "inserted into another" in:

* deleted (to capture OLD row), and once when the row is inserted to another

5. "for UPDATE event" -> "for an UPDATE event" (singular), or -> "for
UPDATE events" (plural)

* oldtup and newtup are non-NULL. But for UPDATE event fired for

I'm unsure if you need singular or plural. It perhaps does not matter.

6. "for row" -> "for a row" in:

* movement, oldtup is NULL when the event is for row being inserted,

Likewise in:

* whereas newtup is NULL when the event is for row being deleted.

7. In the following fragment the code does not do what the comment says:

/*
* If transition tables are the only reason we're here, return. As
* mentioned above, we can also be here during update tuple routing in
* presence of transition tables, in which case this function is called
* separately for oldtup and newtup, so either can be NULL, not both.
*/
if (trigdesc == NULL ||
(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
(event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
return;

With the comment; "so either can be NULL, not both.", I'd expect a
boolean OR not an XOR.

maybe the comment is better written as:

"so we expect exactly one of them to be non-NULL"

(I know you've been discussing with Robert, so I've not checked v33 to
see if this still exists)

8. I'm struggling to make sense of this:

/*
* Save a tuple conversion map to convert a tuple routed to this
* partition from the parent's type to the partition's.
*/

Maybe it's better to write this as:

/*
* Generate a tuple conversion map to convert tuples of the parent's
* type into the partition's type.
*/

9. insert should be capitalised here and should be prefixed with "an":

/*
* Verify result relation is a valid target for insert operation. Even
* for updates, we are doing this for tuple-routing, so again, we need
* to check the validity for insert operation.
*/
CheckValidResultRel(leaf_part_rri, CMD_INSERT);

Maybe it's better to write:

/*
* Verify result relation is a valid target for an INSERT. An UPDATE of
* a partition-key becomes a DELETE/INSERT operation, so this check is
* still required when the operation is CMD_UPDATE.
*/

10. The following code would be more clear if you replaced
mtstate->mt_transition_capture with transition_capture.

if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
&& mtstate->mt_transition_capture->tcs_update_new_table)
{
ExecARUpdateTriggers(estate, resultRelInfo, NULL,
NULL,
tuple,
NULL,
mtstate->mt_transition_capture);

/*
* Now that we have already captured NEW TABLE row, any AR INSERT
* trigger should not again capture it below. Arrange for the same.
*/
transition_capture = NULL;
}

You are, after all, doing:

transition_capture = mtstate->mt_transition_capture;

at the top of the function. There are a few other places you're also
accessing mtstate->mt_transition_capture.

11. Should tuple_deleted and process_returning be camelCase like the
other params?

static TupleTableSlot *
ExecDelete(ModifyTableState *mtstate,
ItemPointer tupleid,
HeapTuple oldtuple,
TupleTableSlot *planSlot,
EPQState *epqstate,
EState *estate,
bool *tuple_deleted,
bool process_returning,
bool canSetTag)

12. The following comment talks about "target table descriptor", which
I think is a good term. In several other places, you mention "root",
which I take it to mean "target table".

* This map array is required for two purposes :
* 1. For update-tuple-routing. We need to convert the tuple from the subplan
* result rel to the root partitioned table descriptor.
* 2. For capturing transition tuples when the target table is a partitioned
* table. For updates, we need to convert the tuple from subplan result rel to
* target table descriptor, and for inserts, we need to convert the inserted
* tuple from leaf partition to the target table descriptor.

I'd personally rather we always talked about "target" rather than
"root". I understand there's probably many places in the code
where we talk about the target table as "root", but I really think we
need to fix that, so I'd rather not see the problem get any worse
before it gets better.

The comment block might also look better if you tab indent after the
1. and 2. then on each line below it.
Also the space before the ':' is not correct.

13. Does the following code really need to palloc0 rather than just palloc?

/*
* Build array of conversion maps from each child's TupleDesc to the
* one used in the tuplestore. The map pointers may be NULL when no
* conversion is necessary, which is hopefully a common case for
* partitions.
*/
mtstate->mt_childparent_tupconv_maps = (TupleConversionMap **)
palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);

I don't see any case in the initialization of the array where any of
the elements are not assigned a value, so I think palloc() is fine.

14. I don't really like the way tupconv_map_for_subplan() works. It
would be nice to have two separate functions for this, but looking a
bit more at it, it seems the caller won't just need to always call
exactly one of those functions. I don't have any ideas to improve it,
so this is just a note.

15. I still don't really like the way ExecInitModifyTable() sets and
unsets update_tuple_routing_needed. I know we talked about this
before, but couldn't you just change:

if (resultRelInfo->ri_TrigDesc &&
resultRelInfo->ri_TrigDesc->trig_update_before_row &&
operation == CMD_UPDATE)
update_tuple_routing_needed = true;

To:

if (resultRelInfo->ri_TrigDesc &&
resultRelInfo->ri_TrigDesc->trig_update_before_row &&
node->partitioned_rels != NIL &&
operation == CMD_UPDATE)
update_tuple_routing_needed = true;

and get rid of:

/*
* If it's not a partitioned table after all, UPDATE tuple routing should
* not be attempted.
*/
if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
update_tuple_routing_needed = false;

looking at inheritance_planner(), partitioned_rels is only set to a
non-NIL value if parent_rte->relkind == RELKIND_PARTITIONED_TABLE.

16. "named" -> "target" in:

* 'partKeyUpdated' is true if any partitioning columns are being updated,
* either from the named relation or a descendent partitioned table.

I guess we're calling this one of: root, named, target :-(

17. You still have the following comment in ModifyTableState but
you've moved all those fields out to PartitionTupleRouting:

/* Tuple-routing support info */

18. Should the following not be just called partKeyUpdate (without the 'd')?

bool partKeyUpdated; /* some part key in hierarchy updated */

This occurs in the planner were the part key is certainly being updated.

19. In pathnode.h you've named a parameter partColsUpdated, but the
function in the .c file calls it partKeyUpdated.

I'll try to look at the tests tomorrow and also do some testing. So
far I've only read the code and the docs.

Overall, the patch appears to look quite good. Good to see the various
cleanups going in like the new PartitionTupleRouting struct.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#222Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#219)
Re: [HACKERS] UPDATE of partition key

Robert, for tracking purposes, below I have consolidated your review
items on which we are yet to conclude. Let me know if you have more
comments on the points I made.

------------------
1. ExecUpdate() needs to revert back tcs_map value changed by ExecInsert()
------------------

+ /*
+ * UPDATEs set the transition capture map only when a new subplan
+ * is chosen.  But for INSERTs, it is set for each row. So after
+ * INSERT, we need to revert back to the map created for UPDATE;
+ * otherwise the next UPDATE will incorrectly use the one created
+ * for INSERT.  So first save the one created for UPDATE.
+ */
+ if (mtstate->mt_transition_capture)
+ saved_tcs_map = mtstate->mt_transition_capture->tcs_map;

I wonder if there is some more elegant way to handle this problem.
Basically, the issue is that ExecInsert() is stomping on
mtstate->mt_transition_capture, and your solution is to save and
restore the value you want to have there. But maybe we could instead
find a way to get ExecInsert() not to stomp on that state in the first
place. It seems like the ON CONFLICT stuff handled that by adding a
second TransitionCaptureState pointer to ModifyTable, thus
mt_transition_capture and mt_oc_transition_capture. By that
precedent, we could add mt_utr_transition_capture or similar, and
maybe that's the way to go. It seems a bit unsatisfying, but so does
what you have now.

In case of ON CONFLICT, if there are both INSERT and UPDATE statement
triggers referencing transition tables, both of the triggers need to
independently populate their own transition tables, and hence the need
for two separate transition states: mt_transition_capture and
mt_oc_transition_capture. But in the case of update-tuple-routing, the
INSERT statement trigger won't come into the picture. So the same
mt_transition_capture can serve the purpose of populating the
transition table with OLD and NEW rows. So I think it would be too
redundant, if not incorrect, to have a whole new transition state for
update tuple routing.

I will see if it turns out better to have two tcs_maps in
TransitionCaptureState, one for update and one for insert. But this,
on first look, does not look good.

Suppose TransitionCaptureState has separate maps, upd_del_tcs_maps and
insert_tcs_maps for UPDATE/DELETE and INSERT events respectively. So
upd_del_tcs_maps will be updated only after we start with the next
UPDATE subplan, whereas insert_tcs_maps will keep on getting updated
for each row. So in AfterTriggerSaveEvent(), upd_del_tcs_maps would be
used when the event is TRIGGER_EVENT_[UPDATE/DELETE], and
insert_tcs_maps will be used when event == TRIGGER_EVENT_INSERT. But
the issue is: even if the event is TRIGGER_EVENT_UPDATE, we don't
know whether it is caused by a normal update or by an insert into a
new partition during a partition-key update. So blindly using
upd_del_tcs_maps is incorrect. If the event is caused by the latter, we
should use insert_tcs_maps rather than upd_del_tcs_maps. But we do not
have the information in trigger.c as to what caused this event.

So, overall, it would not work, and even if we make it work by passing
or storing some more information somewhere, the
AfterTriggerSaveEvent() logic will become too complicated.

So I can't think of anything else but to keep it the way I did, i.e.
reverting the tcs_map once the insert finishes. We do a similar thing
when reverting estate->es_result_relation_info.

------------------
2. mt_childparent_tupconv_maps is indexed by subplan or partition leaf index.
------------------

+ * If per-leaf map is required and the map is already created, that map
+ * has to be per-leaf. If that map is per-subplan, we won't be able to
+ * access the maps leaf-partition-wise. But if the map is per-leaf, we
+ * will be able to access the maps subplan-wise using the
+ * subplan_partition_offsets map using function
+ * tupconv_map_for_subplan().  So if the callers might need to access
+ * the map both leaf-partition-wise and subplan-wise, they should make
+ * sure that the first time this function is called, it should be
+ * called with perleaf=true so that the map created is per-leaf, not
+ * per-subplan.

This sounds complicated and fragile. It ends up meaning that
mt_childparent_tupconv_maps is sometimes indexed by subplan number and
sometimes by partition leaf index, which is extremely confusing and
likely to lead to coding errors, either in this patch or in future
ones.

Even if we always index the map by leaf partition, while accessing the
map the code still needs to be aware of whether the index number with
which we are accessing the map is the subplan number or leaf partition
number:

If the access is by subplan number, use subplan_partition_offsets to
convert to the leaf partition index. So the function
tupconv_map_for_subplan() is anyway necessary for accessing using a
subplan index. The only thing that will change is:
tupconv_map_for_subplan() will not have to check whether the map is
indexed by leaf partition or not. But that complexity is hidden in
this function alone; the outside code need not worry about that.

If the access is by leaf partition number, I think you are worried
here that the map might have been incorrectly indexed by subplan, and
the code might access it partition-wise. Currently we access the map
by leaf-partition-index only when setting up
mtstate->mt_*transition_capture->tcs_map during inserts. At that
place, there is an Assert(mtstate->mt_is_tupconv_perpart == true).
Maybe we can have another function tupconv_map_for_partition() rather
than directly accessing mt_childparent_tupconv_maps[], and have this
Assert() in that function. What do you say?

I am more inclined towards avoiding an always-leaf-partition-indexed
map for additional reasons mentioned below ...

Would it be reasonable to just always do this by partition leaf
index, even if we don't strictly need that set of mappings?

If there are no transition tables in picture, we don't require
per-leaf child-parent conversion. So, this would mean that the tuple
conversion maps will be set up for all (say, 100) leaf partitions even
if there are only, say, a couple of update plans. I feel this would
unnecessarily increase the startup cost of update-partition-key
operation.

------------------
3. Renaming of mt_transition_tupconv_maps to mt_childparent_tupconv_maps
------------------

Likewise, I'm not sure I get the point of mt_transition_tupconv_maps
-> mt_childparent_tupconv_maps. That seems like it could similarly be
left alone.

We need to change its name because now this map is not only used for
transition capture, but also for update-tuple-routing. Does it look ok
to you if, for readability, we keep the childparent tag? Or else, we
can just make it "mt_tupconv_maps", but "mt_childparent_tupconv_maps"
looks more informative.

-------------------
4. Explicit signaling for "we are only here for transition tables"
-------------------

+ /*
+ * If transition tables are the only reason we're here, return. As
+ * mentioned above, we can also be here during update tuple routing in
+ * presence of transition tables, in which case this function is called
+ * separately for oldtup and newtup, so either can be NULL, not both.
+ */
if (trigdesc == NULL ||
(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
- (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+ (event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+ (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))

I guess this is correct, but it seems awfully fragile. Can't we have
some more explicit signaling about whether we're only here for
transition tables, rather than deducing it based on exactly one of
oldtup and newtup being NULL?

I had given this some thought earlier. I felt that even the pre-existing
conditions like "!trigdesc->trig_update_after_row" are all indirect
ways to determine that this function is called only to capture
transition tables, and thought that it may have been better to have a
separate parameter transition_table_only.

But then decided that I can continue on similar lines and add another
such condition to indicate that we are only capturing update-routed
tuples.

Instead of adding another parameter to AfterTriggerSaveEvent(), I had
also considered another approach: Put the transition-tuples-capture
logic part of AfterTriggerSaveEvent() into a helper function
CaptureTransitionTables(). In ExecInsert() and ExecDelete(), instead
of calling ExecARUpdateTriggers(), call this function
CaptureTransitionTables(). I then dropped this idea and decided instead
to call ExecARUpdateTriggers(), which neatly does the required checks
and other things like locking the old tuple via GetTupleForTrigger().
So if we go by CaptureTransitionTables(), we would need to do what
ExecARUpdateTriggers() does before calling CaptureTransitionTables().
This is doable. If you think this is worth doing so as to get rid of
the "(oldtup == NULL) ^ (newtup == NULL)" condition, we can do that.

#223David Rowley
david.rowley@2ndquadrant.com
In reply to: David Rowley (#221)
Re: [HACKERS] UPDATE of partition key

On 4 January 2018 at 02:52, David Rowley <david.rowley@2ndquadrant.com> wrote:

I'll try to look at the tests tomorrow and also do some testing. So
far I've only read the code and the docs.

There are a few more things I noticed on another pass I made today:

20. "carried" -> "carried out the"

+       would have identified the newly updated row and carried
+       <command>UPDATE</command>/<command>DELETE</command> on this new row

21. Extra new line

+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
   </para>

22. In copy.c CopyFrom() you have the following code:

/*
* We might need to convert from the parent rowtype to the
* partition rowtype.
*/
map = proute->partition_tupconv_maps[leaf_part_index];
if (map)
{
Relation partrel = resultRelInfo->ri_RelationDesc;

tuple = do_convert_tuple(tuple, map);

/*
* We must use the partition's tuple descriptor from this
* point on. Use a dedicated slot from this point on until
* we're finished dealing with the partition.
*/
slot = proute->partition_tuple_slot;
Assert(slot != NULL);
ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
ExecStoreTuple(tuple, slot, InvalidBuffer, true);
}

Should this use ConvertPartitionTupleSlot() instead?

23. Why write;

last_resultRelInfo = mtstate->resultRelInfo + mtstate->mt_nplans;

when you can write;

last_resultRelInfo = &mtstate->resultRelInfo[mtstate->mt_nplans];?

24. In ExecCleanupTupleRouting(), do you think that you could just
have a special case loop for (mtstate && mtstate->operation ==
CMD_UPDATE)?

/*
* If this result rel is one of the UPDATE subplan result rels, let
* ExecEndPlan() close it. For INSERT or COPY, this does not apply
* because leaf partition result rels are always newly allocated.
*/
if (is_update &&
resultRelInfo >= first_resultRelInfo &&
resultRelInfo < last_resultRelInfo)
continue;

Something like:

if (mtstate && mtstate->operation == CMD_UPDATE)
{
ResultRelInfo *first_resultRelInfo = mtstate->resultRelInfo;
ResultRelInfo *last_resultRelInfo =
&mtstate->resultRelInfo[mtstate->mt_nplans];

for (i = 0; i < proute->num_partitions; i++)
{
ResultRelInfo *resultRelInfo = proute->partitions[i];

/*
* Leave any resultRelInfos that belong to the UPDATE's subplan
* list. These will be closed during executor shutdown.
*/
if (resultRelInfo >= first_resultRelInfo &&
resultRelInfo < last_resultRelInfo)
continue;

ExecCloseIndices(resultRelInfo);
heap_close(resultRelInfo->ri_RelationDesc, NoLock);
}
}
else
{
for (i = 0; i < proute->num_partitions; i++)
{
ResultRelInfo *resultRelInfo = proute->partitions[i];

ExecCloseIndices(resultRelInfo);
heap_close(resultRelInfo->ri_RelationDesc, NoLock);
}
}

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#224Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#220)
Re: [HACKERS] UPDATE of partition key

On Wed, Jan 3, 2018 at 6:29 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

ExecSetupPartitionTupleRouting() now returns PartitionTupleRouting *.

Did this change in the v3 version of
0001-Encapsulate-partition-related-info-in-a-structure.patch

I'll have to come back to some of the other open issues, but 0001 and
0005 look good to me now, so I pushed those as a single commit after
fixing a few things that pgindent didn't like. I also think 0002 and
0003 look basically good, so I pushed those two as a single commit
also. But the comment changes in 0003 didn't seem extensive enough to
me so I made a few more changes there along the way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#225Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: David Rowley (#223)
2 attachment(s)
Re: [HACKERS] UPDATE of partition key

On 5 January 2018 at 03:04, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jan 3, 2018 at 6:29 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

ExecSetupPartitionTupleRouting() now returns PartitionTupleRouting *.

Did this change in the v3 version of
0001-Encapsulate-partition-related-info-in-a-structure.patch

I'll have to come back to some of the other open issues, but 0001 and
0005 look good to me now, so I pushed those as a single commit after
fixing a few things that pgindent didn't like. I also think 0002 and
0003 look basically good, so I pushed those two as a single commit
also. But the comment changes in 0003 didn't seem extensive enough to
me so I made a few more changes there along the way.

Thanks. Attached is a rebased update-partition-key_v34.patch, which
also has the changes as per David Rowley's review comments as
explained below.

The above patch is to be applied over the last remaining preparatory
patch, now named (and attached) :
0001-Refactor-CheckConstraint-related-code.patch

On 3 January 2018 at 19:22, David Rowley <david.rowley@2ndquadrant.com> wrote:

I've not finished looking at the regression tests yet, but here are a
few things, some may have been changed in v33, I've not looked yet.
Also apologies in advance if anything seems nitpicky.

No worries. In fact, it's good to do this right now; otherwise it's
difficult to notice and fix at a later point. Thanks.

1. "by INSERT" -> "by an INSERT" in:

from the original partition followed by <command>INSERT</command> into the

2. "and INSERT" -> "and an INSERT" in:

a <command>DELETE</command> and <command>INSERT</command>. As far as

3. "due partition-key change" -> "due to the partition-key being changed" in:

* capture is happening for UPDATEd rows being moved to another partition due
* partition-key change, then this function is called once when the row is

4. "inserted to another" -> "inserted into another" in:

* deleted (to capture OLD row), and once when the row is inserted to another

5. "for UPDATE event" -> "for an UPDATE event" (singular), or -> "for
UPDATE events" (plural)

* oldtup and newtup are non-NULL. But for UPDATE event fired for

I'm unsure if you need singular or plural. It perhaps does not matter.

6. "for row" -> "for a row" in:

* movement, oldtup is NULL when the event is for row being inserted,

Likewise in:

* whereas newtup is NULL when the event is for row being deleted.

Done all of the above.

7. In the following fragment the code does not do what the comment says:

/*
* If transition tables are the only reason we're here, return. As
* mentioned above, we can also be here during update tuple routing in
* presence of transition tables, in which case this function is called
* separately for oldtup and newtup, so either can be NULL, not both.
*/
if (trigdesc == NULL ||
(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
(event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
return;

With the comment; "so either can be NULL, not both.", I'd expect a
boolean OR not an XOR.

maybe the comment is better written as:

"so we expect exactly one of them to be non-NULL"

Ok. Made it: "so we expect exactly one of them to be NULL"

(I know you've been discussing with Robert, so I've not checked v33 to
see if this still exists)

Yes, it's not yet concluded.

8. I'm struggling to make sense of this:

/*
* Save a tuple conversion map to convert a tuple routed to this
* partition from the parent's type to the partition's.
*/

Maybe it's better to write this as:

/*
* Generate a tuple conversion map to convert tuples of the parent's
* type into the partition's type.
*/

This is existing code; not from my patch.

9. insert should be capitalised here and should be prefixed with "an":

/*
* Verify result relation is a valid target for insert operation. Even
* for updates, we are doing this for tuple-routing, so again, we need
* to check the validity for insert operation.
*/
CheckValidResultRel(leaf_part_rri, CMD_INSERT);

Maybe it's better to write:

/*
* Verify result relation is a valid target for an INSERT. An UPDATE of
* a partition-key becomes a DELETE/INSERT operation, so this check is
* still required when the operation is CMD_UPDATE.
*/

Done. Instead of DELETE/INSERT, used DELETE+INSERT.

10. The following code would be more clear if you replaced
mtstate->mt_transition_capture with transition_capture.

if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
&& mtstate->mt_transition_capture->tcs_update_new_table)
{
ExecARUpdateTriggers(estate, resultRelInfo, NULL,
NULL,
tuple,
NULL,
mtstate->mt_transition_capture);

/*
* Now that we have already captured NEW TABLE row, any AR INSERT
* trigger should not again capture it below. Arrange for the same.
*/
transition_capture = NULL;
}

You are, after all, doing:

transition_capture = mtstate->mt_transition_capture;

at the top of the function. There are a few other places you're also
accessing mtstate->mt_transition_capture.

Actually I wanted a temporary variable whose scope is limited to the
ExecARInsertTriggers() call. But because that wasn't possible, I had
to declare it at the top. I feel that if we use transition_capture
all over, and some future code below the NULL assignment starts using
transition_capture, it will wrongly get the left-over NULL value.

Instead, what I have done is use a special variable name just for
this purpose, ar_insert_trig_tcs, so that other code won't use this
variable, going by its name. I have also moved its assignment down
to where it is first used.

Similarly for ExecDelete(), used ar_delete_trig_tcs.

11. Should tuple_deleted and process_returning be camelCase like the
other params?:

static TupleTableSlot *
ExecDelete(ModifyTableState *mtstate,
ItemPointer tupleid,
HeapTuple oldtuple,
TupleTableSlot *planSlot,
EPQState *epqstate,
EState *estate,
bool *tuple_deleted,
bool process_returning,
bool canSetTag)

Done.

12. The following comment talks about "target table descriptor", which
I think is a good term. In several other places, you mention "root",
which I take it to mean "target table".

* This map array is required for two purposes :
* 1. For update-tuple-routing. We need to convert the tuple from the subplan
* result rel to the root partitioned table descriptor.
* 2. For capturing transition tuples when the target table is a partitioned
* table. For updates, we need to convert the tuple from subplan result rel to
* target table descriptor, and for inserts, we need to convert the inserted
* tuple from leaf partition to the target table descriptor.

I'd personally rather we always talked about "target" rather than
"root". I understand there's probably many places in the code
where we talk about the target table as "root", but I really think we
need to fix that, so I'd rather not see the problem get any worse
before it gets better.

I'm not sure that's true in all places. In some contexts it makes
sense to say "root" to emphasize that it is the root partitioned
table, e.g.:

+ * For ExecInsert(), make it look like we are inserting into the
+ * root.
+ */
+ Assert(mtstate->rootResultRelInfo != NULL);
+ estate->es_result_relation_info = mtstate->rootResultRelInfo;
+ * resultRelInfo is one of the per-subplan resultRelInfos.  So we
+ * should convert the tuple into root's tuple descriptor, since
+ * ExecInsert() starts the search from root.  The tuple conversion

The comment block might also look better if you tab indent after the
1. and 2. then on each line below it.

Used spaces instead of a tab, because the tab pushed the text too far
away from the numbers, which looked odd.

Also the space before the ':' is not correct.

Done

13. Does the following code really need to palloc0 rather than just palloc?

/*
* Build array of conversion maps from each child's TupleDesc to the
* one used in the tuplestore. The map pointers may be NULL when no
* conversion is necessary, which is hopefully a common case for
* partitions.
*/
mtstate->mt_childparent_tupconv_maps = (TupleConversionMap **)
palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);

I don't see any case in the initialization of the array where any of
the elements are not assigned a value, so I think palloc() is fine.

Right. Used palloc().

14. I don't really like the way tupconv_map_for_subplan() works. It
would be nice to have two separate functions for this, but looking a
bit more at it, it seems the caller won't just need to always call
exactly one of those functions. I don't have any ideas to improve it,
so this is just a note.

I am assuming you mean one function for the case where
mt_is_tupconv_perpart is true, and another for when it is not. The
idea is that the caller should not have to worry about whether the
map is stored per-subplan or not.

15. I still don't really like the way ExecInitModifyTable() sets and
unsets update_tuple_routing_needed. I know we talked about this
before, but couldn't you just change:

if (resultRelInfo->ri_TrigDesc &&
resultRelInfo->ri_TrigDesc->trig_update_before_row &&
operation == CMD_UPDATE)
update_tuple_routing_needed = true;

To:

if (resultRelInfo->ri_TrigDesc &&
resultRelInfo->ri_TrigDesc->trig_update_before_row &&
node->partitioned_rels != NIL &&
operation == CMD_UPDATE)
update_tuple_routing_needed = true;

and get rid of:
.....
if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
update_tuple_routing_needed = false;

looking at inheritance_planner(), partitioned_rels is only set to a
non-NIL value if parent_rte->relkind == RELKIND_PARTITIONED_TABLE.

Initially, update_tuple_routing_needed can already be true because of:
bool update_tuple_routing_needed = node->partKeyUpdated;

So if it's not a partitioned table and update_tuple_routing_needed is
set to true by the above declaration, the variable will remain true
if we don't check the relkind at the end, which means the final
conclusion will be that update tuple routing is required when it
really isn't. Now, I understand that node->partKeyUpdated will not be
true if it's not a partitioned table, but I think we had better play
it safe here. partKeyUpdated, as per its name, implies whether any of
the partition key columns are updated; it does not imply whether the
target table is a partitioned table or just a partition.

16. "named" -> "target" in:

* 'partKeyUpdated' is true if any partitioning columns are being updated,
* either from the named relation or a descendent partitioned table.

I guess we're calling this one of: root, named, target :-(

Changed it to:
* either from the target relation or a descendent partitioned table.

17. You still have the following comment in ModifyTableState but
you've moved all those fields out to PartitionTupleRouting:

/* Tuple-routing support info */

This comment applies to mt_partition_tuple_routing field.

18. Should the following not be just called partKeyUpdate (without the 'd')?

bool partKeyUpdated; /* some part key in hierarchy updated */

This occurs in the planner were the part key is certainly being updated.

Actually, the way it is named, it can mean the partition key "is
updated", "has been updated", or "is being updated", all of which
make sense. This sounds consistent with the name
RangeTblEntry->updatedCols, which means "which of the columns are
being updated".

19. In pathnode.h you've named a parameter partColsUpdated, but the
function in the .c file calls it partKeyUpdated.

Renamed partColsUpdated to partKeyUpdated.

I'll try to look at the tests tomorrow and also do some testing. So
far I've only read the code and the docs.

Thanks David. Your review is valuable.

20. "carried" -> "carried out the"

+       would have identified the newly updated row and carried
+       <command>UPDATE</command>/<command>DELETE</command> on this new row

Done.

21. Extra new line

+   <xref linkend="ddl-partitioning-declarative-limitations">.
+
</para>

Done.

I am not sure exactly when, but this line started giving compile
errors, seemingly because > should be />. Fixed it.

22. In copy.c CopyFrom() you have the following code:

/*
* We might need to convert from the parent rowtype to the
* partition rowtype.
*/
map = proute->partition_tupconv_maps[leaf_part_index];
if (map)
{
Relation partrel = resultRelInfo->ri_RelationDesc;

tuple = do_convert_tuple(tuple, map);

/*
* We must use the partition's tuple descriptor from this
* point on. Use a dedicated slot from this point on until
* we're finished dealing with the partition.
*/
slot = proute->partition_tuple_slot;
Assert(slot != NULL);
ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
ExecStoreTuple(tuple, slot, InvalidBuffer, true);
}

Should this use ConvertPartitionTupleSlot() instead?

I will have a look to see whether we can use
ConvertPartitionTupleSlot() here without any changes.
(TODO)

23. Why write;

last_resultRelInfo = mtstate->resultRelInfo + mtstate->mt_nplans;

when you can write;

last_resultRelInfo = mtstate->resultRelInfo[mtstate->mt_nplans];?

You meant this (with the &):

last_resultRelInfo = &mtstate->resultRelInfo[mtstate->mt_nplans];?

I think both are equally good and equally readable. In this case we
don't even want the array element itself, so why not just advance the
pointer by that offset.

24. In ExecCleanupTupleRouting(), do you think that you could just
have a special case loop for (mtstate && mtstate->operation ==
CMD_UPDATE)?

/*
* If this result rel is one of the UPDATE subplan result rels, let
* ExecEndPlan() close it. For INSERT or COPY, this does not apply
* because leaf partition result rels are always newly allocated.
*/
if (is_update &&
resultRelInfo >= first_resultRelInfo &&
resultRelInfo < last_resultRelInfo)
continue;

Something like:

if (mtstate && mtstate->operation == CMD_UPDATE)
{
ResultRelInfo *first_resultRelInfo = mtstate->resultRelInfo;
ResultRelInfo *last_resultRelInfo =
mtstate->resultRelInfo[mtstate->mt_nplans];

for (i = 0; i < proute->num_partitions; i++)
{
ResultRelInfo *resultRelInfo = proute->partitions[i];

/*
* Leave any resultRelInfos that belong to the UPDATE's subplan
* list. These will be closed during executor shutdown.
*/
if (resultRelInfo >= first_resultRelInfo &&
resultRelInfo < last_resultRelInfo)
continue;

ExecCloseIndices(resultRelInfo);
heap_close(resultRelInfo->ri_RelationDesc, NoLock);
}
}
else
{
for (i = 0; i < proute->num_partitions; i++)
{
ResultRelInfo *resultRelInfo = proute->partitions[i];

ExecCloseIndices(resultRelInfo);
heap_close(resultRelInfo->ri_RelationDesc, NoLock);
}
}

I thought it's not worth having two separate loops just to avoid one
if (is_update) check in the insert case. Although we would save one
is_update check per partition, this code does not run per-row.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

0001-Refactor-CheckConstraint-related-code.patch
From 08d1434d31e696b0b8f67c5335b3f2e9834252ec Mon Sep 17 00:00:00 2001
From: Amit Khandekar <amit.khandekar@enterprisedb.com>
Date: Fri, 5 Jan 2018 10:29:20 +0530
Subject: [PATCH] Refactor CheckConstraint() related code.

Don't make ExecPartitionCheck() abort if partition constraint check
fails. Instead, make it return false. This helps in cases where
we want to move the row to the right partition if the partition check
fails. The error-reporting code is now moved into separate function
ExecPartitionCheckEmitError().
---
 src/backend/commands/copy.c            |   2 +-
 src/backend/executor/execMain.c        | 107 +++++++++++++++++++--------------
 src/backend/executor/execPartition.c   |   5 +-
 src/backend/executor/execReplication.c |   4 +-
 src/backend/executor/nodeModifyTable.c |   4 +-
 src/include/executor/executor.h        |   7 ++-
 6 files changed, 74 insertions(+), 55 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 66cbff7..6bfca2a 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2731,7 +2731,7 @@ CopyFrom(CopyState cstate)
 
 				/* Check the constraints of the tuple */
 				if (cstate->rel->rd_att->constr || check_partition_constr)
-					ExecConstraints(resultRelInfo, slot, estate);
+					ExecConstraints(resultRelInfo, slot, estate, true);
 
 				if (useHeapMultiInsert)
 				{
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index d8bc502..16822e9 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1849,16 +1849,12 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  *
  * Exported in executor.h for outside use.
+ * Returns true if it meets the partition constraint, else returns false.
  */
-void
+bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 				   EState *estate)
 {
-	Relation	rel = resultRelInfo->ri_RelationDesc;
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	Bitmapset  *modifiedCols;
-	Bitmapset  *insertedCols;
-	Bitmapset  *updatedCols;
 	ExprContext *econtext;
 
 	/*
@@ -1886,52 +1882,69 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
 	 */
-	if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
-	{
-		char	   *val_desc;
-		Relation	orig_rel = rel;
+	return ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext);
+}
+
+/*
+ * ExecPartitionCheckEmitError - Form and emit an error message after a failed
+ * partition constraint check.
+ */
+void
+ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+							TupleTableSlot *slot,
+							EState *estate)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	Relation	orig_rel = rel;
+	TupleDesc	tupdesc = RelationGetDescr(rel);
+	char	   *val_desc;
+	Bitmapset  *modifiedCols;
+	Bitmapset  *insertedCols;
+	Bitmapset  *updatedCols;
 
-		/* See the comment above. */
-		if (resultRelInfo->ri_PartitionRoot)
+	/*
+	 * Need to first convert the tuple to the root partitioned table's row
+	 * type. For details, check similar comments in ExecConstraints().
+	 */
+	if (resultRelInfo->ri_PartitionRoot)
+	{
+		HeapTuple	tuple = ExecFetchSlotTuple(slot);
+		TupleDesc	old_tupdesc = RelationGetDescr(rel);
+		TupleConversionMap *map;
+
+		rel = resultRelInfo->ri_PartitionRoot;
+		tupdesc = RelationGetDescr(rel);
+		/* a reverse map */
+		map = convert_tuples_by_name(old_tupdesc, tupdesc,
+									 gettext_noop("could not convert row type"));
+		if (map != NULL)
 		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-			TupleDesc	old_tupdesc = RelationGetDescr(rel);
-			TupleConversionMap *map;
-
-			rel = resultRelInfo->ri_PartitionRoot;
-			tupdesc = RelationGetDescr(rel);
-			/* a reverse map */
-			map = convert_tuples_by_name(old_tupdesc, tupdesc,
-										 gettext_noop("could not convert row type"));
-			if (map != NULL)
-			{
-				tuple = do_convert_tuple(tuple, map);
-				ExecSetSlotDescriptor(slot, tupdesc);
-				ExecStoreTuple(tuple, slot, InvalidBuffer, false);
-			}
+			tuple = do_convert_tuple(tuple, map);
+			ExecSetSlotDescriptor(slot, tupdesc);
+			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
 		}
-
-		insertedCols = GetInsertedColumns(resultRelInfo, estate);
-		updatedCols = GetUpdatedColumns(resultRelInfo, estate);
-		modifiedCols = bms_union(insertedCols, updatedCols);
-		val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
-												 slot,
-												 tupdesc,
-												 modifiedCols,
-												 64);
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("new row for relation \"%s\" violates partition constraint",
-						RelationGetRelationName(orig_rel)),
-				 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 	}
+
+	insertedCols = GetInsertedColumns(resultRelInfo, estate);
+	updatedCols = GetUpdatedColumns(resultRelInfo, estate);
+	modifiedCols = bms_union(insertedCols, updatedCols);
+	val_desc = ExecBuildSlotValueDescription(RelationGetRelid(rel),
+											 slot,
+											 tupdesc,
+											 modifiedCols,
+											 64);
+	ereport(ERROR,
+			(errcode(ERRCODE_CHECK_VIOLATION),
+			 errmsg("new row for relation \"%s\" violates partition constraint",
+					RelationGetRelationName(orig_rel)),
+			 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0));
 }
 
 /*
  * ExecConstraints - check constraints of the tuple in 'slot'
  *
- * This checks the traditional NOT NULL and check constraints, as well as
- * the partition constraint, if any.
+ * This checks the traditional NOT NULL and check constraints, and if
+ * requested, checks the partition constraint.
  *
  * Note: 'slot' contains the tuple to check the constraints of, which may
  * have been converted from the original input tuple after tuple routing.
@@ -1939,7 +1952,8 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
  */
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate)
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint)
 {
 	Relation	rel = resultRelInfo->ri_RelationDesc;
 	TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -2055,8 +2069,9 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 		}
 	}
 
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (check_partition_constraint && resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 }
 
 
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 115be02..8c0d2df 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -167,8 +167,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate);
+	if (resultRelInfo->ri_PartitionCheck &&
+		!ExecPartitionCheck(resultRelInfo, slot, estate))
+		ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
 
 	/* start with the root partitioned table */
 	parent = pd[0];
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 732ed42..32891ab 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -401,7 +401,7 @@ ExecSimpleRelationInsert(EState *estate, TupleTableSlot *slot)
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can inspect. */
 		tuple = ExecMaterializeSlot(slot);
@@ -466,7 +466,7 @@ ExecSimpleRelationUpdate(EState *estate, EPQState *epqstate,
 
 		/* Check the constraints of the tuple */
 		if (rel->rd_att->constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/* Store the slot into tuple that we can write. */
 		tuple = ExecMaterializeSlot(slot);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 95e0748..55dff5b 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -487,7 +487,7 @@ ExecInsert(ModifyTableState *mtstate,
 
 		/* Check the constraints of the tuple */
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -1049,7 +1049,7 @@ lreplace:;
 		 * tuple-routing is performed here, hence the slot remains unchanged.
 		 */
 		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate);
+			ExecConstraints(resultRelInfo, slot, estate, true);
 
 		/*
 		 * replace the heap tuple
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index e6569e1..a782fae 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -187,9 +187,12 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecCleanUpTriggerState(EState *estate);
 extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
-				TupleTableSlot *slot, EState *estate);
-extern void ExecPartitionCheck(ResultRelInfo *resultRelInfo,
+				TupleTableSlot *slot, EState *estate,
+				bool check_partition_constraint);
+extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
 				   TupleTableSlot *slot, EState *estate);
+extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+									TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
 					 TupleTableSlot *slot, EState *estate);
 extern LockTupleMode ExecUpdateLockMode(EState *estate, ResultRelInfo *relinfo);
-- 
2.1.4

update-partition-key_v34.patch
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index b1167a4..6d97f26 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3005,6 +3005,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition where this row satisfies its partition constraint.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3302,9 +3307,22 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</command> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</command> causes a row to move from one
+       partition to another, there is a chance that another concurrent
+       <command>UPDATE</command> or <command>DELETE</command> misses this row.
+       Suppose, session 1 is performing an <command>UPDATE</command> on a
+       partition key, and meanwhile a concurrent session 2 for which this row
+       is visible, performs an <command>UPDATE</command> or
+       <command>DELETE</command> operation on this row. Session 2 can silently
+       miss the row if the row is deleted from the partition due to session
+       1's activity.  In such case, session 2's
+       <command>UPDATE</command>/<command>DELETE</command>, being unaware of
+       the row movement, interprets that the row has just been deleted so there
+       is nothing to be done for this row. Whereas, in the usual case where the
+       table is not partitioned, or where there is no row movement, session 2
+       would have identified the newly updated row and carried out the
+       <command>UPDATE</command>/<command>DELETE</command> on this new row
+       version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index c0d0f71..296e301 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,16 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</command> and
+   <command>INSERT</command> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</command> or <command>DELETE</command> on the
+   same row may miss this row. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations"/>.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index bf5d3f9..8f83e6a 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -154,6 +154,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to move
+    to another partition, it will be performed as a <command>DELETE</command>
+    from the original partition followed by an <command>INSERT</command> into
+    the new partition. In this case, all row-level <literal>BEFORE</literal>
+    <command>UPDATE</command> triggers and all row-level
+    <literal>BEFORE</literal> <command>DELETE</command> triggers are fired on
+    the original partition. Then all row-level <literal>BEFORE</literal>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</literal>
+    triggers are concerned, <literal>AFTER</literal> <command>DELETE</command>
+    and <literal>AFTER</literal> <command>INSERT</command> triggers are
+    applied; but <literal>AFTER</literal> <command>UPDATE</command> triggers
+    are not applied because the <command>UPDATE</command> has been converted to
+    a <command>DELETE</command> and an <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 6bfca2a..1000c79 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2833,7 +2833,7 @@ CopyFrom(CopyState cstate)
 
 	/* Close all the partitioned tables, leaf partitions, and their indices */
 	if (cstate->partition_tuple_routing)
-		ExecCleanupTupleRouting(cstate->partition_tuple_routing);
+		ExecCleanupTupleRouting(NULL, cstate->partition_tuple_routing);
 
 	/* Close any trigger target relations */
 	ExecCleanUpTriggerState(estate);
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 1c488c3..e8af18e 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2854,8 +2854,13 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 	{
 		HeapTuple	trigtuple;
 
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
+		/*
+		 * Note: if the UPDATE is converted into a DELETE+INSERT as part of
+		 * update-partition-key operation, then this function is also called
+		 * separately for DELETE and INSERT to capture transition table rows.
+		 * In such case, either old tuple or new tuple can be NULL.
+		 */
+		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
 			trigtuple = GetTupleForTrigger(estate,
 										   NULL,
 										   relinfo,
@@ -5414,7 +5419,12 @@ AfterTriggerPendingOnRel(Oid relid)
  *	triggers actually need to be queued.  It is also called after each row,
  *	even if there are no triggers for that event, if there are any AFTER
  *	STATEMENT triggers for the statement which use transition tables, so that
- *	the transition tuplestores can be built.
+ *	the transition tuplestores can be built.  Furthermore, if the transition
+ *	capture is happening for UPDATEd rows being moved to another partition due
+ *	to the partition-key being changed, then this function is called once when
+ *	the row is deleted (to capture OLD row), and once when the row is inserted
+ *	into another partition (to capture NEW row).  This is done separately because
+ *	DELETE and INSERT happen on different tables.
  *
  *	Transition tuplestores are built now, rather than when events are pulled
  *	off of the queue because AFTER ROW triggers are allowed to select from the
@@ -5463,12 +5473,25 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		bool		update_new_table = transition_capture->tcs_update_new_table;
 		bool		insert_new_table = transition_capture->tcs_insert_new_table;;
 
-		if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_old_table))
+		/*
+		 * For INSERT events newtup should be non-NULL, for DELETE events
+		 * oldtup should be non-NULL, whereas for UPDATE events normally both
+		 * oldtup and newtup are non-NULL.  But for UPDATE events fired for
+		 * capturing transition tuples during UPDATE partition-key row
+		 * movement, oldtup is NULL when the event is for a row being inserted,
+		 * whereas newtup is NULL when the event is for a row being deleted.
+		 */
+		Assert(!(event == TRIGGER_EVENT_DELETE && delete_old_table &&
+				 oldtup == NULL));
+		Assert(!(event == TRIGGER_EVENT_INSERT && insert_new_table &&
+				 newtup == NULL));
+
+		if (oldtup != NULL &&
+			((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_old_table)))
 		{
 			Tuplestorestate *old_tuplestore;
 
-			Assert(oldtup != NULL);
 			old_tuplestore = transition_capture->tcs_private->old_tuplestore;
 
 			if (map != NULL)
@@ -5481,12 +5504,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			else
 				tuplestore_puttuple(old_tuplestore, oldtup);
 		}
-		if ((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_new_table))
+		if (newtup != NULL &&
+			((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
+			(event == TRIGGER_EVENT_UPDATE && update_new_table)))
 		{
 			Tuplestorestate *new_tuplestore;
 
-			Assert(newtup != NULL);
 			new_tuplestore = transition_capture->tcs_private->new_tuplestore;
 
 			if (original_insert_tuple != NULL)
@@ -5502,11 +5525,18 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				tuplestore_puttuple(new_tuplestore, newtup);
 		}
 
-		/* If transition tables are the only reason we're here, return. */
+		/*
+		 * If transition tables are the only reason we're here, return. As
+	 * mentioned above, we can also be here during update tuple routing in
+	 * the presence of transition tables, in which case this function is
+	 * called separately for oldtup and newtup, so we expect exactly one of
+	 * them to be NULL.
+		 */
 		if (trigdesc == NULL ||
 			(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
 			(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+			(event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
 			return;
 	}
 
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 8c0d2df..39225ff 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -54,7 +54,11 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
+	ResultRelInfo *leaf_part_arr = NULL,
+				  *update_rri = NULL;
+	int			num_update_rri = 0,
+				update_rri_index = 0;
+	bool		is_update = false;
 	PartitionTupleRouting *proute;
 
 	/*
@@ -73,6 +77,52 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 		(TupleConversionMap **) palloc0(proute->num_partitions *
 										sizeof(TupleConversionMap *));
 
+	/* Initialization specific to update */
+	if (mtstate && mtstate->operation == CMD_UPDATE)
+	{
+		ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+
+		is_update = true;
+		update_rri = mtstate->resultRelInfo;
+		num_update_rri = list_length(node->plans);
+
+		/*
+		 * For UPDATEs, if the leaf partition is already present among the
+		 * per-subplan result rels, we re-use it rather than initialize a new
+		 * result rel.  The per-subplan result rels and the leaf partition
+		 * result rels are both in the same canonical order, so while walking
+		 * through the leaf partition OIDs, we keep track of the next
+		 * per-subplan result rel to look for among the leaf partition result
+		 * rels.  update_rri_index thus starts at the first per-subplan result
+		 * rel (i.e. 0, which it was already set to at initialization above)
+		 * and is advanced each time we find a match while scanning the leaf
+		 * partition OIDs.
+		 */
+
+		/*
+		 * Prepare for generating the mapping from subplan result rels to leaf
+		 * partition position.
+		 */
+		proute->subplan_partition_offsets =
+			palloc(num_update_rri * sizeof(int));
+
+		/*
+		 * For UPDATEs, we require an additional tuple slot for storing
+		 * transient tuples that are converted to the root table descriptor.
+		 */
+		proute->root_tuple_slot = MakeTupleTableSlot();
+	}
+	else
+	{
+		/*
+		 * For inserts, we need to create all new result rels, so avoid
+		 * repeated pallocs by allocating memory for all the result rels in
+		 * bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(proute->num_partitions *
+												  sizeof(ResultRelInfo));
+	}
+
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
 	 * given partition's rowtype.  It is attached to the caller-specified node
@@ -81,20 +131,67 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 	 */
 	proute->partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = (ResultRelInfo *) palloc0(proute->num_partitions *
-											  sizeof(ResultRelInfo));
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (is_update)
+		{
+			/* Is this leaf partition present in the update resultrel? */
+			if (update_rri_index < num_update_rri &&
+				RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
+			{
+				leaf_part_rri = &update_rri[update_rri_index];
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required when we convert the partition's tuple to be
+				 * compatible with the root partitioned table's tuple
+				 * descriptor.  When generating the per-subplan UPDATE result
+				 * rels, this was not set.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+
+				/*
+				 * Save the position of this update rel in the leaf partitions
+				 * array
+				 */
+				proute->subplan_partition_offsets[update_rri_index] = i;
+
+				update_rri_index++;
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = &leaf_part_arr[i];
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in proute->partitions are
-		 * eventually closed by the caller.
+		 * If we didn't open the partition rel, it means we haven't initialized
+		 * the result rel either.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above including the leaf
+			 * partitions. Note that each of the newly opened relations in
+			 * proute->partitions are eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri,
+							  partrel,
+							  resultRTindex,
+							  rel,
+							  estate->es_instrument);
+		}
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
@@ -105,14 +202,10 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 			convert_tuples_by_name(tupDesc, part_tupdesc,
 								   gettext_noop("could not convert row type"));
 
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  resultRTindex,
-						  rel,
-						  estate->es_instrument);
-
 		/*
-		 * Verify result relation is a valid target for INSERT.
+		 * Verify result relation is a valid target for an INSERT.  An UPDATE
+		 * of a partition-key becomes a DELETE+INSERT operation, so this check
+		 * is still required when the operation is CMD_UPDATE.
 		 */
 		CheckValidResultRel(leaf_part_rri, CMD_INSERT);
 
@@ -132,10 +225,16 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		proute->partitions[i] = leaf_part_rri++;
+		proute->partitions[i] = leaf_part_rri;
 		i++;
 	}
 
+	/*
+	 * For UPDATE, we should have found all the per-subplan resultrels in the
+	 * leaf partitions.
+	 */
+	Assert(!is_update || update_rri_index == num_update_rri);
+
 	return proute;
 }
 
@@ -263,11 +362,18 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
  * routing.
  *
  * Close all the partitioned tables, leaf partitions, and their indices.
+ *
+ * 'mtstate' can be NULL if it is not available to the caller; e.g. for COPY.
+ * It is used only in the case of UPDATE, to access per-subplan result rels.
  */
 void
-ExecCleanupTupleRouting(PartitionTupleRouting * proute)
+ExecCleanupTupleRouting(ModifyTableState *mtstate,
+						PartitionTupleRouting * proute)
 {
 	int			i;
+	bool		is_update = (mtstate && mtstate->operation == CMD_UPDATE);
+	ResultRelInfo *first_resultRelInfo = NULL;
+	ResultRelInfo *last_resultRelInfo = NULL;
 
 	/*
 	 * Remember, proute->partition_dispatch_info[0] corresponds to the root
@@ -284,15 +390,34 @@ ExecCleanupTupleRouting(PartitionTupleRouting * proute)
 		ExecDropSingleTupleTableSlot(pd->tupslot);
 	}
 
+	/* Save the positions of first and last UPDATE subplan result rels */
+	if (is_update)
+	{
+		first_resultRelInfo = mtstate->resultRelInfo;
+		last_resultRelInfo = mtstate->resultRelInfo + mtstate->mt_nplans;
+	}
+
 	for (i = 0; i < proute->num_partitions; i++)
 	{
 		ResultRelInfo *resultRelInfo = proute->partitions[i];
 
+		/*
+		 * If this result rel is one of the UPDATE subplan result rels, let
+		 * ExecEndPlan() close it. For INSERT or COPY, this does not apply
+		 * because leaf partition result rels are always newly allocated.
+		 */
+		if (is_update &&
+			resultRelInfo >= first_resultRelInfo &&
+			resultRelInfo < last_resultRelInfo)
+			continue;
+
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
 
-	/* Release the standalone partition tuple descriptor, if any */
+	/* Release the standalone partition tuple descriptors, if any */
+	if (proute->root_tuple_slot)
+		ExecDropSingleTupleTableSlot(proute->root_tuple_slot);
 	if (proute->partition_tuple_slot)
 		ExecDropSingleTupleTableSlot(proute->partition_tuple_slot);
 }
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 55dff5b..e993f5d 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -46,6 +46,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -63,7 +64,13 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 EState *estate,
 					 bool canSetTag,
 					 TupleTableSlot **returning);
-
+static void ExecSetupChildParentMap(ModifyTableState *mtstate, bool perleaf);
+static inline TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
+													int whichplan);
+static HeapTuple ConvertPartitionTupleSlot(TupleConversionMap *map,
+										   HeapTuple tuple,
+										   TupleTableSlot *new_slot,
+										   TupleTableSlot **p_my_slot);
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
  * target relation's rowtype
@@ -241,6 +248,37 @@ ExecCheckTIDVisible(EState *estate,
 	ReleaseBuffer(buffer);
 }
 
+/*
+ * ConvertPartitionTupleSlot -- convenience function for tuple conversion using
+ * 'map'. The tuple, if converted, is stored in 'new_slot', and 'p_my_slot' is
+ * updated with the 'new_slot'. 'new_slot' typically should be one of the
+ * dedicated partition tuple slots. If map is NULL, keeps p_my_slot unchanged.
+ *
+ * Returns the converted tuple, unless map is NULL, in which case the
+ * original tuple is returned unmodified.
+ */
+static HeapTuple
+ConvertPartitionTupleSlot(TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot,
+						  TupleTableSlot **p_my_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor, as per converted tuple.
+	 */
+	Assert(new_slot != NULL);
+	*p_my_slot = new_slot;
+	ExecSetSlotDescriptor(new_slot, map->outdesc);
+	ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
 /* ----------------------------------------------------------------
  *		ExecInsert
  *
@@ -266,6 +304,7 @@ ExecInsert(ModifyTableState *mtstate,
 	Oid			newId;
 	List	   *recheckIndexes = NIL;
 	TupleTableSlot *result = NULL;
+	TransitionCaptureState *ar_insert_trig_tcs;
 
 	/*
 	 * get the heap tuple out of the tuple table slot, making sure we have a
@@ -283,7 +322,6 @@ ExecInsert(ModifyTableState *mtstate,
 	{
 		int			leaf_part_index;
 		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-		TupleConversionMap *map;
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
@@ -332,8 +370,10 @@ ExecInsert(ModifyTableState *mtstate,
 				 * back to tuplestore format.
 				 */
 				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+
+				Assert(mtstate->mt_is_tupconv_perpart);
 				mtstate->mt_transition_capture->tcs_map =
-					mtstate->mt_transition_tupconv_maps[leaf_part_index];
+					mtstate->mt_childparent_tupconv_maps[leaf_part_index];
 			}
 			else
 			{
@@ -346,30 +386,20 @@ ExecInsert(ModifyTableState *mtstate,
 			}
 		}
 		if (mtstate->mt_oc_transition_capture != NULL)
+		{
+			Assert(mtstate->mt_is_tupconv_perpart);
 			mtstate->mt_oc_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[leaf_part_index];
+				mtstate->mt_childparent_tupconv_maps[leaf_part_index];
+		}
 
 		/*
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = proute->partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = proute->partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(proute->partition_tupconv_maps[leaf_part_index],
+										  tuple,
+										  proute->partition_tuple_slot,
+										  &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -450,6 +480,7 @@ ExecInsert(ModifyTableState *mtstate,
 	}
 	else
 	{
+		WCOKind		wco_kind;
 		/*
 		 * We always check the partition constraint, including when the tuple
 		 * got here via tuple-routing.  However we don't need to in the latter
@@ -467,14 +498,21 @@ ExecInsert(ModifyTableState *mtstate,
 		tuple->t_tableOid = RelationGetRelid(resultRelationDesc);
 
 		/*
-		 * Check any RLS INSERT WITH CHECK policies
+		 * Check any RLS WITH CHECK policies.
 		 *
+		 * Normally we should check INSERT policies. But if the insert is part
+		 * of update-row-movement, we should instead check UPDATE policies,
+		 * because we are executing policies defined on the target table, and
+		 * not those defined on the child partitions.
+		 */
+		wco_kind = (mtstate->operation == CMD_UPDATE) ?
+					WCO_RLS_UPDATE_CHECK : WCO_RLS_INSERT_CHECK;
+		/*
 		 * ExecWithCheckOptions() will skip any WCOs which are not of the kind
 		 * we are looking for at this point.
 		 */
 		if (resultRelInfo->ri_WithCheckOptions != NIL)
-			ExecWithCheckOptions(WCO_RLS_INSERT_CHECK,
-								 resultRelInfo, slot, estate);
+			ExecWithCheckOptions(wco_kind, resultRelInfo, slot, estate);
 
 		/*
 		 * No need though if the tuple has been routed, and a BR trigger
@@ -623,9 +661,33 @@ ExecInsert(ModifyTableState *mtstate,
 		setLastTid(&(tuple->t_self));
 	}
 
+	/*
+	 * If this INSERT is part of a partition-key-UPDATE and we are capturing
+	 * transition tuples, put this row into the transition NEW TABLE.
+	 * (Similarly, we need to add the deleted row to the OLD TABLE.)  We do
+	 * this separately for DELETE and INSERT because they happen on different
+	 * tables.
+	 */
+	ar_insert_trig_tcs = mtstate->mt_transition_capture;
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_new_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo, NULL,
+							 NULL,
+							 tuple,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * We've already captured the NEW TABLE row, so make sure any AR INSERT
+		 * trigger fired below doesn't capture it again.
+		 */
+		ar_insert_trig_tcs = NULL;
+	}
+
 	/* AFTER ROW INSERT Triggers */
 	ExecARInsertTriggers(estate, resultRelInfo, tuple, recheckIndexes,
-						 mtstate->mt_transition_capture);
+						 ar_insert_trig_tcs);
 
 	list_free(recheckIndexes);
 
@@ -679,6 +741,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool *tupleDeleted,
+		   bool processReturning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -686,6 +750,10 @@ ExecDelete(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
+	TransitionCaptureState *ar_delete_trig_tcs;
+
+	if (tupleDeleted)
+		*tupleDeleted = false;
 
 	/*
 	 * get information on the (current) result relation
@@ -850,12 +918,40 @@ ldelete:;
 	if (canSetTag)
 		(estate->es_processed)++;
 
+	/* The DELETE has actually happened, so tell the caller */
+	if (tupleDeleted)
+		*tupleDeleted = true;
+
+	/*
+	 * If this DELETE is part of a partition-key-UPDATE and we are capturing
+	 * transition tuples, put this row into the transition OLD TABLE.  We
+	 * need to do this separately for DELETE and INSERT because they happen
+	 * on different tables.
+	 */
+	ar_delete_trig_tcs = mtstate->mt_transition_capture;
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_old_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo,
+							 tupleid,
+							 oldtuple,
+							 NULL,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * We've already captured the OLD TABLE row, so make sure any AR DELETE
+		 * trigger fired below doesn't capture it again.
+		 */
+		ar_delete_trig_tcs = NULL;
+	}
+
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
-						 mtstate->mt_transition_capture);
+						 ar_delete_trig_tcs);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (processReturning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -948,6 +1044,7 @@ ExecUpdate(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	TupleConversionMap *saved_tcs_map = NULL;
 
 	/*
 	 * abort the operation if not running transactions
@@ -1019,6 +1116,7 @@ ExecUpdate(ModifyTableState *mtstate,
 	else
 	{
 		LockTupleMode lockmode;
+		bool		partition_constraint_failed;
 
 		/*
 		 * Constraints might reference the tableoid column, so initialize
@@ -1034,22 +1132,142 @@ ExecUpdate(ModifyTableState *mtstate,
 		 * (We don't need to redo triggers, however.  If there are any BEFORE
 		 * triggers then trigger.c will have done heap_lock_tuple to lock the
 		 * correct tuple, so there's no need to do them again.)
-		 *
-		 * ExecWithCheckOptions() will skip any WCOs which are not of the kind
-		 * we are looking for at this point.
 		 */
 lreplace:;
-		if (resultRelInfo->ri_WithCheckOptions != NIL)
+
+		/*
+		 * If the partition constraint fails, this row might get moved to
+		 * another partition, in which case we should check the RLS CHECK
+		 * policy just before inserting into the new partition, rather than
+		 * here, because a trigger on that partition might again change the
+		 * row.  So skip the WCO checks if the partition constraint fails.
+		 */
+		partition_constraint_failed =
+			resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate);
+
+		if (!partition_constraint_failed &&
+			resultRelInfo->ri_WithCheckOptions != NIL)
+		{
+			/*
+			 * ExecWithCheckOptions() will skip any WCOs which are not of the
+			 * kind we are looking for at this point.
+			 */
 			ExecWithCheckOptions(WCO_RLS_UPDATE_CHECK,
 								 resultRelInfo, slot, estate);
+		}
+
+		/*
+		 * If a partition check failed, try to move the row into the right
+		 * partition.
+		 */
+		if (partition_constraint_failed)
+		{
+			bool		tuple_deleted;
+			TupleTableSlot *ret_slot;
+			PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+			int			map_index;
+			TupleConversionMap *tupconv_map;
+
+			/*
+			 * When an UPDATE is run on a leaf partition, we will not have
+			 * partition tuple routing set up. In that case, fail with
+			 * partition constraint violation error.
+			 */
+			if (proute == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for DELETE. We want to return rows
+			 * from INSERT.
+			 */
+			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
+					   &tuple_deleted, false, false);
+
+			/*
+			 * If for some reason the DELETE didn't happen (e.g. a trigger
+			 * prevented it, or it was already deleted by self, or it was
+			 * concurrently deleted by another transaction), then we should
+			 * skip the INSERT as well; otherwise, an UPDATE could cause an
+			 * increase in the total number of rows across all partitions,
+			 * which is clearly wrong.
+			 *
+			 * For a normal UPDATE, the case where the tuple has been the
+			 * subject of a concurrent UPDATE or DELETE would be handled by the
+			 * EvalPlanQual machinery, but for an UPDATE that we've translated
+			 * into a DELETE from this partition and an INSERT into some other
+			 * partition, that's not available, because CTID chains can't span
+			 * relation boundaries.  We mimic the semantics to a limited extent
+			 * by skipping the INSERT if the DELETE fails to find a tuple. This
+			 * ensures that two concurrent attempts to UPDATE the same tuple at
+			 * the same time can't turn one tuple into two, and that an UPDATE
+			 * of a just-deleted tuple can't resurrect it.
+			 */
+			if (!tuple_deleted)
+				return NULL;
+
+			/*
+			 * UPDATE sets the transition capture map only when a new subplan
+			 * is chosen, whereas INSERT sets it for each row.  So after the
+			 * INSERT, we need to revert to the map created for UPDATE;
+			 * otherwise the next UPDATE will incorrectly use the one created
+			 * for INSERT.  So first save the one created for UPDATE.
+			 */
+			if (mtstate->mt_transition_capture)
+				saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
+
+			/*
+			 * resultRelInfo is one of the per-subplan resultRelInfos.  So we
+			 * should convert the tuple into root's tuple descriptor, since
+			 * ExecInsert() starts the search from root.  The tuple conversion
+			 * map list is in the order of mtstate->resultRelInfo[], so to
+			 * retrieve the one for this resultRel, we need to know the
+			 * position of the resultRel in mtstate->resultRelInfo[].
+			 */
+			map_index = resultRelInfo - mtstate->resultRelInfo;
+			Assert(map_index >= 0 && map_index < mtstate->mt_nplans);
+			tupconv_map = tupconv_map_for_subplan(mtstate, map_index);
+			tuple = ConvertPartitionTupleSlot(tupconv_map,
+											  tuple,
+											  proute->root_tuple_slot,
+											  &slot);
+
+			/*
+			 * For ExecInsert(), make it look like we are inserting into the
+			 * root.
+			 */
+			Assert(mtstate->rootResultRelInfo != NULL);
+			estate->es_result_relation_info = mtstate->rootResultRelInfo;
+
+			ret_slot = ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+
+			/*
+			 * Revert the active result relation and the active transition
+			 * capture map that we changed above.
+			 */
+			estate->es_result_relation_info = resultRelInfo;
+			if (mtstate->mt_transition_capture)
+			{
+				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+				mtstate->mt_transition_capture->tcs_map = saved_tcs_map;
+			}
+			return ret_slot;
+		}
 
 		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We've already checked the partition constraint above; however, we
+		 * must still ensure the tuple passes all other constraints, so we will
+		 * call ExecConstraints() and have it validate all remaining checks.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate, true);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1477,7 +1695,6 @@ static void
 ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 {
 	ResultRelInfo *targetRelInfo = getASTriggerResultRelInfo(mtstate);
-	int			i;
 
 	/* Check for transition tables on the directly targeted relation. */
 	mtstate->mt_transition_capture =
@@ -1500,62 +1717,149 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	if (mtstate->mt_transition_capture != NULL ||
 		mtstate->mt_oc_transition_capture != NULL)
 	{
-		int			numResultRelInfos;
-		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+		ExecSetupChildParentMap(mtstate,
+								(mtstate->mt_partition_tuple_routing != NULL));
 
-		numResultRelInfos = (proute != NULL ?
-							 proute->num_partitions :
-							 mtstate->mt_nplans);
+		/*
+		 * Install the conversion map for the first plan for UPDATE and DELETE
+		 * operations.  It will be advanced each time we switch to the next
+		 * plan.  (INSERT operations set it every time, so we need not update
+		 * mtstate->mt_oc_transition_capture here.)
+		 */
+		if (mtstate->mt_transition_capture && mtstate->operation != CMD_INSERT)
+			mtstate->mt_transition_capture->tcs_map =
+				tupconv_map_for_subplan(mtstate, 0);
+	}
+}
+
+/*
+ * Initialize the child-to-root tuple conversion map array.
+ *
+ * This map array is required for two purposes:
+ * 1. For update-tuple-routing. We need to convert the tuple from the subplan
+ *    result rel to the root partitioned table descriptor.
+ * 2. For capturing transition tuples when the target table is a partitioned
+ *    table. For updates, we need to convert the tuple from the subplan result
+ *    rel to the target table descriptor, and for inserts, we need to convert
+ *    the inserted tuple from the leaf partition to the target table
+ *    descriptor.
+ *
+ * The caller can request either a per-subplan map or per-leaf-partition map.
+ */
+static void
+ExecSetupChildParentMap(ModifyTableState *mtstate, bool perleaf)
+{
+	ResultRelInfo *rootRelInfo = getASTriggerResultRelInfo(mtstate);
+	PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+	TupleDesc	outdesc;
+	int			numResultRelInfos;
+	int			i;
 
+	/* First check if there is already one */
+	if (mtstate->mt_childparent_tupconv_maps)
+	{
 		/*
-		 * Build array of conversion maps from each child's TupleDesc to the
-		 * one used in the tuplestore.  The map pointers may be NULL when no
-		 * conversion is necessary, which is hopefully a common case for
-		 * partitions.
+		 * If a per-leaf map is required and a map has already been created,
+		 * that existing map has to be per-leaf.  If the existing map were
+		 * per-subplan, we would not be able to access it leaf-partition-wise.
+		 * A per-leaf map, on the other hand, can still be accessed
+		 * subplan-wise, via the subplan_partition_offsets array in
+		 * tupconv_map_for_subplan().  So, callers that might need to access
+		 * the map both leaf-partition-wise and subplan-wise should make sure
+		 * that the first time this function is called, it is called with
+		 * perleaf=true, so that the map created is per-leaf rather than
+		 * per-subplan.
 		 */
-		mtstate->mt_transition_tupconv_maps = (TupleConversionMap **)
-			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+		Assert(!perleaf || mtstate->mt_is_tupconv_perpart);
+		return;
+	}
 
-		/* Choose the right set of partitions */
-		if (proute != NULL)
-		{
-			/*
-			 * For tuple routing among partitions, we need TupleDescs based on
-			 * the partition routing table.
-			 */
-			ResultRelInfo **resultRelInfos = proute->partitions;
+	/* If perleaf is true, partition tuple routing info has to be present */
+	Assert(!perleaf || proute != NULL);
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
-		}
-		else
-		{
-			/* Otherwise we need the ResultRelInfo for each subplan. */
-			ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+	numResultRelInfos = (perleaf ? proute->num_partitions :
+								   mtstate->mt_nplans);
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
+	/* Get tuple descriptor of the root partitioned table. */
+	outdesc = RelationGetDescr(rootRelInfo->ri_RelationDesc);
+
+	/*
+	 * Build array of conversion maps from each child's TupleDesc to the
+	 * one used in the tuplestore.  The map pointers may be NULL when no
+	 * conversion is necessary, which is hopefully a common case for
+	 * partitions.
+	 */
+	mtstate->mt_childparent_tupconv_maps = (TupleConversionMap **)
+		palloc(sizeof(TupleConversionMap *) * numResultRelInfos);
+
+	/* Choose the right set of partitions */
+	if (perleaf)
+	{
+		/*
+		 * For tuple routing among partitions, we need TupleDescs based on the
+		 * partition routing table.
+		 */
+		ResultRelInfo **resultRelInfos;
+
+		resultRelInfos = proute->partitions;
+
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_childparent_tupconv_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
+									   outdesc,
+									   gettext_noop("could not convert row type"));
 		}
 
 		/*
-		 * Install the conversion map for the first plan for UPDATE and DELETE
-		 * operations.  It will be advanced each time we switch to the next
-		 * plan.  (INSERT operations set it every time, so we need not update
-		 * mtstate->mt_oc_transition_capture here.)
+		 * Record that the tuple conversion map is per-leaf, not
+		 * per-subplan.
 		 */
-		if (mtstate->mt_transition_capture)
-			mtstate->mt_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[0];
+		mtstate->mt_is_tupconv_perpart = true;
+	}
+	else
+	{
+		/* Otherwise we need the ResultRelInfo for each subplan. */
+		ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_childparent_tupconv_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
+									   outdesc,
+									   gettext_noop("could not convert row type"));
+		}
+	}
+}
+
+/*
+ * For a given subplan index, get the tuple conversion map.
+ */
+static inline TupleConversionMap *
+tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
+{
+	Assert(mtstate->mt_childparent_tupconv_maps != NULL);
+
+	/*
+	 * If the tuple conversion map array is per-partition, we need to first get
+	 * the index into the partition array.
+	 */
+	if (mtstate->mt_is_tupconv_perpart)
+	{
+		int leaf_index;
+		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+
+		Assert(proute && proute->subplan_partition_offsets != NULL);
+		leaf_index = proute->subplan_partition_offsets[whichplan];
+
+		Assert(leaf_index >= 0 && leaf_index < proute->num_partitions);
+		return mtstate->mt_childparent_tupconv_maps[leaf_index];
+	}
+	else
+	{
+		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+		return mtstate->mt_childparent_tupconv_maps[whichplan];
 	}
 }
 
@@ -1662,15 +1966,13 @@ ExecModifyTable(PlanState *pstate)
 				/* Prepare to convert transition tuples from this child. */
 				if (node->mt_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
 					node->mt_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						tupconv_map_for_subplan(node, node->mt_whichplan);
 				}
 				if (node->mt_oc_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
 					node->mt_oc_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						tupconv_map_for_subplan(node, node->mt_whichplan);
 				}
 				continue;
 			}
@@ -1787,7 +2089,8 @@ ExecModifyTable(PlanState *pstate)
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1831,9 +2134,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ResultRelInfo *saved_resultRelInfo;
 	ResultRelInfo *resultRelInfo;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	bool		update_tuple_routing_needed = node->partKeyUpdated;
 	PartitionTupleRouting *proute = NULL;
 	int			num_partitions = 0;
 
@@ -1908,6 +2214,16 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(resultRelInfo, mtstate->mt_onconflict != ONCONFLICT_NONE);
 
+		/*
+		 * If this is an UPDATE and a BEFORE UPDATE trigger is present, the
+		 * trigger itself might modify the partition-key values. So arrange for
+		 * tuple routing.
+		 */
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+			operation == CMD_UPDATE)
+			update_tuple_routing_needed = true;
+
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
 		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
@@ -1945,15 +2261,32 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/*
+	 * If it's not a partitioned table after all, UPDATE tuple routing should
+	 * not be attempted.
+	 */
+	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		update_tuple_routing_needed = false;
+
+	/*
+	 * Build state for tuple routing if it's an INSERT, or if it's an UPDATE
+	 * of the partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || update_tuple_routing_needed))
 	{
 		proute = mtstate->mt_partition_tuple_routing =
 			ExecSetupPartitionTupleRouting(mtstate,
 										   rel, node->nominalRelation,
 										   estate);
 		num_partitions = proute->num_partitions;
+
+		/*
+		 * These are required as reference objects for mapping partition
+		 * attnos in expressions such as WithCheckOptions and RETURNING.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	}
 
 	/*
@@ -1964,6 +2297,17 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		ExecSetupTransitionCaptureState(mtstate, estate);
 
 	/*
+	 * Construct a mapping from each of the per-subplan partition attnos to
+	 * the root attnos.  This is needed during UPDATE row movement, when the
+	 * tuple descriptor of a source partition does not match that of the root
+	 * partitioned table.  In that case we must convert tuples to the root
+	 * tuple descriptor, because the search for the destination partition
+	 * starts from the root.  Skip this setup for non-partition-key updates.
+	 */
+	if (update_tuple_routing_needed)
+		ExecSetupChildParentMap(mtstate, false);
+
+	/*
 	 * Initialize any WITH CHECK OPTION constraints if needed.
 	 */
 	resultRelInfo = mtstate->resultRelInfo;
@@ -1993,26 +2337,29 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE row
+	 * movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *first_wcoList;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition. Whereas for UPDATE, there are as many WCOs as there are
+		 * plans. So in either case, use the WCO expression of the first
+		 * resultRelInfo as a reference to calculate attnos for the WCO
+		 * expression of each of the partitions. We make a copy of the WCO
+		 * qual for each partition. Note that, if there are SubPlans in there,
+		 * they all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
-			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
+		Assert(update_tuple_routing_needed ||
+			   (operation == CMD_INSERT &&
+				list_length(node->withCheckOptionLists) == 1 &&
+				mtstate->mt_nplans == 1));
+
+		first_wcoList = linitial(node->withCheckOptionLists);
 		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
@@ -2021,17 +2368,26 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			ListCell   *ll;
 
 			resultRelInfo = proute->partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the update
+			 * result rels, that result rel would already have WithCheckOptions
+			 * initialized.
+			 */
+			if (resultRelInfo->ri_WithCheckOptions)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel, NULL);
+			mapped_wcoList = map_partition_varattnos(first_wcoList,
+													firstVarno,
+													partrel, firstResultRel,
+													NULL);
 			foreach(ll, mapped_wcoList)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
@@ -2048,7 +2404,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -2084,22 +2440,35 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *rlist;
 
 			resultRelInfo = proute->partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the update
+			 * result rels, that result rel would already have a returningList
+			 * built.
+			 */
+			if (resultRelInfo->ri_projectReturning)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel, NULL);
+			/*
+			 * Use the returning expression of the first resultRelInfo as a
+			 * reference to calculate attnos for the returning expression of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList,
+											firstVarno,
+											partrel, firstResultRel, NULL);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
@@ -2361,7 +2730,7 @@ ExecEndModifyTable(ModifyTableState *node)
 
 	/* Close all the partitioned tables, leaf partitions, and their indices */
 	if (node->mt_partition_tuple_routing)
-		ExecCleanupTupleRouting(node->mt_partition_tuple_routing);
+		ExecCleanupTupleRouting(node, node->mt_partition_tuple_routing);
 
 	/*
 	 * Free the exprcontext
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index ddbbc79..5e27d8c 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -204,6 +204,7 @@ _copyModifyTable(const ModifyTable *from)
 	COPY_SCALAR_FIELD(canSetTag);
 	COPY_SCALAR_FIELD(nominalRelation);
 	COPY_NODE_FIELD(partitioned_rels);
+	COPY_SCALAR_FIELD(partKeyUpdated);
 	COPY_NODE_FIELD(resultRelations);
 	COPY_SCALAR_FIELD(resultRelIndex);
 	COPY_SCALAR_FIELD(rootResultRelIndex);
@@ -2263,6 +2264,7 @@ _copyPartitionedChildRelInfo(const PartitionedChildRelInfo *from)
 
 	COPY_SCALAR_FIELD(parent_relid);
 	COPY_NODE_FIELD(child_rels);
+	COPY_SCALAR_FIELD(is_partition_key_update);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 30ccc9c..9461bb7 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -908,6 +908,7 @@ _equalPartitionedChildRelInfo(const PartitionedChildRelInfo *a, const Partitione
 {
 	COMPARE_SCALAR_FIELD(parent_relid);
 	COMPARE_NODE_FIELD(child_rels);
+	COMPARE_SCALAR_FIELD(is_partition_key_update);
 
 	return true;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 5e72df1..a067ba5 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -372,6 +372,7 @@ _outModifyTable(StringInfo str, const ModifyTable *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(partKeyUpdated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_INT_FIELD(resultRelIndex);
 	WRITE_INT_FIELD(rootResultRelIndex);
@@ -2105,6 +2106,7 @@ _outModifyTablePath(StringInfo str, const ModifyTablePath *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(partKeyUpdated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_NODE_FIELD(subroots);
@@ -2527,6 +2529,7 @@ _outPartitionedChildRelInfo(StringInfo str, const PartitionedChildRelInfo *node)
 
 	WRITE_UINT_FIELD(parent_relid);
 	WRITE_NODE_FIELD(child_rels);
+	WRITE_BOOL_FIELD(is_partition_key_update);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 9925866..ea383cc 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1568,6 +1568,7 @@ _readModifyTable(void)
 	READ_BOOL_FIELD(canSetTag);
 	READ_UINT_FIELD(nominalRelation);
 	READ_NODE_FIELD(partitioned_rels);
+	READ_BOOL_FIELD(partKeyUpdated);
 	READ_NODE_FIELD(resultRelations);
 	READ_INT_FIELD(resultRelIndex);
 	READ_INT_FIELD(rootResultRelIndex);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 12a6ee4..f509359 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1364,7 +1364,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 			case RTE_RELATION:
 				if (rte->relkind == RELKIND_PARTITIONED_TABLE)
 					partitioned_rels =
-						get_partitioned_child_rels(root, rel->relid);
+						get_partitioned_child_rels(root, rel->relid, NULL);
 				break;
 			case RTE_SUBQUERY:
 				build_partitioned_rels = true;
@@ -1403,7 +1403,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		{
 			List	   *cprels;
 
-			cprels = get_partitioned_child_rels(root, childrel->relid);
+			cprels = get_partitioned_child_rels(root, childrel->relid, NULL);
 			partitioned_rels = list_concat(partitioned_rels,
 										   list_copy(cprels));
 		}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e599283..4ceaf17 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -279,6 +279,7 @@ static ProjectSet *make_project_set(List *tlist, Plan *subplan);
 static ModifyTable *make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool partKeyUpdated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
@@ -2373,6 +2374,7 @@ create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 							best_path->canSetTag,
 							best_path->nominalRelation,
 							best_path->partitioned_rels,
+							best_path->partKeyUpdated,
 							best_path->resultRelations,
 							subplans,
 							best_path->withCheckOptionLists,
@@ -6442,6 +6444,7 @@ static ModifyTable *
 make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool partKeyUpdated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam)
@@ -6468,6 +6471,7 @@ make_modifytable(PlannerInfo *root,
 	node->canSetTag = canSetTag;
 	node->nominalRelation = nominalRelation;
 	node->partitioned_rels = partitioned_rels;
+	node->partKeyUpdated = partKeyUpdated;
 	node->resultRelations = resultRelations;
 	node->resultRelIndex = -1;	/* will be set correctly in setrefs.c */
 	node->rootResultRelIndex = -1;	/* will be set correctly in setrefs.c */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 7b52dad..66b8356 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1101,6 +1101,7 @@ inheritance_planner(PlannerInfo *root)
 	Query	   *parent_parse;
 	Bitmapset  *parent_relids = bms_make_singleton(top_parentRTindex);
 	PlannerInfo **parent_roots = NULL;
+	bool		partKeyUpdated = false;
 
 	Assert(parse->commandType != CMD_INSERT);
 
@@ -1172,7 +1173,8 @@ inheritance_planner(PlannerInfo *root)
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		nominalRelation = top_parentRTindex;
-		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex);
+		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex,
+													  &partKeyUpdated);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
 	}
@@ -1512,6 +1514,7 @@ inheritance_planner(PlannerInfo *root)
 									 parse->canSetTag,
 									 nominalRelation,
 									 partitioned_rels,
+									 partKeyUpdated,
 									 resultRelations,
 									 subpaths,
 									 subroots,
@@ -2123,6 +2126,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 										parse->canSetTag,
 										parse->resultRelation,
 										NIL,
+										false,
 										list_make1_int(parse->resultRelation),
 										list_make1(path),
 										list_make1(root),
@@ -6155,17 +6159,22 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 /*
  * get_partitioned_child_rels
  *		Returns a list of the RT indexes of the partitioned child relations
- *		with rti as the root parent RT index.
+ *		with rti as the root parent RT index. If is_partition_key_update isn't
+ *		NULL, it's set to true if any updated column is a partition key column.
  *
  * Note: This function might get called even for range table entries that
  * are not partitioned tables; in such a case, it will simply return NIL.
  */
 List *
-get_partitioned_child_rels(PlannerInfo *root, Index rti)
+get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   bool *is_partition_key_update)
 {
 	List	   *result = NIL;
 	ListCell   *l;
 
+	if (is_partition_key_update)
+		*is_partition_key_update = false;
+
 	foreach(l, root->pcinfo_list)
 	{
 		PartitionedChildRelInfo *pc = lfirst_node(PartitionedChildRelInfo, l);
@@ -6173,6 +6182,8 @@ get_partitioned_child_rels(PlannerInfo *root, Index rti)
 		if (pc->parent_relid == rti)
 		{
 			result = pc->child_rels;
+			if (is_partition_key_update)
+				*is_partition_key_update = pc->is_partition_key_update;
 			break;
 		}
 	}
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 5a08e75..7447a62 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -105,7 +105,8 @@ static void expand_partitioned_rtentry(PlannerInfo *root,
 						   RangeTblEntry *parentrte,
 						   Index parentRTindex, Relation parentrel,
 						   PlanRowMark *top_parentrc, LOCKMODE lockmode,
-						   List **appinfos, List **partitioned_child_rels);
+						   List **appinfos, List **partitioned_child_rels,
+						   bool *is_partition_key_update);
 static void expand_single_inheritance_child(PlannerInfo *root,
 								RangeTblEntry *parentrte,
 								Index parentRTindex, Relation parentrel,
@@ -1461,16 +1462,19 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	if (RelationGetPartitionDesc(oldrelation) != NULL)
 	{
 		List	   *partitioned_child_rels = NIL;
+		bool		is_partition_key_update = false;
 
 		Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
 
 		/*
 		 * If this table has partitions, recursively expand them in the order
-		 * in which they appear in the PartitionDesc.
+		 * in which they appear in the PartitionDesc.  While at it, also
+		 * extract the partition key columns of all the partitioned tables.
 		 */
 		expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
 								   lockmode, &root->append_rel_list,
-								   &partitioned_child_rels);
+								   &partitioned_child_rels,
+								   &is_partition_key_update);
 
 		/*
 		 * We keep a list of objects in root, each of which maps a root
@@ -1487,6 +1491,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 			pcinfo = makeNode(PartitionedChildRelInfo);
 			pcinfo->parent_relid = rti;
 			pcinfo->child_rels = partitioned_child_rels;
+			pcinfo->is_partition_key_update = is_partition_key_update;
 			root->pcinfo_list = lappend(root->pcinfo_list, pcinfo);
 		}
 	}
@@ -1563,7 +1568,8 @@ static void
 expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 						   Index parentRTindex, Relation parentrel,
 						   PlanRowMark *top_parentrc, LOCKMODE lockmode,
-						   List **appinfos, List **partitioned_child_rels)
+						   List **appinfos, List **partitioned_child_rels,
+						   bool *is_partition_key_update)
 {
 	int			i;
 	RangeTblEntry *childrte;
@@ -1578,6 +1584,17 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 
 	Assert(parentrte->inh);
 
+	/*
+	 * Note whether any partition key columns are being updated. Although it
+	 * is the root partitioned table's updatedCols we are interested in, we
+	 * use parentrte to get the updatedCols. This is convenient because
+	 * parentrte already has the root partrel's updatedCols translated to
+	 * match the attribute ordering of parentrel.
+	 */
+	if (!*is_partition_key_update)
+		*is_partition_key_update =
+			has_partition_attrs(parentrel, parentrte->updatedCols, NULL);
+
 	/* First expand the partitioned table itself. */
 	expand_single_inheritance_child(root, parentrte, parentRTindex, parentrel,
 									top_parentrc, parentrel,
@@ -1617,7 +1634,8 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 		if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 			expand_partitioned_rtentry(root, childrte, childRTindex,
 									   childrel, top_parentrc, lockmode,
-									   appinfos, partitioned_child_rels);
+									   appinfos, partitioned_child_rels,
+									   is_partition_key_update);
 
 		/* Close child relation, but keep locks */
 		heap_close(childrel, NoLock);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 7df8761..6d86b3a 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3268,6 +3268,8 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partitioned_rels' is an integer list of RT indexes of non-leaf tables in
  *		the partition tree, if this is an UPDATE/DELETE to a partitioned table.
  *		Otherwise NIL.
+ * 'partKeyUpdated' is true if any partitioning columns are being updated,
+ *		either from the target relation or a descendant partitioned table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
  * 'subpaths' is a list of Path(s) producing source data (one per rel)
  * 'subroots' is a list of PlannerInfo structs (one per rel)
@@ -3281,6 +3283,7 @@ ModifyTablePath *
 create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool partKeyUpdated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
@@ -3348,6 +3351,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->canSetTag = canSetTag;
 	pathnode->nominalRelation = nominalRelation;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
+	pathnode->partKeyUpdated = partKeyUpdated;
 	pathnode->resultRelations = resultRelations;
 	pathnode->subpaths = subpaths;
 	pathnode->subroots = subroots;
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index b5df357..b3ef535 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -67,6 +67,9 @@ typedef struct PartitionDispatchData *PartitionDispatch;
  *								convert input tuple based on the root table's
  *								rowtype to a leaf partition's rowtype after
  *								tuple routing is done)
+ * subplan_partition_offsets	int array ordered by UPDATE subplans. Each
+ *								element contains the index of the subplan's
+ *								partition in the 'partitions' array.
  * partition_tuple_slot			TupleTableSlot to be used to manipulate any
  *								given leaf partition's rowtype after that
  *								partition is chosen for insertion by
@@ -80,7 +83,9 @@ typedef struct PartitionTupleRouting
 	ResultRelInfo **partitions;
 	int			num_partitions;
 	TupleConversionMap **partition_tupconv_maps;
+	int		   *subplan_partition_offsets;
 	TupleTableSlot *partition_tuple_slot;
+	TupleTableSlot *root_tuple_slot;
 } PartitionTupleRouting;
 
 extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
@@ -90,6 +95,7 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
-extern void ExecCleanupTupleRouting(PartitionTupleRouting *proute);
+extern void ExecCleanupTupleRouting(ModifyTableState *mtstate,
+						PartitionTupleRouting *proute);
 
 #endif							/* EXECPARTITION_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 2a4f740..d57f4de 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -991,8 +991,9 @@ typedef struct ModifyTableState
 	/* controls transition table population for specified operation */
 	struct TransitionCaptureState *mt_oc_transition_capture;
 	/* controls transition table population for INSERT...ON CONFLICT UPDATE */
-	TupleConversionMap **mt_transition_tupconv_maps;
-	/* Per plan/partition tuple conversion */
+	TupleConversionMap **mt_childparent_tupconv_maps;
+	/* Per plan/partition map for tuple conversion from child to root */
+	bool		mt_is_tupconv_perpart;	/* Is the above map per-partition? */
 } ModifyTableState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 74e9fb5..a9e6d45 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -219,6 +219,7 @@ typedef struct ModifyTable
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		partKeyUpdated;	/* some part key in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	int			resultRelIndex; /* index of first resultRel in plan's list */
 	int			rootResultRelIndex; /* index of the partitioned table root */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 71689b8..9b2fd5f 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1674,6 +1674,7 @@ typedef struct ModifyTablePath
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		partKeyUpdated;	/* some part key in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	List	   *subpaths;		/* Path(s) producing source data */
 	List	   *subroots;		/* per-target-table PlannerInfos */
@@ -2124,6 +2125,9 @@ typedef struct PartitionedChildRelInfo
 
 	Index		parent_relid;
 	List	   *child_rels;
+	bool		is_partition_key_update;	/* is the partition key of any of
+											 * the partitioned tables
+											 * updated? */
 } PartitionedChildRelInfo;
 
 /*
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 725694f..ef7173f 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -242,6 +242,7 @@ extern ModifyTablePath *create_modifytable_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
						bool partKeyUpdated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index 997b91f..4445878 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -57,7 +57,8 @@ extern Expr *preprocess_phv_expression(PlannerInfo *root, Expr *expr);
 
 extern bool plan_cluster_use_sort(Oid tableOid, Oid indexOid);
 
-extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti);
+extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   bool *is_partition_key_update);
 extern List *get_partitioned_child_rels_for_join(PlannerInfo *root,
 									Relids join_relids);
 
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index b69ceaa..0dfd3a6 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,36 +198,441 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+---------------------------
+-- UPDATE with row movement
+---------------------------
+-- update to a partition should check partition bound constraint for the new tuple.
+-- If partition key is updated, the row should be moved to the appropriate
+-- partition. updatable views using partitions should enforce the check options
+-- for the rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+create table part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+alter table range_parted attach partition part_b_20_b_30 for values from ('b', 20) to ('b', 30);
+create table part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
--- ok
-update range_parted set b = b + 1 where b = 10;
+alter table range_parted attach partition part_b_10_b_20 for values from ('b', 10) to ('b', 20);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) partition by range (abs(d));
+alter table part_c_100_200 drop column e, drop column c, drop column a;
+alter table part_c_100_200 add column c numeric, add column e varchar, add column a text;
+alter table part_c_100_200 drop column b;
+alter table part_c_100_200 add column b bigint;
+create table part_d_1_15 partition of part_c_100_200 for values from (1) to (15);
+create table part_d_15_20 partition of part_c_100_200 for values from (15) to (20);
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+(6 rows)
+
+-- The order of subplans should be in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+             QUERY PLAN              
+-------------------------------------
+ Update on range_parted
+   Update on part_a_1_a_10
+   Update on part_a_10_a_20
+   Update on part_b_1_b_10
+   Update on part_c_1_100
+   Update on part_d_1_15
+   Update on part_d_15_20
+   Update on part_b_20_b_30
+   ->  Seq Scan on part_a_1_a_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_a_10_a_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_1_b_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_1_100
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_1_15
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_15_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_20_b_30
+         Filter: (c > '97'::numeric)
+(22 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_100_200 set c = c - 20, d = c where c = 105;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (105, 85, null, b, 15).
+-- fail (no partition key update, so no attempt to move the tuple, but "a = 'a'" violates the partition constraint enforced by the root partition)
+update part_b_10_b_20 set a = 'a';
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (null, 1, 96, 12, a).
+-- success; partition key update, no constraint violation
+update range_parted set d = d - 10 where d > 10;
+-- success; no partition key update, no constraint violation
+update range_parted set e = d;
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d | e 
+----------------+---+----+-----+---+---
+ part_a_10_a_20 | a | 10 | 200 | 1 | 1
+ part_a_1_a_10  | a |  1 |   1 | 1 | 1
+ part_d_1_15    | b | 12 | 116 | 1 | 1
+ part_d_1_15    | b | 13 | 117 | 2 | 2
+ part_d_1_15    | b | 15 | 125 | 6 | 6
+ part_d_1_15    | b | 17 | 125 | 9 | 9
+(6 rows)
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+ERROR:  new row for relation "part_d_1_15" violates partition constraint
+DETAIL:  Failing row contains (2, 117, 2, b, 7).
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_a_1_a_10 | a |  4 | 200 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+(6 rows)
+
+-- update partition key using updatable view.
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120, 1, 1).
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120, 1, 1).
+-- succeeds, row movement, check option passes
+update upview set a = 'b', b = 15 where b = 4;
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+ part_d_1_15   | b | 15 | 199 | 1 | 1
+(6 rows)
+
+-- cleanup
+drop view upview;
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+ range_parted  | a | b  | c  | d  | e 
+---------------+---+----+----+----+---
+ (b,15,95,16,) | b | 15 | 95 | 16 | 
+ (b,17,95,19,) | b | 17 | 95 | 19 | 
+(2 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  95 | 16 | 
+ part_c_1_100   | b | 17 |  95 | 19 | 
+(6 rows)
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,110,1,), (b,13,98,2,), (b,15,106,16,), (b,17,106,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 13 |  98 |  2 | 
+ part_d_15_20   | b | 15 | 106 | 16 | 
+ part_d_15_20   | b | 17 | 106 | 19 | 
+ part_d_1_15    | b | 12 | 110 |  1 | 
+(6 rows)
+
+:init_range_parted;
+-- Enabling OLD TABLE capture for both DELETE and UPDATE statement triggers
+-- should not cause DELETEd rows to be captured twice. The same applies to
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,146,1,), (b,13,147,2,), (b,15,155,16,), (b,17,155,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_15_20   | b | 15 | 155 | 16 | 
+ part_d_15_20   | b | 17 | 155 | 19 | 
+ part_d_1_15    | b | 12 | 146 |  1 | 
+ part_d_1_15    | b | 13 | 147 |  2 | 
+(6 rows)
+
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+-- Install BR triggers on the child partitions, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d1_15 before update or insert on part_d_1_15
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d15_20 before update or insert on part_d_15_20
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 15 |  98 |  2 | 
+ part_d_15_20   | b | 17 | 106 | 16 | 
+ part_d_15_20   | b | 19 | 106 | 19 | 
+ part_d_1_15    | b | 15 | 110 |  1 | 
+(6 rows)
+
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_15_20   | b | 17 | 155 | 16 | 
+ part_d_15_20   | b | 19 | 155 | 19 | 
+ part_d_1_15    | b | 15 | 146 |  1 | 
+ part_d_1_15    | b | 16 | 147 |  2 | 
+(6 rows)
+
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_d1_15 ON part_d_1_15;
+drop trigger trig_d15_20 ON part_d_15_20;
+drop function func_parted_mod_b();
+-- RLS policies with update-row-movement
+-----------------------------------------
+ALTER TABLE range_parted ENABLE ROW LEVEL SECURITY;
+create user regress_range_parted_user;
+GRANT ALL ON range_parted, mintab TO regress_range_parted_user;
+create policy seeall ON range_parted as PERMISSIVE for SELECT using ( true);
+create policy policy_range_parted on range_parted for UPDATE USING (true) WITH CHECK (c % 2 = 0);
+:init_range_parted;
+-- This should fail with an RLS violation error while moving the row from
+-- part_a_10_a_20 to part_d_1_15, because we are setting 'c' to an odd number.
+set session authorization regress_range_parted_user;
+update range_parted set a = 'b', c = 151 where a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy for table "range_parted"
+reset session authorization;
+-- Create a trigger on part_d_1_15
+create function func_d_1_15() returns trigger as $$
+begin
+   NEW.c = NEW.c + 1; -- Make even number odd, or vice versa
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_d_1_15 before insert on part_d_1_15
+   for each row execute procedure func_d_1_15();
+:init_range_parted;
+set session authorization regress_range_parted_user;
+-- Here, RLS checks should succeed while moving the row from part_a_10_a_20 to
+-- part_d_1_15, because the trigger makes the 'c' value an even number.
+update range_parted set a = 'b', c = 151 where a = 'a' and c = 200;
+reset session authorization;
+:init_range_parted;
+set session authorization regress_range_parted_user;
+-- This should fail with an RLS violation error because the trigger makes the
+-- 'c' value an odd number.
+update range_parted set a = 'b', c = 150 where a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy for table "range_parted"
+-- Cleanup
+reset session authorization;
+drop trigger trig_d_1_15 ON part_d_1_15;
+drop function func_d_1_15();
+-- Policy expression contains SubPlan
+reset session authorization;
+:init_range_parted;
+create policy policy_range_parted_subplan on range_parted AS RESTRICTIVE for UPDATE USING (true)
+    WITH CHECK ((SELECT range_parted.c <= c1 FROM mintab ));
+set session authorization regress_range_parted_user;
+-- Should fail because mintab has row with c1 = 120
+update range_parted set a = 'b', c = 122 where a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy "policy_range_parted_subplan" for table "range_parted"
+-- Should pass
+update range_parted set a = 'b', c = 120 where a = 'a' and c = 200;
+-- RLS policy expression contains whole row.
+reset session authorization;
+:init_range_parted;
+create policy policy_range_parted_wholerow on range_parted AS RESTRICTIVE for UPDATE USING (true)
+   WITH CHECK (range_parted = row('b', 10, 112, 1, NULL)::range_parted);
+set session authorization regress_range_parted_user;
+-- Should succeed the RLS check
+update range_parted set a = 'b', c = 112 where a = 'a' and c = 200;
+reset session authorization;
+:init_range_parted;
+set session authorization regress_range_parted_user;
+-- The whole row RLS check should fail
+update range_parted set a = 'b', c = 116 where a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy "policy_range_parted_wholerow" for table "range_parted"
+-- Cleanup
+reset session authorization;
+drop policy policy_range_parted ON range_parted;
+drop policy policy_range_parted_subplan ON range_parted;
+drop policy policy_range_parted_wholerow ON range_parted;
+revoke all ON range_parted, mintab FROM regress_range_parted_user ;
+drop user regress_range_parted_user;
+-- statement triggers with update row movement
+---------------------------------------------------
+:init_range_parted;
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_1_15
+create trigger d1_delete_trig
+  after delete on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_update_trig
+  after update on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_insert_trig
+  after insert on part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+create trigger d15_delete_trig
+  after delete on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_update_trig
+  after update on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_insert_trig
+  after insert on part_d_15_20 for each statement execute procedure trigfunc();
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+NOTICE:  trigger = parent_update_trig fired on table range_parted during UPDATE
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 150 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  55 | 16 | 
+ part_c_1_100   | b | 17 |  55 | 19 | 
+(6 rows)
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger d1_delete_trig ON part_d_1_15;
+drop trigger d1_update_trig ON part_d_1_15;
+drop trigger d1_insert_trig ON part_d_1_15;
+drop trigger d15_delete_trig ON part_d_15_20;
+drop trigger d15_update_trig ON part_d_15_20;
+drop trigger d15_insert_trig ON part_d_15_20;
+drop table mintab;
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
-                                  Table "public.part_def"
- Column |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description 
---------+---------+-----------+----------+---------+----------+--------------+-------------
- a      | text    |           |          |         | extended |              | 
- b      | integer |           |          |         | plain    |              | 
+                                       Table "public.part_def"
+ Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description 
+--------+-------------------+-----------+----------+---------+----------+--------------+-------------
+ a      | text              |           |          |         | extended |              | 
+ b      | bigint            |           |          |         | plain    |              | 
+ c      | numeric           |           |          |         | main     |              | 
+ d      | integer           |           |          |         | plain    |              | 
+ e      | character varying |           |          |         | extended |              | 
 Partition of: range_parted DEFAULT
-Partition constraint: (NOT ((a IS NOT NULL) AND (b IS NOT NULL) AND (((a = 'a'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'a'::text) AND (b >= 10) AND (b < 20)) OR ((a = 'b'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'b'::text) AND (b >= 10) AND (b < 20)))))
+Partition constraint: (NOT ((a IS NOT NULL) AND (b IS NOT NULL) AND (((a = 'a'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'a'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'b'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '20'::bigint) AND (b < '30'::bigint)))))
 
 insert into range_parted values ('c', 9);
 -- ok
@@ -235,7 +640,55 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 ERROR:  new row for relation "part_def" violates partition constraint
-DETAIL:  Failing row contains (a, 9).
+DETAIL:  Failing row contains (a, 9, null, null, null).
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from non-default to default partition.
+-- Fail; the default partition is not under part_a_10_a_20.
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+ERROR:  new row for relation "part_a_10_a_20" violates partition constraint
+DETAIL:  Failing row contains (ad, 10, 200, 1, null).
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+:show_data;
+ partname | a  | b  |  c  | d  | e 
+----------+----+----+-----+----+---
+ part_def | ad |  1 |   1 |  1 | 
+ part_def | ad | 10 | 200 |  1 | 
+ part_def | bd | 12 |  96 |  1 | 
+ part_def | bd | 13 |  97 |  2 | 
+ part_def | bd | 15 | 105 | 16 | 
+ part_def | bd | 17 | 105 | 19 | 
+ part_def | d  |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
 create table list_parted (
 	a text,
 	b int
@@ -250,6 +703,111 @@ ERROR:  new row for relation "list_default" violates partition constraint
 DETAIL:  Failing row contains (a, 10).
 -- ok
 update list_default set a = 'x' where a = 'd';
+drop table list_parted;
+--------------
+-- UPDATE of the partition key or of non-partition columns, with
+-- partitions having different column ordering, and with row
+-- triggers.
+--------------
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+-- Test partition constraint violation when an intermediate ancestor is used and
+-- the constraint is inherited from the topmost root.
+update sub_parted set a = 2 where c = 10;
+ERROR:  new row for relation "sub_part2" violates partition constraint
+DETAIL:  Failing row contains (2, 10, 2).
+-- UPDATE that does not modify the partition key of the partitions chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Triggers can cause UPDATE row movement if they modify the partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+drop trigger parted_mod_b ON sub_part1 ;
+-- If a BR DELETE trigger prevents the DELETE from happening, we should also skip
+-- the INSERT when that DELETE is part of an UPDATE => DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+drop trigger trig_skip_delete ON sub_part1 ;
+-- UPDATE of the partition key with a FROM clause. If the join produces multiple
+-- output rows for the same row to be modified, we should tuple-route the row
+-- only once, so that no duplicate rows get inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 |  1 | 70
+ list_part1 | 2 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+drop table non_parted;
+drop function func_parted_mod_b();
 -- create custom operator class and hash function, for the same reason
 -- explained in alter_table.sql
 create or replace function dummy_hashint4(a int4, seed int8) returns int8 as
@@ -271,9 +829,8 @@ insert into hpart4 values (3, 4);
 update hpart1 set a = 3, b=4 where a = 1;
 ERROR:  new row for relation "hpart1" violates partition constraint
 DETAIL:  Failing row contains (3, 4).
+-- ok : row movement
 update hash_parted set b = b - 1 where b = 1;
-ERROR:  new row for relation "hpart1" violates partition constraint
-DETAIL:  Failing row contains (1, 0).
 -- ok
 update hash_parted set b = b + 8 where b = 1;
 -- cleanup
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 0c70d64..53c6441 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,25 +107,311 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+
+---------------------------
+-- UPDATE with row movement
+---------------------------
+
+-- An update to a partition should check the partition bound constraint for the
+-- new tuple. If the partition key is updated, the row should be moved to the
+-- appropriate partition. Updatable views on partitioned tables should enforce
+-- their check options for rows that have been moved.
+create table mintab(c1 int);
+insert into mintab values (120);
+CREATE TABLE range_parted (
 	a text,
-	b int
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
 ) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 from mintab) WITH CHECK OPTION;
+
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+create table part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+alter table range_parted attach partition part_b_20_b_30 for values from ('b', 20) to ('b', 30);
+create table part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) partition by range (c);
 create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+alter table range_parted attach partition part_b_10_b_20 for values from ('b', 10) to ('b', 20);
+create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
+create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
+
+-- This tests partition-key UPDATE on a partitioned table that does not have any child partitions
+update part_b_10_b_20 set b = b - 6;
+
+-- As mentioned above, the partition creation is intentionally kept in descending bound order.
+create table part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) partition by range (abs(d));
+alter table part_c_100_200 drop column e, drop column c, drop column a;
+alter table part_c_100_200 add column c numeric, add column e varchar, add column a text;
+alter table part_c_100_200 drop column b;
+alter table part_c_100_200 add column b bigint;
+create table part_d_1_15 partition of part_c_100_200 for values from (1) to (15);
+create table part_d_15_20 partition of part_c_100_200 for values from (15) to (20);
+
+alter table part_b_10_b_20 attach partition part_c_100_200 for values from (100) to (200);
+
+create table part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+alter table part_b_10_b_20 attach partition part_c_1_100 for values from (1) to (100);
+
+\set init_range_parted 'truncate range_parted; insert into range_parted values (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted order by 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+
+-- The subplans should be listed in bound order
+explain (costs off) update range_parted set c = c - 50 where c > 97;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_c_100_200 set c = c - 20, d = c where c = 105;
+-- fail (no partition key update, so no attempt to move the tuple, but "a = 'a'" violates the partition constraint enforced by the root partition)
+update part_b_10_b_20 set a = 'a';
+-- success; partition key update, no constraint violation
+update range_parted set d = d - 10 where d > 10;
+-- success; no partition key update, no constraint violation
+update range_parted set e = d;
+-- No row found :
+update part_c_1_100 set c = c + 20 where c = 98;
+-- ok (row movement)
+update part_b_10_b_20 set c = c + 20 returning c, b, a;
+:show_data;
+
+-- fail (row movement happens only within the partition subtree) :
+update part_b_10_b_20 set b = b - 6 where c > 116 returning *;
+-- ok (row movement, with subset of rows moved into different partition)
+update range_parted set b = b - 6 where c > 116 returning a, b + c;
+
+:show_data;
+
+-- update partition key using updatable view.
+
+-- succeeds
+update upview set c = 199 where b = 4;
+-- fail, check option violation
+update upview set c = 120 where b = 4;
+-- fail, row movement with check option violation
+update upview set a = 'b', b = 15, c = 120 where b = 4;
+-- succeeds, row movement, check option passes
+update upview set a = 'b', b = 15 where b = 4;
+
+:show_data;
+
+-- cleanup
+drop view upview;
+
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+update range_parted set c = 95 where a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+:show_data;
+
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+
+create function trans_updatetrigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' order by a) from old_table),
+                 (select string_agg(new_table::text, ', ' order by a) from new_table);
+    return null;
+  end;
+$$;
+
+create trigger trans_updatetrig
+  after update on range_parted referencing old table as old_table new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+
+-- Enabling OLD TABLE capture for both DELETE and UPDATE statement triggers
+-- should not cause DELETEd rows to be captured twice. The same applies to
+-- INSERT triggers and inserted rows.
+create trigger trans_deletetrig
+  after delete on range_parted referencing old table as old_table
+  for each statement execute procedure trans_updatetrigfunc();
+create trigger trans_inserttrig
+  after insert on range_parted referencing new table as new_table
+  for each statement execute procedure trans_updatetrigfunc();
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+drop trigger trans_updatetrig ON range_parted;
+drop trigger trans_deletetrig ON range_parted;
+drop trigger trans_inserttrig ON range_parted;
+
+-- Install BR triggers on the child partitions, so that transition tuple conversion takes place.
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = NEW.b + 1;
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_c1_100 before update or insert on part_c_1_100
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d1_15 before update or insert on part_d_1_15
+   for each row execute procedure func_parted_mod_b();
+create trigger trig_d15_20 before update or insert on part_d_15_20
+   for each row execute procedure func_parted_mod_b();
+:init_range_parted;
+update range_parted set c = (case when c = 96 then 110 else c + 1 end ) where a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+update range_parted set c = c + 50 where a = 'b' and b > 10 and c >= 96;
+:show_data;
+drop trigger trig_c1_100 ON part_c_1_100;
+drop trigger trig_d1_15 ON part_d_1_15;
+drop trigger trig_d15_20 ON part_d_15_20;
+drop function func_parted_mod_b();
+
+-- RLS policies with update-row-movement
+-----------------------------------------
+
+ALTER TABLE range_parted ENABLE ROW LEVEL SECURITY;
+create user regress_range_parted_user;
+GRANT ALL ON range_parted, mintab TO regress_range_parted_user;
+create policy seeall ON range_parted as PERMISSIVE for SELECT using ( true);
+create policy policy_range_parted on range_parted for UPDATE USING (true) WITH CHECK (c % 2 = 0);
+
+:init_range_parted;
+-- This should fail with RLS violation error while moving row from
+-- part_a_10_a_20 to part_d_1_15, because we setting 'c' to an odd number.
+set session authorization regress_range_parted_user;
+update range_parted set a = 'b', c = 151 where a = 'a' and c = 200;
+
+reset session authorization;
+-- Create a trigger on part_d_1_15
+create function func_d_1_15() returns trigger as $$
+begin
+   NEW.c = NEW.c + 1; -- Make even number odd, or vice versa
+   return NEW;
+end $$ language plpgsql;
+create trigger trig_d_1_15 before insert on part_d_1_15
+   for each row execute procedure func_d_1_15();
+
+:init_range_parted;
+set session authorization regress_range_parted_user;
+
+-- Here, RLS checks should succeed while moving row from part_a_10_a_20 to
+-- part_d_1_15, because trigger makes 'c' value an even number.
+update range_parted set a = 'b', c = 151 where a = 'a' and c = 200;
+
+reset session authorization;
+:init_range_parted;
+set session authorization regress_range_parted_user;
+-- This should fail with RLS violation error because trigger makes 'c' value
+-- an odd number.
+update range_parted set a = 'b', c = 150 where a = 'a' and c = 200;
+
+-- Cleanup
+reset session authorization;
+drop trigger trig_d_1_15 ON part_d_1_15;
+drop function func_d_1_15();
+
+-- Policy expression contains SubPlan
+reset session authorization;
+:init_range_parted;
+create policy policy_range_parted_subplan on range_parted AS RESTRICTIVE for UPDATE USING (true)
+    WITH CHECK ((SELECT range_parted.c <= c1 FROM mintab ));
+set session authorization regress_range_parted_user;
+-- Should fail because mintab has row with c1 = 120
+update range_parted set a = 'b', c = 122 where a = 'a' and c = 200;
+-- Should pass
+update range_parted set a = 'b', c = 120 where a = 'a' and c = 200;
+
+-- RLS policy expression contains whole row.
+
+reset session authorization;
+:init_range_parted;
+create policy policy_range_parted_wholerow on range_parted AS RESTRICTIVE for UPDATE USING (true)
+   WITH CHECK (range_parted = row('b', 10, 112, 1, NULL)::range_parted);
+set session authorization regress_range_parted_user;
+-- Should succeed the RLS check
+update range_parted set a = 'b', c = 112 where a = 'a' and c = 200;
+reset session authorization;
+:init_range_parted;
+set session authorization regress_range_parted_user;
+-- The whole row RLS check should fail
+update range_parted set a = 'b', c = 116 where a = 'a' and c = 200;
+
+-- Cleanup
+reset session authorization;
+drop policy policy_range_parted ON range_parted;
+drop policy policy_range_parted_subplan ON range_parted;
+drop policy policy_range_parted_wholerow ON range_parted;
+revoke all ON range_parted, mintab FROM regress_range_parted_user ;
+drop user regress_range_parted_user;
+
+
+-- statement triggers with update row movement
+---------------------------------------------------
+
+:init_range_parted;
+
+create function trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+create trigger parent_delete_trig
+  after delete on range_parted for each statement execute procedure trigfunc();
+create trigger parent_update_trig
+  after update on range_parted for each statement execute procedure trigfunc();
+create trigger parent_insert_trig
+  after insert on range_parted for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_1_100
+create trigger c1_delete_trig
+  after delete on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_update_trig
+  after update on part_c_1_100 for each statement execute procedure trigfunc();
+create trigger c1_insert_trig
+  after insert on part_c_1_100 for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_d_1_15
+create trigger d1_delete_trig
+  after delete on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_update_trig
+  after update on part_d_1_15 for each statement execute procedure trigfunc();
+create trigger d1_insert_trig
+  after insert on part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+create trigger d15_delete_trig
+  after delete on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_update_trig
+  after update on part_d_15_20 for each statement execute procedure trigfunc();
+create trigger d15_insert_trig
+  after insert on part_d_15_20 for each statement execute procedure trigfunc();
+
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or insert statement triggers should be fired.
+update range_parted set c = c - 50 where c > 97;
+:show_data;
+
+drop trigger parent_delete_trig ON range_parted;
+drop trigger parent_update_trig ON range_parted;
+drop trigger parent_insert_trig ON range_parted;
+drop trigger c1_delete_trig ON part_c_1_100;
+drop trigger c1_update_trig ON part_c_1_100;
+drop trigger c1_insert_trig ON part_c_1_100;
+drop trigger d1_delete_trig ON part_d_1_15;
+drop trigger d1_update_trig ON part_d_1_15;
+drop trigger d1_insert_trig ON part_d_1_15;
+drop trigger d15_delete_trig ON part_d_15_20;
+drop trigger d15_update_trig ON part_d_15_20;
+drop trigger d15_insert_trig ON part_d_15_20;
+
+drop table mintab;
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
--- ok
-update range_parted set b = b + 1 where b = 10;
 
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
 insert into range_parted values ('c', 9);
@@ -134,6 +420,21 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 
+:show_data;
+
+-- Update row movement from non-default to default partition.
+-- Fail, default partition is not under part_a_10_a_20;
+update part_a_10_a_20 set a = 'ad' where a = 'a';
+-- Success
+update range_parted set a = 'ad' where a = 'a';
+update range_parted set a = 'bd' where a = 'b';
+:show_data;
+-- Update row movement from default to non-default partitions.
+-- Success
+update range_parted set a = 'a' where a = 'ad';
+update range_parted set a = 'b' where a = 'bd';
+:show_data;
+
 create table list_parted (
 	a text,
 	b int
@@ -148,6 +449,84 @@ update list_default set a = 'a' where a = 'd';
 -- ok
 update list_default set a = 'x' where a = 'd';
 
+drop table list_parted;
+
+--------------
+-- UPDATE with
+-- partition key or non-partition columns, with different column ordering,
+-- triggers.
+--------------
+
+-- Setup
+--------
+create table list_parted (a numeric, b int, c int8) partition by list (a);
+create table sub_parted partition of list_parted for values in (1) partition by list (b);
+
+create table sub_part1(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part1 for values in (1);
+create table sub_part2(b int, c int8, a numeric);
+alter table sub_parted attach partition sub_part2 for values in (2);
+
+create table list_part1(a numeric, b int, c int8);
+alter table list_parted attach partition list_part1 for values in (2,3);
+
+insert into list_parted values (2,5,50);
+insert into list_parted values (3,6,60);
+insert into sub_parted values (1,1,60);
+insert into sub_parted values (1,2,10);
+
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+update sub_parted set a = 2 where c = 10;
+
+-- UPDATE which does not modify partition key of partitions that are chosen for update.
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+update list_parted set b = c + a where a = 2;
+select tableoid::regclass::text , * from list_parted where a = 2 order by 1;
+
+
+-----------
+-- Triggers can cause UPDATE row movement if it modified partition key.
+-----------
+create function func_parted_mod_b() returns trigger as $$
+begin
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+end $$ language plpgsql;
+create trigger parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1
+update list_parted set c = 70 where b  = 1 ;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger parted_mod_b ON sub_part1 ;
+
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+create or replace function func_parted_mod_b() returns trigger as $$
+begin return NULL; end $$ language plpgsql;
+create trigger trig_skip_delete before delete on sub_part1
+   for each row execute procedure func_parted_mod_b();
+update list_parted set b = 1 where c = 70;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+
+drop trigger trig_skip_delete ON sub_part1 ;
+
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+create table non_parted (id int);
+insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
+update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
+select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
+drop table non_parted;
+
+drop function func_parted_mod_b();
+
 -- create custom operator class and hash function, for the same reason
 -- explained in alter_table.sql
 create or replace function dummy_hashint4(a int4, seed int8) returns int8 as
@@ -169,6 +548,7 @@ insert into hpart4 values (3, 4);
 
 -- fail
 update hpart1 set a = 3, b=4 where a = 1;
+-- ok : row movement
 update hash_parted set b = b - 1 where b = 1;
 -- ok
 update hash_parted set b = b + 8 where b = 1;
#226Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#225)
Re: [HACKERS] UPDATE of partition key

On Fri, Jan 5, 2018 at 7:12 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

The above patch is to be applied over the last remaining preparatory
patch, now named (and attached) :
0001-Refactor-CheckConstraint-related-code.patch

Committed that one, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#227David Rowley
david.rowley@2ndquadrant.com
In reply to: David Rowley (#221)
Re: [HACKERS] UPDATE of partition key

On 4 January 2018 at 02:52, David Rowley <david.rowley@2ndquadrant.com> wrote:

I'll try to look at the tests tomorrow and also do some testing.

I've made a pass over the tests. Again, sometimes I'm probably a bit
pedantic. The reason for that is that the tests are not that easy to
follow. Moving creation and cleanup of objects closer to where they're
used and no longer needed makes it easier to read through and verify
the tests. There are some genuine mistakes in there too.

1.

NEW.c = NEW.c + 1; -- Make even number odd, or vice versa

This seems to be worded as if there'd only ever be one number. I think
it should be plural and read "Make even numbers odd, and vice versa"

2. The following comment does not make a huge amount of sense.

-- UPDATE with
-- partition key or non-partition columns, with different column ordering,
-- triggers.

Should "or" be "on"? Does ", triggers" mean "with triggers"?

3. The follow test tries to test a BEFORE DELETE trigger stopping a
DELETE on sub_part1, but going by the SELECT, there are no rows in
that table to stop being DELETEd.

select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
tableoid | a | b | c
------------+---+----+----
list_part1 | 2 | 52 | 50
list_part1 | 3 | 6 | 60
sub_part2 | 1 | 2 | 10
sub_part2 | 1 | 2 | 70
(4 rows)

drop trigger parted_mod_b ON sub_part1 ;
-- If BR DELETE trigger prevented DELETE from happening, we should also skip
-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
create or replace function func_parted_mod_b() returns trigger as $$
begin return NULL; end $$ language plpgsql;
create trigger trig_skip_delete before delete on sub_part1
for each row execute procedure func_parted_mod_b();
update list_parted set b = 1 where c = 70;
select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
tableoid | a | b | c
------------+---+----+----
list_part1 | 2 | 52 | 50
list_part1 | 3 | 6 | 60
sub_part1 | 1 | 1 | 70
sub_part2 | 1 | 2 | 10
(4 rows)

You've added the BEFORE DELETE trigger to sub_part1, but you can see
the tuple was DELETEd from sub_part2 and INSERTed into sub_part1, so
the test is not working as you've commented.

It's probably a good idea to RAISE NOTICE 'something useful here'; in
the trigger function to verify they're actually being called in the
test.

4. I think the final drop function in the following should be before
the UPDATE FROM test. You've already done some cleanup for that test
by doing "drop trigger trig_skip_delete ON sub_part1 ;"

drop trigger trig_skip_delete ON sub_part1 ;
-- UPDATE partition-key with FROM clause. If join produces multiple output
-- rows for the same row to be modified, we should tuple-route the row
only once.
-- There should not be any rows inserted.
create table non_parted (id int);
insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
tableoid | a | b | c
------------+---+----+----
list_part1 | 2 | 1 | 70
list_part1 | 2 | 2 | 10
list_part1 | 2 | 52 | 50
list_part1 | 3 | 6 | 60
(4 rows)

drop table non_parted;
drop function func_parted_mod_b();

Also, there's a space before the ; in the drop trigger above. Can that
be removed?

5. The following comment:

-- update to a partition should check partition bound constraint for
the new tuple.
-- If partition key is updated, the row should be moved to the appropriate
-- partition. updatable views using partitions should enforce the check options
-- for the rows that have been moved.

Can this be changed a bit? I think it's not accurate to say that an
update to a partition key causes the row to move. The row movement
only occurs when the new tuple does not match the partition bound and
another partition exists that does have a partition bound that matches
the tuple. How about:

-- When a partitioned table receives an UPDATE to the partitioned key and the
-- new values no longer meet the partition's bound, the row must be moved to
-- the correct partition for the new partition key (if one exists). We must
-- also ensure that updatable views on partitioned tables properly enforce any
-- WITH CHECK OPTION that is defined. The situation with triggers in this case
-- also requires thorough testing as partition key updates causing row
-- movement convert UPDATEs into DELETE+INSERT.

6. What does the following actually test?

-- This tests partition-key UPDATE on a partitioned table that does
not have any child partitions
update part_b_10_b_20 set b = b - 6;

There are no records in that partition, or anywhere in the hierarchy.
Are you just testing that there's no error? If so then the comment
should say so.

7. I think the following comment:

-- As mentioned above, the partition creation is intentionally kept in
descending bound order.

should instead say:

-- Create some more partitions following the above pattern of descending bound
-- order, but let's make the situation a bit more complex by having the
-- attribute numbers of the columns vary from their parent partition.

8. Just to make the tests a bit easier to follow, can you move the
following down to where you're first using it:

create table mintab(c1 int);
insert into mintab values (120);

and

CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1
from mintab) WITH CHECK OPTION;

9. It seems that the existing part of update.sql capitalises SQL
keywords, but you mostly don't. I understand we're not always
consistent, but can you keep it the same as the existing part of the
file?

10. Stray space before trailing ':'

-- fail (row movement happens only within the partition subtree) :

11. Can the following become:

-- succeeds, row movement , check option passes

-- success, update with row movement, check option passes:

Seems there's also quite a mix of comment formats in your tests.

You're using either one of; ok, success, succeeds followed by
sometimes a comma, and sometimes a reason in parentheses. The existing
part of the file seems to use:

-- fail, <reason>:

and just

-- <reason>:

for non-failures.

Would be great to stick to what's there.

12. The following comment seems to indicate that you're installing
triggers on all leaf partitions, but that's not the case:

-- Install BR triggers on child partition, so that transition tuple
conversion takes place.

maybe you should write "on some child partitions"? Or did you mean to
define a trigger on them all?

13. Stray space at the end of the case statement:

update range_parted set c = (case when c = 96 then 110 else c + 1 end
) where a = 'b' and b > 10 and c >= 96;

14. Stray space in the USING clause:

create policy seeall ON range_parted as PERMISSIVE for SELECT using ( true);

15. we -> we're

-- part_a_10_a_20 to part_d_1_15, because we setting 'c' to an odd number.

16. The comment probably should be before the "update range_parted",
not the "set session authorization":

-- This should fail with RLS violation error while moving row from
-- part_a_10_a_20 to part_d_1_15, because we setting 'c' to an odd number.
set session authorization regress_range_parted_user;
update range_parted set a = 'b', c = 151 where a = 'a' and c = 200;

17. trigger -> the trigger function

-- part_d_1_15, because trigger makes 'c' value an even number.

likewise in:

-- This should fail with RLS violation error because trigger makes 'c' value
-- an odd number.

18. Why two RESET SESSION AUTHORIZATIONs?

reset session authorization;
drop trigger trig_d_1_15 ON part_d_1_15;
drop function func_d_1_15();
-- Policy expression contains SubPlan
reset session authorization;

19. The following should be cleaned up in the final test that its used
on rather than randomly after the next test after it:

drop table mintab;

20. Comment is not worded very well:

-- UPDATE which does not modify partition key of partitions that are
chosen for update.

Does "partitions that are chosen for update" mean "the UPDATE target"?

I'm also not quite sure what the test is testing. In the past I've
written tests that have a header comment as -- Ensure that <what the
test is testing>. Perhaps if you can't think of what you're ensuring
with the test, then the test might not be that worthwhile.

21. The following comment could be improved:

-- Triggers can cause UPDATE row movement if it modified partition key.

Might be better to write:

-- Tests for BR UPDATE triggers changing the partition key.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#228Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#222)
Re: [HACKERS] UPDATE of partition key

On Thu, Jan 4, 2018 at 1:18 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

------------------
1. ExecUpdate() needs to revert back tcs_map value changed by ExecInsert()
------------------

It seems like the ON CONFLICT stuff handled that by adding a
second TransitionCaptureState pointer to ModifyTable, thus
mt_transition_capture and mt_oc_transition_capture. By that
precedent, we could add mt_utr_transition_capture or similar, and
maybe that's the way to go. It seems a bit unsatisfying, but so does
what you have now.

In case of ON CONFLICT, if there are both INSERT and UPDATE statement
triggers referencing transition tables, both of the triggers need to
independently populate their own transition tables, and hence the need
for two separate transition states: mt_transition_capture and
mt_oc_transition_capture. But in case of update-tuple-routing, the
INSERT statement trigger won't come into picture. So the same
mt_transition_capture can serve the purpose of populating the
transition table with OLD and NEW rows. So I think it would be too
redundant, if not incorrect, to have a whole new transition state for
update tuple routing.

I will see if it turns out better to have two tcs_maps in
TransitionCaptureState, one for update and one for insert. But this,
on first look, does not look good.

Suppose TransitionCaptureState has separate maps, upd_del_tcs_maps and
insert_tcs_maps for UPDATE/DELETE and INSERT events respectively.

That's not what I suggested. If you look at what I wrote, I floated
the idea of having two TransitionCaptureStates, not two separate maps
within the same TransitionCaptureState.

------------------
2. mt_childparent_tupconv_maps is indexed by subplan or partition leaf index.
------------------
------------------
3. Renaming of mt_transition_tupconv_maps to mt_childparent_tupconv_maps
------------------

We need to change it's name because now this map is not only used for
transition capture, but also for update-tuple-routing. Does it look ok
for you if, for readability, we keep the childparent tag ? Or else, we
can just make it "mt_tupconv_maps", but "mt_childparent_tupconv_maps"
looks more informative.

I see your point: the array is being renamed because it now has more
than one purpose. But that's also what I'm complaining about with
regard to point #2: the same array is being used for more than one
purpose. That's generally bad style. If you have two loops in a
function, it's best to declare two separate loop variables rather than
reusing the same variable. This lets the compiler detect, for
example, an error where the second loop variable is used before it's
initialized, which would be undetectable if you reused the same
variable in both places. Although that particular benefit doesn't
pertain in this case, I maintain that having a single structure member
that is indexed one of two different ways is a bad idea.

If I understand correctly, the way we got here is that, in earlier
patch versions, you had two arrays of maps, but it wasn't clear why we
needed both of them, and David suggested replacing one of them with an
array of indexes instead, in the hopes of reducing confusion.
However, it looks to me like that didn't really work out. If we
always needed both maps, or even if we always needed the per-leaf map,
it would have been a good idea, but it seems here that we can need
either the per-leaf map or the per-subplan map or both or neither, and
we want to avoid computing all of the per-leaf conversion maps if we
only need per-subplan access.

I think one way to fix this might be to build the per-leaf maps on
demand. Just because we're doing UPDATE tuple routing doesn't
necessarily mean we'll actually need a TupleConversionMap for every
child. So we could allocate an array with one byte per leaf, where 0
means we don't know whether tuple conversion is necessary, 1 means it
is not, and 2 means it is, or something like that. Then we have a
second array with conversion maps. We provide a function
tupconv_map_for_leaf() or similar that checks the array; if it finds
1, it returns NULL; if it finds 2, it returns the conversion map
previously calculated. If it finds 0, it calls convert_tuples_by_name,
caches the result for later, updates the one-byte-per-leaf array with
the appropriate value, and returns the just-computed conversion map.
(The reason I'm suggesting 0/1/2 instead of just true/false is to
reduce cache misses; if we find a 1 in the first array we don't need
to access the second array at all.)

If that doesn't seem like a good idea for some reason, then my second
choice would be to leave mt_transition_tupconv_maps named the way it
is currently and have a separate mt_update_tupconv_maps, with the two
pointing, if both are initialized and as far as possible, to the same
TupleConversionMap objects.

-------------------
4. Explicit signaling for "we are only here for transition tables"
-------------------

I had given a thought on this earlier. I felt, even the pre-existing
conditions like "!trigdesc->trig_update_after_row" are all indirect
ways to determine that this function is called only to capture
transition tables, and thought that it may have been better to have
separate parameter transition_table_only.

I see your point. I guess it's not really this patch's job to solve
this problem, although I think this is going to need some refactoring
in the not-too-distant future. So I think the way you did it is
probably OK.

Instead of adding another parameter to AfterTriggerSaveEvent(), I had
also considered another approach: Put the transition-tuples-capture
logic part of AfterTriggerSaveEvent() into a helper function
CaptureTransitionTables(). In ExecInsert() and ExecDelete(), instead
of calling ExecARUpdateTriggers(), call this function
CaptureTransitionTables(). I then dropped this idea and thought rather
to call ExecARUpdateTriggers() which neatly does the required checks
and other things like locking the old tuple via GetTupleForTrigger().
So if we go by CaptureTransitionTables(), we would need to do what
ExecARUpdateTriggers() does before calling CaptureTransitionTables().
This is doable. If you think this is worth doing so as to get rid of
the "(oldtup == NULL) ^ (newtup == NULL)" condition, we can do that.

Duplicating logic elsewhere to avoid this problem here doesn't seem
like a good plan.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#229Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#226)
Re: [HACKERS] UPDATE of partition key

On Fri, Jan 5, 2018 at 3:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jan 5, 2018 at 7:12 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

The above patch is to be applied over the last remaining preparatory
patch, now named (and attached) :
0001-Refactor-CheckConstraint-related-code.patch

Committed that one, too.

Some more comments on the main patch:

I don't really like the fact that ExecCleanupTupleRouting() now takes
a ModifyTableState as an argument, particularly because of the way
that is using that argument. To figure out whether a ResultRelInfo
was pre-existing or one it created, it checks whether the pointer
address of the ResultRelInfo is >= mtstate->resultRelInfo and <
mtstate->resultRelInfo + mtstate->mt_nplans. However, that means that
ExecCleanupTupleRouting() ends up knowing about the memory allocation
pattern used by ExecInitModifyTable(), which seems like a slightly
dangerous amount of action at a distance. I think it would be better
for the PartitionTupleRouting structure to explicitly indicate which
ResultRelInfos should be closed, for example by storing a Bitmapset
*input_partitions. (Here, by "input", I mean "provided from the
mtstate rather than created by the PartitionTupleRouting structure";
other naming suggestions welcome.) When
ExecSetupPartitionTupleRouting latches onto a partition, it can do
proute->input_partitions = bms_add_member(proute->input_partitions, i).
In ExecCleanupTupleRouting, it can do if
(bms_is_member(i, proute->input_partitions)) continue.
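The ownership-tracking idea above, closing only the relations that tuple routing opened itself, can be sketched with a simple bitmap. Note this uses a plain 64-bit mask as a stand-in for PostgreSQL's `Bitmapset`, and `count_to_close` merely counts where the real code would call `ExecCloseIndices`/`heap_close`; all names here are illustrative, not from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Stand-in for a Bitmapset recording which partition ResultRelInfos were
 * supplied by the ModifyTableState ("input") rather than opened by the
 * PartitionTupleRouting structure. Assumes fewer than 64 partitions.
 */
typedef uint64_t PartBitmap;

static void
mark_input_partition(PartBitmap *input_partitions, int i)
{
    *input_partitions |= (UINT64_C(1) << i);
}

static int
is_input_partition(PartBitmap input_partitions, int i)
{
    return (int) ((input_partitions >> i) & 1);
}

/*
 * Cleanup loop: skip relations owned by the mtstate and count the ones
 * tuple routing would have to close itself.
 */
static int
count_to_close(PartBitmap input_partitions, int nparts)
{
    int n = 0;

    for (int i = 0; i < nparts; i++)
    {
        if (is_input_partition(input_partitions, i))
            continue;           /* owned by mtstate; leave it open */
        n++;                    /* real code would close rel + indexes here */
    }
    return n;
}
```

The point of the design is that cleanup no longer needs to know that `ExecInitModifyTable` allocated its ResultRelInfos in one contiguous array; membership in the set is the only contract.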

We have a test, in the regression test suite for file_fdw, which
generates the message "cannot route inserted tuples to a foreign
table". I think we should have a similar test for the case where an
UPDATE tries to move a tuple from a regular partition to a foreign
table partition. I'm not sure if it should fail with the same error
or a different one, but I think we should have a test that it fails
cleanly and with a nice error message of some sort.

The comment for get_partitioned_child_rels() claims that it sets
is_partition_key_update, but it really sets *is_partition_key_update.
And I think instead of "is a partition key" it should say "is used in
the partition key either of the relation whose RTI is specified or of
any child relation." I propose "used in" instead of "is" because
there can be partition expressions, and the rest is to clarify that
child partition keys matter.

create_modifytable_path uses partColsUpdated rather than
partKeyUpdated, which actually seems like better terminology. I
propose partKeyUpdated -> partColsUpdated everywhere. Also, why use
is_partition_key_update for basically the same thing in some other
places? I propose changing that to partColsUpdated as well.

The capitalization of the first comment hunk in execPartition.h is strange.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#230Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: David Rowley (#227)
1 attachment(s)
Re: [HACKERS] UPDATE of partition key

On 4 January 2018 at 02:52, David Rowley <david.rowley@2ndquadrant.com> wrote:

1.

NEW.c = NEW.c + 1; -- Make even number odd, or vice versa

This seems to be worded as if there'd only ever be one number. I think
it should be plural and read "Make even numbers odd, and vice versa"

Done.

2. The following comment does not make a huge amount of sense.

-- UPDATE with
-- partition key or non-partition columns, with different column ordering,
-- triggers.

Should "or" be "on"? Does ", triggers" mean "with triggers"?

Actually I was trying to summarize what kinds of scenarios are going
to be tested. Now I think we don't have to give this summary. Rather,
we should describe each of the scenarios individually. But I did want
to use list partitions at least in a subset of update-partition-key
scenarios. So I have removed this comment, and replaced it by :

-- Some more update-partition-key test scenarios below. This time use list
-- partitions.

3. The follow test tries to test a BEFORE DELETE trigger stopping a
DELETE on sub_part1, but going by the SELECT, there are no rows in
that table to stop being DELETEd.

select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
tableoid | a | b | c
------------+---+----+----
list_part1 | 2 | 52 | 50
list_part1 | 3 | 6 | 60
sub_part2 | 1 | 2 | 10
sub_part2 | 1 | 2 | 70
(4 rows)

drop trigger parted_mod_b ON sub_part1 ;
-- If BR DELETE trigger prevented DELETE from happening, we should also skip
-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
create or replace function func_parted_mod_b() returns trigger as $$
begin return NULL; end $$ language plpgsql;
create trigger trig_skip_delete before delete on sub_part1
for each row execute procedure func_parted_mod_b();
update list_parted set b = 1 where c = 70;
select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
tableoid | a | b | c
------------+---+----+----
list_part1 | 2 | 52 | 50
list_part1 | 3 | 6 | 60
sub_part1 | 1 | 1 | 70
sub_part2 | 1 | 2 | 10
(4 rows)

You've added the BEFORE DELETE trigger to sub_part1, but you can see
the tuple was DELETEd from sub_part2 and INSERTed into sub_part1, so
the test is not working as you've commented.

It's probably a good idea to RAISE NOTICE 'something useful here'; in
the trigger function to verify they're actually being called in the
test.

Done. The trigger should have been on sub_part2, not sub_part1. Corrected that.
Also, I dropped the trigger and tested the UPDATE again.

4. I think the final drop function in the following should be before
the UPDATE FROM test. You've already done some cleanup for that test
by doing "drop trigger trig_skip_delete ON sub_part1 ;"

drop trigger trig_skip_delete ON sub_part1 ;
-- UPDATE partition-key with FROM clause. If join produces multiple output
-- rows for the same row to be modified, we should tuple-route the row
-- only once.
-- There should not be any rows inserted.
create table non_parted (id int);
insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
tableoid | a | b | c
------------+---+----+----
list_part1 | 2 | 1 | 70
list_part1 | 2 | 2 | 10
list_part1 | 2 | 52 | 50
list_part1 | 3 | 6 | 60
(4 rows)

drop table non_parted;
drop function func_parted_mod_b();

Done. Moved it to the relevant place.

Also, there's a space before the ; in the drop trigger above. Can that
be removed?

Removed.

5. The following comment:

-- update to a partition should check partition bound constraint for
-- the new tuple.
-- If partition key is updated, the row should be moved to the appropriate
-- partition. updatable views using partitions should enforce the check options
-- for the rows that have been moved.

Can this be changed a bit? I think it's not accurate to say that an
update to a partition key causes the row to move. The row movement
only occurs when the new tuple does not match the partition bound and
another partition exists that does have a partition bound that matches
the tuple. How about:

-- When a partitioned table receives an UPDATE to the partitioned key and the
-- new values no longer meet the partition's bound, the row must be moved to
-- the correct partition for the new partition key (if one exists). We must
-- also ensure that updatable views on partitioned tables properly enforce any
-- WITH CHECK OPTION that is defined. The situation with triggers in this case
-- also requires thorough testing as partition key updates causing row
-- movement convert UPDATEs into DELETE+INSERT.

Done.
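
As an aside, for anyone following the thread, the row-movement behavior that
comment describes can be sketched minimally as below (the table names here are
illustrative, not the ones used in the regression tests):

```sql
-- The row moves only because the new key value no longer satisfies pt1's
-- bound and another partition (pt2) accepts it.
CREATE TABLE pt (a int, b text) PARTITION BY LIST (a);
CREATE TABLE pt1 PARTITION OF pt FOR VALUES IN (1);
CREATE TABLE pt2 PARTITION OF pt FOR VALUES IN (2);
INSERT INTO pt VALUES (1, 'x');
-- Behind the scenes this is a DELETE from pt1 plus an INSERT into pt2:
UPDATE pt SET a = 2;
SELECT tableoid::regclass, * FROM pt;  -- the row now reports pt2
```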

6. What does the following actually test?

-- This tests partition-key UPDATE on a partitioned table that does
-- not have any child partitions
update part_b_10_b_20 set b = b - 6;

There are no records in that partition, or anywhere in the hierarchy.
Are you just testing that there's no error? If so then the comment
should say so.

Yes, I understand that there won't be any update scan plans. But with
the modifications done in ExecInitModifyTable(), I wanted to run that
code in this scenario where there are no partitions, to make sure it
does not behave weirdly or crash. Any suggestions for the comment, given
this perspective? For now, I have worded the comment this way:

-- Check that partition-key UPDATE works sanely on a partitioned table
-- that does not have any child partitions.

7. I think the following comment:

-- As mentioned above, the partition creation is intentionally kept in
-- descending bound order.

should instead say:

-- Create some more partitions following the above pattern of descending bound
-- order, but let's make the situation a bit more complex by having the
-- attribute numbers of the columns vary from their parent partition.

Done.

8. Just to make the tests a bit easier to follow, can you move the
following down to where you're first using it:

create table mintab(c1 int);
insert into mintab values (120);

and

CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1
from mintab) WITH CHECK OPTION;

Done.

9. It seems that the existing part of update.sql capitalises SQL
keywords, but you mostly don't. I understand we're not always
consistent, but can you keep it the same as the existing part of the
file?

Done.

10. Stray space before trailing ':'

-- fail (row movement happens only within the partition subtree) :

Done; also fixed at other applicable places.

11. Can the following become:

-- succeeds, row movement , check option passes

-- success, update with row movement, check option passes:

Seems there's also quite a mix of comment formats in your tests.

You're using either one of; ok, success, succeeds followed by
sometimes a comma, and sometimes a reason in parentheses. The existing
part of the file seems to use:

-- fail, <reason>:

and just

-- <reason>:

for non-failures.

Would be great to stick to what's there.

There were existing lines where "ok, " was used.
So I have now used these forms everywhere:
ok, ...
fail, ...

12. The following comment seems to indicate that you're installing
triggers on all leaf partitions, but that's not the case:

-- Install BR triggers on child partition, so that transition tuple
-- conversion takes place.

maybe you should write "on some child partitions"? Or did you mean to
define a trigger on them all?

The trigger needs to be installed at least on the partitions into which rows
are moved. I have corrected the comment accordingly.

Actually, testing transition tuple conversion with update row movement
requires a statement-level trigger that references transition tables,
and trans_updatetrig had already been dropped. So transition tuple
conversion for rows being inserted was not getting tested (I had tested
it manually, though). So I have moved the drop statement further down.

13. Stray space at the end of the case statement:

update range_parted set c = (case when c = 96 then 110 else c + 1 end
) where a = 'b' and b > 10 and c >= 96;

Done.

14. Stray space in the USING clause:

create policy seeall ON range_parted as PERMISSIVE for SELECT using ( true);

Done.

15. we -> we're
-- part_a_10_a_20 to part_d_1_15, because we setting 'c' to an odd number.

Changed it to "we are".

16. The comment probably should be before the "update range_parted",
not the "set session authorization":
-- This should fail with RLS violation error while moving row from
-- part_a_10_a_20 to part_d_1_15, because we setting 'c' to an odd number.
set session authorization regress_range_parted_user;
update range_parted set a = 'b', c = 151 where a = 'a' and c = 200;

Moved "set session authorization" statement above the comment.

17. trigger -> the trigger function

-- part_d_1_15, because trigger makes 'c' value an even number.

likewise in:

-- This should fail with RLS violation error because trigger makes 'c' value
-- an odd number.

I have made changes to the comment to make it clearer. The comment now
contains the phrase "trigger at the destination partition again makes it
an even number". With that phrasing, "trigger function at the destination
partition" would read oddly, so I think "trigger at the destination
partition makes ..." is fine. It is implied that it is the trigger
function that actually changes the value.
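
To make the wording concrete, here is a sketch of the kind of trigger being
discussed; the partition name dest_partition and the function name are
hypothetical, not the ones from the test file:

```sql
-- A BR INSERT trigger on the destination partition that forces 'c' even;
-- a row moved into this partition can thus pass or fail an RLS policy
-- that depends on the parity of 'c'.
CREATE FUNCTION func_make_c_even() RETURNS trigger AS $$
BEGIN
  IF NEW.c % 2 = 1 THEN
    NEW.c := NEW.c + 1;  -- make an odd value even
  END IF;
  RETURN NEW;
END $$ LANGUAGE plpgsql;

CREATE TRIGGER trig_make_c_even BEFORE INSERT ON dest_partition
  FOR EACH ROW EXECUTE PROCEDURE func_make_c_even();
```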

18. Why two RESET SESSION AUTHORIZATIONs?

reset session authorization;
drop trigger trig_d_1_15 ON part_d_1_15;
drop function func_d_1_15();
-- Policy expression contains SubPlan
reset session authorization;

The second reset is actually in a different paragraph. The reason it's
there is to ensure we have reset it regardless of the earlier cleanup.

19. The following should be cleaned up in the final test it's used
in, rather than randomly after the test that follows it:

drop table mintab;

Done.

20. Comment is not worded very well:

-- UPDATE which does not modify partition key of partitions that are
chosen for update.

Does "partitions that are chosen for update" mean "the UPDATE target"?

Actually it means the partitions participating in the update subplans,
i.e. the unpruned ones.

I have modified the comment as :
-- Test update-partition-key, where the unpruned partitions do not have their
-- partition keys updated.

I'm also not quite sure what the test is testing. In the past I've
written tests that have a header comment as -- Ensure that <what the
test is testing>. Perhaps if you can't think of what you're ensuring
with the test, then the test might not be that worthwhile.

I am just testing that the update behaves sanely in this particular scenario.

BTW, it was a conscious decision that in this particular scenario we
still conclude internally that update tuple routing is needed, and do
the tuple-routing setup.

21. The following comment could be improved:

-- Triggers can cause UPDATE row movement if it modified partition key.

Might be better to write:

-- Tests for BR UPDATE triggers changing the partition key.

Done.

I have also addressed the following suggestion of yours:
22. In copy.c CopyFrom() you have the following code:

/*
* We might need to convert from the parent rowtype to the
* partition rowtype.
*/
map = proute->partition_tupconv_maps[leaf_part_index];
if (map)
{
Relation partrel = resultRelInfo->ri_RelationDesc;

tuple = do_convert_tuple(tuple, map);

/*
* We must use the partition's tuple descriptor from this
* point on. Use a dedicated slot from this point on until
* we're finished dealing with the partition.
*/
slot = proute->partition_tuple_slot;
Assert(slot != NULL);
ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
ExecStoreTuple(tuple, slot, InvalidBuffer, true);
}

Should this use ConvertPartitionTupleSlot() instead?

Attached is the v35 patch. Thanks.

Attachments:

update-partition-key_v35.patch (application/octet-stream)
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index b1167a4..6d97f26 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3005,6 +3005,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition, one whose partition constraint the new row satisfies.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3302,9 +3307,22 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</command> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</command> causes a row to move from one
+       partition to another, there is a chance that another concurrent
+       <command>UPDATE</command> or <command>DELETE</command> misses this row.
+       Suppose session 1 is performing an <command>UPDATE</command> on a
+       partition key, and meanwhile a concurrent session 2, for which this
+       row is visible, performs an <command>UPDATE</command> or
+       <command>DELETE</command> operation on this row. Session 2 can
+       silently miss the row if the row is deleted from the partition due
+       to session 1's activity.  In such a case, session 2's
+       <command>UPDATE</command>/<command>DELETE</command>, being unaware of
+       the row movement, interprets that the row has just been deleted so there
+       is nothing to be done for this row. Whereas, in the usual case where the
+       table is not partitioned, or where there is no row movement, session 2
+       would have identified the newly updated row and carried out the
+       <command>UPDATE</command>/<command>DELETE</command> on this new row
+       version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index c0d0f71..296e301 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,16 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</command> and
+   <command>INSERT</command> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</command> or <command>DELETE</command> on the
+   same row may miss this row. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations"/>.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index bf5d3f9..8f83e6a 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -154,6 +154,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to move
+    to another partition, it will be performed as a <command>DELETE</command>
+    from the original partition followed by an <command>INSERT</command> into
+    the new partition. In this case, all row-level <literal>BEFORE</literal>
+    <command>UPDATE</command> triggers and all row-level
+    <literal>BEFORE</literal> <command>DELETE</command> triggers are fired on
+    the original partition. Then all row-level <literal>BEFORE</literal>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</literal>
+    triggers are concerned, <literal>AFTER</literal> <command>DELETE</command>
+    and <literal>AFTER</literal> <command>INSERT</command> triggers are
+    applied; but <literal>AFTER</literal> <command>UPDATE</command> triggers
+    are not applied because the <command>UPDATE</command> has been converted to
+    a <command>DELETE</command> and an <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 6bfca2a..51fc961 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2587,7 +2587,6 @@ CopyFrom(CopyState cstate)
 		if (cstate->partition_tuple_routing)
 		{
 			int			leaf_part_index;
-			TupleConversionMap *map;
 			PartitionTupleRouting *proute = cstate->partition_tuple_routing;
 
 			/*
@@ -2668,23 +2667,10 @@ CopyFrom(CopyState cstate)
 			 * We might need to convert from the parent rowtype to the
 			 * partition rowtype.
 			 */
-			map = proute->partition_tupconv_maps[leaf_part_index];
-			if (map)
-			{
-				Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-				tuple = do_convert_tuple(tuple, map);
-
-				/*
-				 * We must use the partition's tuple descriptor from this
-				 * point on.  Use a dedicated slot from this point on until
-				 * we're finished dealing with the partition.
-				 */
-				slot = proute->partition_tuple_slot;
-				Assert(slot != NULL);
-				ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-				ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-			}
+			tuple = ConvertPartitionTupleSlot(proute->partition_tupconv_maps[leaf_part_index],
+											  tuple,
+											  proute->partition_tuple_slot,
+											  &slot);
 
 			tuple->t_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
 		}
@@ -2833,7 +2819,7 @@ CopyFrom(CopyState cstate)
 
 	/* Close all the partitioned tables, leaf partitions, and their indices */
 	if (cstate->partition_tuple_routing)
-		ExecCleanupTupleRouting(cstate->partition_tuple_routing);
+		ExecCleanupTupleRouting(NULL, cstate->partition_tuple_routing);
 
 	/* Close any trigger target relations */
 	ExecCleanUpTriggerState(estate);
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 1c488c3..e8af18e 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2854,8 +2854,13 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 	{
 		HeapTuple	trigtuple;
 
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
+		/*
+		 * Note: if the UPDATE is converted into a DELETE+INSERT as part of
+		 * update-partition-key operation, then this function is also called
+		 * separately for DELETE and INSERT to capture transition table rows.
+		 * In such a case, either the old tuple or the new tuple can be NULL.
+		 */
+		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
 			trigtuple = GetTupleForTrigger(estate,
 										   NULL,
 										   relinfo,
@@ -5414,7 +5419,12 @@ AfterTriggerPendingOnRel(Oid relid)
  *	triggers actually need to be queued.  It is also called after each row,
  *	even if there are no triggers for that event, if there are any AFTER
  *	STATEMENT triggers for the statement which use transition tables, so that
- *	the transition tuplestores can be built.
+ *	the transition tuplestores can be built.  Furthermore, if the transition
+ *	capture is happening for UPDATEd rows being moved to another partition due
+ *	to the partition-key being changed, then this function is called once when
+ *	the row is deleted (to capture OLD row), and once when the row is inserted
+ *	into another partition (to capture NEW row).  This is done separately because
+ *	DELETE and INSERT happen on different tables.
  *
  *	Transition tuplestores are built now, rather than when events are pulled
  *	off of the queue because AFTER ROW triggers are allowed to select from the
@@ -5463,12 +5473,25 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		bool		update_new_table = transition_capture->tcs_update_new_table;
 		bool		insert_new_table = transition_capture->tcs_insert_new_table;;
 
-		if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_old_table))
+		/*
+		 * For INSERT events newtup should be non-NULL, for DELETE events
+		 * oldtup should be non-NULL, whereas for UPDATE events normally both
+		 * oldtup and newtup are non-NULL.  But for UPDATE events fired for
+		 * capturing transition tuples during UPDATE partition-key row
+		 * movement, oldtup is NULL when the event is for a row being inserted,
+		 * whereas newtup is NULL when the event is for a row being deleted.
+		 */
+		Assert(!(event == TRIGGER_EVENT_DELETE && delete_old_table &&
+				 oldtup == NULL));
+		Assert(!(event == TRIGGER_EVENT_INSERT && insert_new_table &&
+				 newtup == NULL));
+
+		if (oldtup != NULL &&
+			((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_old_table)))
 		{
 			Tuplestorestate *old_tuplestore;
 
-			Assert(oldtup != NULL);
 			old_tuplestore = transition_capture->tcs_private->old_tuplestore;
 
 			if (map != NULL)
@@ -5481,12 +5504,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			else
 				tuplestore_puttuple(old_tuplestore, oldtup);
 		}
-		if ((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_new_table))
+		if (newtup != NULL &&
+			((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
+			(event == TRIGGER_EVENT_UPDATE && update_new_table)))
 		{
 			Tuplestorestate *new_tuplestore;
 
-			Assert(newtup != NULL);
 			new_tuplestore = transition_capture->tcs_private->new_tuplestore;
 
 			if (original_insert_tuple != NULL)
@@ -5502,11 +5525,18 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				tuplestore_puttuple(new_tuplestore, newtup);
 		}
 
-		/* If transition tables are the only reason we're here, return. */
+		/*
+		 * If transition tables are the only reason we're here, return. As
+		 * mentioned above, we can also be here during update tuple routing in
+		 * presence of transition tables, in which case this function is called
+		 * separately for oldtup and newtup, so we expect exactly one of them
+		 * to be NULL.
+		 */
 		if (trigdesc == NULL ||
 			(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
 			(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+			(event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
 			return;
 	}
 
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 8c0d2df..3f7b5f4 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -54,7 +54,11 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
+	ResultRelInfo *leaf_part_arr = NULL,
+				  *update_rri = NULL;
+	int			num_update_rri = 0,
+				update_rri_index = 0;
+	bool		is_update = false;
 	PartitionTupleRouting *proute;
 
 	/*
@@ -73,6 +77,52 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 		(TupleConversionMap **) palloc0(proute->num_partitions *
 										sizeof(TupleConversionMap *));
 
+	/* Initialization specific to update */
+	if (mtstate && mtstate->operation == CMD_UPDATE)
+	{
+		ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+
+		is_update = true;
+		update_rri = mtstate->resultRelInfo;
+		num_update_rri = list_length(node->plans);
+
+		/*
+		 * For updates, if the leaf partition is already present in the
+		 * per-subplan result rels, we re-use that rather than initialize a new
+		 * result rel. The per-subplan resultrels and the resultrels of the
+		 * leaf partitions are both in the same canonical order. So while going
+		 * through the leaf partition oids, we need to keep track of the next
+		 * per-subplan result rel to be looked for in the leaf partition
+		 * resultrels. So, update_rri_index should be set to the first
+		 * per-subplan result rel (i.e. 0), and then should be shifted as we
+		 * find them one by one while scanning the leaf partition oids. (It is
+		 * already set to 0 during initialization, above).
+		 */
+
+		/*
+		 * Prepare for generating the mapping from subplan result rels to leaf
+		 * partition position.
+		 */
+		proute->subplan_partition_offsets =
+			palloc(num_update_rri * sizeof(int));
+
+		/*
+		 * For UPDATEs, we require an additional tuple slot for storing
+		 * transient tuples that are converted to the root table descriptor.
+		 */
+		proute->root_tuple_slot = MakeTupleTableSlot();
+	}
+	else
+	{
+		/*
+		 * For inserts, we need to create all new result rels, so avoid
+		 * repeated pallocs by allocating memory for all the result rels in
+		 * bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(proute->num_partitions *
+												  sizeof(ResultRelInfo));
+	}
+
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
 	 * given partition's rowtype.  It is attached to the caller-specified node
@@ -81,20 +131,67 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 	 */
 	proute->partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = (ResultRelInfo *) palloc0(proute->num_partitions *
-											  sizeof(ResultRelInfo));
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (is_update)
+		{
+			/* Is this leaf partition present in the update resultrel? */
+			if (update_rri_index < num_update_rri &&
+				RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
+			{
+				leaf_part_rri = &update_rri[update_rri_index];
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required when we convert the partition's tuple to be
+				 * compatible with the root partitioned table's tuple
+				 * descriptor.  When generating the per-subplan UPDATE result
+				 * rels, this was not set.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+
+				/*
+				 * Save the position of this update rel in the leaf partitions
+				 * array
+				 */
+				proute->subplan_partition_offsets[update_rri_index] = i;
+
+				update_rri_index++;
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = &leaf_part_arr[i];
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in proute->partitions are
-		 * eventually closed by the caller.
+		 * If we didn't open the partition rel, it means we haven't initialized
+		 * the result rel either.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above including the leaf
+			 * partitions. Note that each of the newly opened relations in
+			 * proute->partitions are eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri,
+							  partrel,
+							  resultRTindex,
+							  rel,
+							  estate->es_instrument);
+		}
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
@@ -105,14 +202,10 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 			convert_tuples_by_name(tupDesc, part_tupdesc,
 								   gettext_noop("could not convert row type"));
 
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  resultRTindex,
-						  rel,
-						  estate->es_instrument);
-
 		/*
-		 * Verify result relation is a valid target for INSERT.
+		 * Verify result relation is a valid target for an INSERT.  An UPDATE
+		 * of a partition-key becomes a DELETE+INSERT operation, so this check
+		 * is still required when the operation is CMD_UPDATE.
 		 */
 		CheckValidResultRel(leaf_part_rri, CMD_INSERT);
 
@@ -132,10 +225,16 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		proute->partitions[i] = leaf_part_rri++;
+		proute->partitions[i] = leaf_part_rri;
 		i++;
 	}
 
+	/*
+	 * For UPDATE, we should have found all the per-subplan resultrels in the
+	 * leaf partitions.
+	 */
+	Assert(!is_update || update_rri_index == num_update_rri);
+
 	return proute;
 }
 
@@ -259,15 +358,53 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 }
 
 /*
+ * ConvertPartitionTupleSlot -- convenience function for tuple conversion using
+ * 'map'. The tuple, if converted, is stored in 'new_slot', and 'p_my_slot' is
+ * updated with the 'new_slot'. 'new_slot' typically should be one of the
+ * dedicated partition tuple slots. If map is NULL, keeps p_my_slot unchanged.
+ *
+ * Returns the converted tuple, unless map is NULL, in which case original
+ * tuple is returned unmodified.
+ */
+HeapTuple
+ConvertPartitionTupleSlot(TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot,
+						  TupleTableSlot **p_my_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor, as per converted tuple.
+	 */
+	*p_my_slot = new_slot;
+	Assert(new_slot != NULL);
+	ExecSetSlotDescriptor(new_slot, map->outdesc);
+	ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
+/*
  * ExecCleanupTupleRouting -- Clean up objects allocated for partition tuple
  * routing.
  *
  * Close all the partitioned tables, leaf partitions, and their indices.
+ *
+ * 'mtstate' can be NULL if it is not available to the caller; e.g. for COPY.
+ * It is used only in case of updates, for accessing per-subplan result rels.
  */
 void
-ExecCleanupTupleRouting(PartitionTupleRouting * proute)
+ExecCleanupTupleRouting(ModifyTableState *mtstate,
+						PartitionTupleRouting * proute)
 {
 	int			i;
+	bool		is_update = (mtstate && mtstate->operation == CMD_UPDATE);
+	ResultRelInfo *first_resultRelInfo = NULL;
+	ResultRelInfo *last_resultRelInfo = NULL;
 
 	/*
 	 * Remember, proute->partition_dispatch_info[0] corresponds to the root
@@ -284,15 +421,34 @@ ExecCleanupTupleRouting(PartitionTupleRouting * proute)
 		ExecDropSingleTupleTableSlot(pd->tupslot);
 	}
 
+	/* Save the positions of first and last UPDATE subplan result rels */
+	if (is_update)
+	{
+		first_resultRelInfo = mtstate->resultRelInfo;
+		last_resultRelInfo = mtstate->resultRelInfo + mtstate->mt_nplans;
+	}
+
 	for (i = 0; i < proute->num_partitions; i++)
 	{
 		ResultRelInfo *resultRelInfo = proute->partitions[i];
 
+		/*
+		 * If this result rel is one of the UPDATE subplan result rels, let
+		 * ExecEndPlan() close it. For INSERT or COPY, this does not apply
+		 * because leaf partition result rels are always newly allocated.
+		 */
+		if (is_update &&
+			resultRelInfo >= first_resultRelInfo &&
+			resultRelInfo < last_resultRelInfo)
+			continue;
+
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
 
-	/* Release the standalone partition tuple descriptor, if any */
+	/* Release the standalone partition tuple descriptors, if any */
+	if (proute->root_tuple_slot)
+		ExecDropSingleTupleTableSlot(proute->root_tuple_slot);
 	if (proute->partition_tuple_slot)
 		ExecDropSingleTupleTableSlot(proute->partition_tuple_slot);
 }
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 55dff5b..e9c0b23 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -46,6 +46,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -63,7 +64,9 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 EState *estate,
 					 bool canSetTag,
 					 TupleTableSlot **returning);
-
+static void ExecSetupChildParentMap(ModifyTableState *mtstate, bool perleaf);
+static inline TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
+													int whichplan);
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
  * target relation's rowtype
@@ -266,6 +269,7 @@ ExecInsert(ModifyTableState *mtstate,
 	Oid			newId;
 	List	   *recheckIndexes = NIL;
 	TupleTableSlot *result = NULL;
+	TransitionCaptureState *ar_insert_trig_tcs;
 
 	/*
 	 * get the heap tuple out of the tuple table slot, making sure we have a
@@ -283,7 +287,6 @@ ExecInsert(ModifyTableState *mtstate,
 	{
 		int			leaf_part_index;
 		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-		TupleConversionMap *map;
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
@@ -332,8 +335,10 @@ ExecInsert(ModifyTableState *mtstate,
 				 * back to tuplestore format.
 				 */
 				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+
+				Assert(mtstate->mt_is_tupconv_perpart);
 				mtstate->mt_transition_capture->tcs_map =
-					mtstate->mt_transition_tupconv_maps[leaf_part_index];
+					mtstate->mt_childparent_tupconv_maps[leaf_part_index];
 			}
 			else
 			{
@@ -346,30 +351,20 @@ ExecInsert(ModifyTableState *mtstate,
 			}
 		}
 		if (mtstate->mt_oc_transition_capture != NULL)
+		{
+			Assert(mtstate->mt_is_tupconv_perpart);
 			mtstate->mt_oc_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[leaf_part_index];
+				mtstate->mt_childparent_tupconv_maps[leaf_part_index];
+		}
 
 		/*
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = proute->partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = proute->partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(proute->partition_tupconv_maps[leaf_part_index],
+										  tuple,
+										  proute->partition_tuple_slot,
+										  &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -450,6 +445,7 @@ ExecInsert(ModifyTableState *mtstate,
 	}
 	else
 	{
+		WCOKind		wco_kind;
+
 		/*
 		 * We always check the partition constraint, including when the tuple
 		 * got here via tuple-routing.  However we don't need to in the latter
@@ -467,14 +463,21 @@ ExecInsert(ModifyTableState *mtstate,
 		tuple->t_tableOid = RelationGetRelid(resultRelationDesc);
 
 		/*
-		 * Check any RLS INSERT WITH CHECK policies
+		 * Check any RLS WITH CHECK policies.
 		 *
+		 * Normally we check INSERT policies here. But if this insert is part
+		 * of an update row movement, we should instead check UPDATE policies,
+		 * because we are executing policies defined on the target table, not
+		 * those defined on the child partitions.
+		 */
+		wco_kind = (mtstate->operation == CMD_UPDATE) ?
+					WCO_RLS_UPDATE_CHECK : WCO_RLS_INSERT_CHECK;
+
+		/*
 		 * ExecWithCheckOptions() will skip any WCOs which are not of the kind
 		 * we are looking for at this point.
 		 */
 		if (resultRelInfo->ri_WithCheckOptions != NIL)
-			ExecWithCheckOptions(WCO_RLS_INSERT_CHECK,
-								 resultRelInfo, slot, estate);
+			ExecWithCheckOptions(wco_kind, resultRelInfo, slot, estate);
 
 		/*
 		 * No need though if the tuple has been routed, and a BR trigger
@@ -623,9 +626,33 @@ ExecInsert(ModifyTableState *mtstate,
 		setLastTid(&(tuple->t_self));
 	}
 
+	/*
+	 * If this INSERT is part of a partition-key-UPDATE and we are capturing
+	 * transition tuples, put this row into the transition NEW TABLE.
+	 * (Similarly, the deleted row is added to the OLD TABLE.)  We need to do
+	 * this separately for DELETE and INSERT because they happen on different
+	 * tables.
+	 */
+	ar_insert_trig_tcs = mtstate->mt_transition_capture;
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_new_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo, NULL,
+							 NULL,
+							 tuple,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * We've already captured the NEW TABLE row, so make sure any AR INSERT
+		 * trigger fired below doesn't capture it again.
+		 */
+		ar_insert_trig_tcs = NULL;
+	}
+
 	/* AFTER ROW INSERT Triggers */
 	ExecARInsertTriggers(estate, resultRelInfo, tuple, recheckIndexes,
-						 mtstate->mt_transition_capture);
+						 ar_insert_trig_tcs);
 
 	list_free(recheckIndexes);
 
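To make the capture hand-off above easier to review: when the INSERT is the second half of an update row movement, the new row goes into the NEW TABLE through the UPDATE trigger path, and the AR INSERT trigger is then handed a NULL capture state so it cannot capture the same row twice. Here is a minimal standalone sketch of that pattern; `CaptureState`, `capture_row` and `insert_row` are hypothetical stand-ins, not executor APIs:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for TransitionCaptureState: just counts NEW rows. */
typedef struct CaptureState
{
	int			new_table_rows;
} CaptureState;

static void
capture_row(CaptureState *cs)
{
	if (cs)
		cs->new_table_rows++;
}

/*
 * Mirrors the logic in ExecInsert(): when the INSERT implements update row
 * movement, the new row is captured through the UPDATE trigger path, and
 * the AR INSERT trigger is handed a NULL capture state so the same row
 * cannot enter the NEW TABLE twice.
 */
static void
insert_row(bool part_of_update_row_movement, CaptureState *cs)
{
	CaptureState *ar_insert_tcs = cs;

	if (part_of_update_row_movement && cs)
	{
		capture_row(cs);		/* ExecARUpdateTriggers(..., tcs) */
		ar_insert_tcs = NULL;	/* suppress double capture below */
	}
	capture_row(ar_insert_tcs);	/* ExecARInsertTriggers(..., ar_insert_tcs) */
}
```

Either way the row is captured exactly once, which is the invariant the patch needs.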
@@ -679,6 +706,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool *tupleDeleted,
+		   bool processReturning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -686,6 +715,10 @@ ExecDelete(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
+	TransitionCaptureState *ar_delete_trig_tcs;
+
+	if (tupleDeleted)
+		*tupleDeleted = false;
 
 	/*
 	 * get information on the (current) result relation
@@ -850,12 +883,40 @@ ldelete:;
 	if (canSetTag)
 		(estate->es_processed)++;
 
+	/* The delete has actually happened, so tell the caller about it */
+	if (tupleDeleted)
+		*tupleDeleted = true;
+
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition OLD TABLE, but only if we are capturing transition tuples.
+	 * We need to do this separately for DELETE and INSERT because they happen
+	 * on different tables.
+	 */
+	ar_delete_trig_tcs = mtstate->mt_transition_capture;
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_old_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo,
+							 tupleid,
+							 oldtuple,
+							 NULL,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * We've already captured the OLD TABLE row, so make sure any AR DELETE
+		 * trigger fired below doesn't capture it again.
+		 */
+		ar_delete_trig_tcs = NULL;
+	}
+
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
-						 mtstate->mt_transition_capture);
+						 ar_delete_trig_tcs);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (processReturning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -948,6 +1009,7 @@ ExecUpdate(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	TupleConversionMap *saved_tcs_map = NULL;
 
 	/*
 	 * abort the operation if not running transactions
@@ -1019,6 +1081,7 @@ ExecUpdate(ModifyTableState *mtstate,
 	else
 	{
 		LockTupleMode lockmode;
+		bool		partition_constraint_failed;
 
 		/*
 		 * Constraints might reference the tableoid column, so initialize
@@ -1034,22 +1097,142 @@ ExecUpdate(ModifyTableState *mtstate,
 		 * (We don't need to redo triggers, however.  If there are any BEFORE
 		 * triggers then trigger.c will have done heap_lock_tuple to lock the
 		 * correct tuple, so there's no need to do them again.)
-		 *
-		 * ExecWithCheckOptions() will skip any WCOs which are not of the kind
-		 * we are looking for at this point.
 		 */
 lreplace:;
-		if (resultRelInfo->ri_WithCheckOptions != NIL)
+
+		/*
+		 * If partition constraint fails, this row might get moved to another
+		 * partition, in which case we should check the RLS CHECK policy just
+		 * before inserting into the new partition, rather than doing it here.
+		 * This is because a trigger on that partition might again change the
+		 * row.  So skip the WCO checks if the partition constraint fails.
+		 */
+		partition_constraint_failed =
+			resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate);
+
+		if (!partition_constraint_failed &&
+			resultRelInfo->ri_WithCheckOptions != NIL)
+		{
+			/*
+			 * ExecWithCheckOptions() will skip any WCOs which are not of the
+			 * kind we are looking for at this point.
+			 */
 			ExecWithCheckOptions(WCO_RLS_UPDATE_CHECK,
 								 resultRelInfo, slot, estate);
+		}
+
+		/*
+		 * If a partition check failed, try to move the row into the right
+		 * partition.
+		 */
+		if (partition_constraint_failed)
+		{
+			bool		tuple_deleted;
+			TupleTableSlot *ret_slot;
+			PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+			int			map_index;
+			TupleConversionMap *tupconv_map;
+
+			/*
+			 * When an UPDATE is run directly on a leaf partition, we will not
+			 * have partition tuple routing set up. In that case, fail with a
+			 * partition constraint violation error.
+			 */
+			if (proute == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for DELETE. We want to return rows
+			 * from INSERT.
+			 */
+			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
+					   &tuple_deleted, false, false);
+
+			/*
+			 * If the DELETE didn't happen for some reason (e.g. a trigger
+			 * prevented it, it was already deleted by self, or it was
+			 * concurrently deleted by another transaction), then we skip the insert
+			 * as well; otherwise, an UPDATE could cause an increase in the
+			 * total number of rows across all partitions, which is clearly
+			 * wrong.
+			 *
+			 * For a normal UPDATE, the case where the tuple has been the
+			 * subject of a concurrent UPDATE or DELETE would be handled by the
+			 * EvalPlanQual machinery, but for an UPDATE that we've translated
+			 * into a DELETE from this partition and an INSERT into some other
+			 * partition, that's not available, because CTID chains can't span
+			 * relation boundaries.  We mimic the semantics to a limited extent
+			 * by skipping the INSERT if the DELETE fails to find a tuple. This
+			 * ensures that two concurrent attempts to UPDATE the same tuple at
+			 * the same time can't turn one tuple into two, and that an UPDATE
+			 * of a just-deleted tuple can't resurrect it.
+			 */
+			if (!tuple_deleted)
+				return NULL;
+
+			/*
+			 * Updates set the transition capture map only when a new subplan
+			 * is chosen.  But for inserts, it is set for each row. So after
+			 * INSERT, we need to revert to the map created for UPDATE;
+			 * otherwise the next UPDATE will incorrectly use the one created
+			 * for INSERT.  So first save the one created for UPDATE.
+			 */
+			if (mtstate->mt_transition_capture)
+				saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
+
+			/*
+			 * resultRelInfo is one of the per-subplan resultRelInfos.  So we
+			 * should convert the tuple into root's tuple descriptor, since
+			 * ExecInsert() starts the search from root.  The tuple conversion
+			 * map list is in the order of mtstate->resultRelInfo[], so to
+			 * retrieve the one for this resultRel, we need to know the
+			 * position of the resultRel in mtstate->resultRelInfo[].
+			 */
+			map_index = resultRelInfo - mtstate->resultRelInfo;
+			Assert(map_index >= 0 && map_index < mtstate->mt_nplans);
+			tupconv_map = tupconv_map_for_subplan(mtstate, map_index);
+			tuple = ConvertPartitionTupleSlot(tupconv_map,
+											  tuple,
+											  proute->root_tuple_slot,
+											  &slot);
+
+			/*
+			 * For ExecInsert(), make it look like we are inserting into the
+			 * root.
+			 */
+			Assert(mtstate->rootResultRelInfo != NULL);
+			estate->es_result_relation_info = mtstate->rootResultRelInfo;
+
+			ret_slot = ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+
+			/*
+			 * Restore the active result relation and the active transition
+			 * capture map that we changed above.
+			 */
+			estate->es_result_relation_info = resultRelInfo;
+			if (mtstate->mt_transition_capture)
+			{
+				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+				mtstate->mt_transition_capture->tcs_map = saved_tcs_map;
+			}
+			return ret_slot;
+		}
 
 		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We've already checked the partition constraint above; however, we
+		 * must still ensure the tuple passes all other constraints, so we will
+		 * call ExecConstraints() and have it validate all remaining checks.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate, true);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
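The concurrency argument in the hunk above (skip the INSERT whenever the DELETE found nothing, so an UPDATE can never turn one tuple into two) can be demonstrated with a toy model. This is a sketch only; the partitions-as-arrays representation and the `move_row`/`delete_row` helpers are invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>

#define NPARTS 2
#define CAP 8

/* Hypothetical model: each partition is a bag of integer keys. */
static int	parts[NPARTS][CAP];
static int	counts[NPARTS];

static bool
delete_row(int p, int key)
{
	for (int i = 0; i < counts[p]; i++)
	{
		if (parts[p][i] == key)
		{
			parts[p][i] = parts[p][--counts[p]];
			return true;	/* the "tupleDeleted" out-parameter */
		}
	}
	return false;			/* concurrently deleted, or suppressed */
}

static void
insert_row(int p, int key)
{
	parts[p][counts[p]++] = key;
}

/*
 * Row movement as in ExecUpdate(): DELETE from the source partition, and
 * only if that actually deleted something, INSERT into the destination.
 * Skipping the INSERT when the DELETE found nothing guarantees an UPDATE
 * can never increase the total row count across partitions.
 */
static void
move_row(int from, int to, int key)
{
	bool		tuple_deleted = delete_row(from, key);

	if (!tuple_deleted)
		return;				/* mimic "return NULL" in ExecUpdate() */
	insert_row(to, key);
}
```

Two sessions racing to move the same row reduce to two `move_row()` calls here: the second one finds nothing to delete and therefore inserts nothing, which is exactly the limited EvalPlanQual-like semantics the comment describes.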
@@ -1477,7 +1660,6 @@ static void
 ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 {
 	ResultRelInfo *targetRelInfo = getASTriggerResultRelInfo(mtstate);
-	int			i;
 
 	/* Check for transition tables on the directly targeted relation. */
 	mtstate->mt_transition_capture =
@@ -1500,62 +1682,149 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	if (mtstate->mt_transition_capture != NULL ||
 		mtstate->mt_oc_transition_capture != NULL)
 	{
-		int			numResultRelInfos;
-		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+		ExecSetupChildParentMap(mtstate,
+								(mtstate->mt_partition_tuple_routing != NULL));
+
+		/*
+		 * Install the conversion map for the first plan for UPDATE and DELETE
+		 * operations.  It will be advanced each time we switch to the next
+		 * plan.  (INSERT operations set it every time, so we need not update
+		 * mtstate->mt_oc_transition_capture here.)
+		 */
+		if (mtstate->mt_transition_capture && mtstate->operation != CMD_INSERT)
+			mtstate->mt_transition_capture->tcs_map =
+				tupconv_map_for_subplan(mtstate, 0);
+	}
+}
 
-		numResultRelInfos = (proute != NULL ?
-							 proute->num_partitions :
-							 mtstate->mt_nplans);
+/*
+ * Initialize the child-to-root tuple conversion map array.
+ *
+ * This map array is required for two purposes:
+ * 1. For update-tuple-routing. We need to convert the tuple from the subplan
+ *    result rel to the root partitioned table descriptor.
+ * 2. For capturing transition tuples when the target table is a partitioned
+ *    table. For updates, we need to convert the tuple from the subplan result
+ *    rel to the target table descriptor, and for inserts, we need to convert
+ *    the inserted tuple from the leaf partition to the target table
+ *    descriptor.
+ *
+ * The caller can request either a per-subplan map or per-leaf-partition map.
+ */
+static void
+ExecSetupChildParentMap(ModifyTableState *mtstate, bool perleaf)
+{
+	ResultRelInfo *rootRelInfo = getASTriggerResultRelInfo(mtstate);
+	PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+	TupleDesc	outdesc;
+	int			numResultRelInfos;
+	int			i;
 
+	/* First check if there is already one */
+	if (mtstate->mt_childparent_tupconv_maps)
+	{
 		/*
-		 * Build array of conversion maps from each child's TupleDesc to the
-		 * one used in the tuplestore.  The map pointers may be NULL when no
-		 * conversion is necessary, which is hopefully a common case for
-		 * partitions.
+		 * If a per-leaf map is requested and a map has already been
+		 * created, the existing map must be per-leaf. If it were
+		 * per-subplan, we would not be able to access the maps
+		 * leaf-partition-wise. A per-leaf map, on the other hand, can
+		 * still be accessed subplan-wise through the
+		 * subplan_partition_offsets array, using
+		 * tupconv_map_for_subplan(). So callers that may need to access
+		 * the map both leaf-partition-wise and subplan-wise must make
+		 * sure that the first call to this function is made with
+		 * perleaf=true, so that the map created is per-leaf, not per-subplan.
 		 */
-		mtstate->mt_transition_tupconv_maps = (TupleConversionMap **)
-			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+		Assert(!perleaf || mtstate->mt_is_tupconv_perpart);
+		return;
+	}
 
-		/* Choose the right set of partitions */
-		if (proute != NULL)
-		{
-			/*
-			 * For tuple routing among partitions, we need TupleDescs based on
-			 * the partition routing table.
-			 */
-			ResultRelInfo **resultRelInfos = proute->partitions;
+	/* If perleaf is true, partition tuple routing info has to be present */
+	Assert(!perleaf || proute != NULL);
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
-		}
-		else
-		{
-			/* Otherwise we need the ResultRelInfo for each subplan. */
-			ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+	numResultRelInfos = (perleaf ? proute->num_partitions :
+								   mtstate->mt_nplans);
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
+	/* Get tuple descriptor of the root partitioned table. */
+	outdesc = RelationGetDescr(rootRelInfo->ri_RelationDesc);
+
+	/*
+	 * Build array of conversion maps from each child's TupleDesc to the
+	 * one used in the tuplestore.  The map pointers may be NULL when no
+	 * conversion is necessary, which is hopefully a common case for
+	 * partitions.
+	 */
+	mtstate->mt_childparent_tupconv_maps = (TupleConversionMap **)
+		palloc(sizeof(TupleConversionMap *) * numResultRelInfos);
+
+	/* Choose the right set of partitions */
+	if (perleaf)
+	{
+		/*
+		 * For tuple routing among partitions, we need TupleDescs based on the
+		 * partition routing table.
+		 */
+		ResultRelInfo **resultRelInfos;
+
+		resultRelInfos = proute->partitions;
+
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_childparent_tupconv_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
+									   outdesc,
+									   gettext_noop("could not convert row type"));
 		}
 
 		/*
-		 * Install the conversion map for the first plan for UPDATE and DELETE
-		 * operations.  It will be advanced each time we switch to the next
-		 * plan.  (INSERT operations set it every time, so we need not update
-		 * mtstate->mt_oc_transition_capture here.)
+		 * Save the info that the tuple conversion map is per-leaf, not
+		 * per-subplan.
 		 */
-		if (mtstate->mt_transition_capture)
-			mtstate->mt_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[0];
+		mtstate->mt_is_tupconv_perpart = true;
+	}
+	else
+	{
+		/* Otherwise we need the ResultRelInfo for each subplan. */
+		ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_childparent_tupconv_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
+									   outdesc,
+									   gettext_noop("could not convert row type"));
+		}
+	}
+}
+
+/*
+ * For a given subplan index, get the tuple conversion map.
+ */
+static inline TupleConversionMap *
+tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
+{
+	Assert(mtstate->mt_childparent_tupconv_maps != NULL);
+
+	/*
+	 * If the tuple conversion map array is per-partition, we need to first get
+	 * the index into the partition array.
+	 */
+	if (mtstate->mt_is_tupconv_perpart)
+	{
+		int leaf_index;
+		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+
+		Assert(proute && proute->subplan_partition_offsets != NULL);
+		leaf_index = proute->subplan_partition_offsets[whichplan];
+
+		Assert(leaf_index >= 0 && leaf_index < proute->num_partitions);
+		return mtstate->mt_childparent_tupconv_maps[leaf_index];
+	}
+	else
+	{
+		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+		return mtstate->mt_childparent_tupconv_maps[whichplan];
 	}
 }
 
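The indexing scheme behind tupconv_map_for_subplan() may be easier to see in isolation: the map array holds one entry per leaf partition (or one per subplan), and an offsets array plays the role of proute->subplan_partition_offsets, translating a subplan index into its leaf-partition index. A minimal sketch, with strings standing in for TupleConversionMap pointers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of tupconv_map_for_subplan(): "maps" holds one entry
 * per leaf partition (when per_leaf) or one per subplan (when !per_leaf);
 * "offsets" translates a subplan index into its leaf-partition index, the
 * role played by proute->subplan_partition_offsets in the patch.
 */
static const char *
map_for_subplan(const char **maps, const int *offsets,
				bool per_leaf, int whichplan)
{
	int			index = per_leaf ? offsets[whichplan] : whichplan;

	return maps[index];
}
```

This is why the first caller needing per-leaf access must win: a per-leaf array can serve both access patterns via the offsets, but a per-subplan array cannot be indexed by leaf.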
@@ -1662,15 +1931,13 @@ ExecModifyTable(PlanState *pstate)
 				/* Prepare to convert transition tuples from this child. */
 				if (node->mt_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
 					node->mt_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						tupconv_map_for_subplan(node, node->mt_whichplan);
 				}
 				if (node->mt_oc_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
 					node->mt_oc_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						tupconv_map_for_subplan(node, node->mt_whichplan);
 				}
 				continue;
 			}
@@ -1787,7 +2054,8 @@ ExecModifyTable(PlanState *pstate)
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1831,9 +2099,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ResultRelInfo *saved_resultRelInfo;
 	ResultRelInfo *resultRelInfo;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	bool		update_tuple_routing_needed = node->partKeyUpdated;
 	PartitionTupleRouting *proute = NULL;
 	int			num_partitions = 0;
 
@@ -1908,6 +2179,16 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(resultRelInfo, mtstate->mt_onconflict != ONCONFLICT_NONE);
 
+		/*
+		 * If this is an UPDATE and a BEFORE UPDATE trigger is present, the
+		 * trigger itself might modify the partition-key values. So arrange for
+		 * tuple routing.
+		 */
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+			operation == CMD_UPDATE)
+			update_tuple_routing_needed = true;
+
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
 		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
@@ -1945,15 +2226,32 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/*
+	 * If it's not a partitioned table after all, UPDATE tuple routing should
+	 * not be attempted.
+	 */
+	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		update_tuple_routing_needed = false;
+
+	/*
+	 * Build state for tuple routing if it's an INSERT or if it's an UPDATE of
+	 * partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || update_tuple_routing_needed))
 	{
 		proute = mtstate->mt_partition_tuple_routing =
 			ExecSetupPartitionTupleRouting(mtstate,
 										   rel, node->nominalRelation,
 										   estate);
 		num_partitions = proute->num_partitions;
+
+		/*
+		 * These are required as reference objects for mapping partition
+		 * attnos in expressions such as WithCheckOptions and RETURNING.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	}
 
 	/*
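The conditions under which routing state gets built in the hunk above condense to a small predicate; the following is a hypothetical restatement for review purposes, not executor code (the Op enum and flag names are invented):

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
	OP_INSERT,
	OP_UPDATE,
	OP_DELETE
} Op;

/*
 * Hypothetical condensation of the checks in ExecInitModifyTable():
 * routing state is built for any INSERT into a partitioned table, and for
 * an UPDATE when either the plan says the partition key may change
 * (partKeyUpdated) or some result rel has a BEFORE UPDATE row trigger
 * that could change it.
 */
static bool
need_tuple_routing(bool is_partitioned_table, Op op,
				   bool part_key_updated, bool has_br_update_trigger)
{
	bool		update_routing = part_key_updated ||
		(op == OP_UPDATE && has_br_update_trigger);

	if (!is_partitioned_table)
		return false;
	return op == OP_INSERT || update_routing;
}
```

Note the BR trigger case: even when the planner proves the UPDATE statement itself cannot change the partition key, a trigger still can, so routing must be set up defensively.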
@@ -1964,6 +2262,17 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		ExecSetupTransitionCaptureState(mtstate, estate);
 
 	/*
+	 * Construct mapping from each of the per-subplan partition attnos to the
+	 * root attno.  This is required when during update row movement the tuple
+	 * descriptor of a source partition does not match the root partitioned
+	 * table descriptor.  In such a case we need to convert tuples to the root
+	 * tuple descriptor, because the search for the destination partition
+	 * starts from the root.  Skip this setup if it's not a partition key update.
+	 */
+	if (update_tuple_routing_needed)
+		ExecSetupChildParentMap(mtstate, false);
+
+	/*
 	 * Initialize any WITH CHECK OPTION constraints if needed.
 	 */
 	resultRelInfo = mtstate->resultRelInfo;
@@ -1993,26 +2302,29 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE row
+	 * movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *first_wcoList;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition. For UPDATE, however, there are as many WCO lists as
+		 * there are plans. In either case, use the WCO expression of the
+		 * first resultRelInfo as a reference to compute attnos for the WCO
+		 * expression of each of the partitions. We make a copy of the WCO
+		 * qual for each partition. Note that, if there are SubPlans in there,
+		 * they all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
-			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
+		Assert(update_tuple_routing_needed ||
+			   (operation == CMD_INSERT &&
+				list_length(node->withCheckOptionLists) == 1 &&
+				mtstate->mt_nplans == 1));
+
+		first_wcoList = linitial(node->withCheckOptionLists);
 		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
@@ -2021,17 +2333,26 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			ListCell   *ll;
 
 			resultRelInfo = proute->partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the update
+			 * result rels, that result rel would already have WithCheckOptions
+			 * initialized.
+			 */
+			if (resultRelInfo->ri_WithCheckOptions)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel, NULL);
+			mapped_wcoList = map_partition_varattnos(first_wcoList,
+													firstVarno,
+													partrel, firstResultRel,
+													NULL);
 			foreach(ll, mapped_wcoList)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
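The map_partition_varattnos() translation used above boils down to rewriting attribute numbers: a partition may number the same logical columns differently from the reference rel (e.g. after dropped columns), so Vars in the copied WCO/RETURNING expressions must be renumbered. A hypothetical sketch of just that renumbering step, with an expression reduced to a flat list of attnos:

```c
#include <assert.h>

/*
 * Hypothetical model of what map_partition_varattnos() does for each Var:
 * rewrite attribute numbers from a source relation's numbering into a
 * destination partition's numbering. attmap[i] gives the destination
 * attno for source attno i+1 (attnos are 1-based, as in PostgreSQL).
 */
static void
remap_attnos(int *attnos, int nattnos, const int *attmap)
{
	for (int i = 0; i < nattnos; i++)
		attnos[i] = attmap[attnos[i] - 1];
}
```

This is why a "simple translation of the varattnos" suffices and no per-partition planning is needed: the expression tree is identical, only the numbering differs.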
@@ -2048,7 +2369,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -2084,22 +2405,35 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *rlist;
 
 			resultRelInfo = proute->partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the update
+			 * result rels, that result rel would already have a returningList
+			 * built.
+			 */
+			if (resultRelInfo->ri_projectReturning)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel, NULL);
+			/*
+			 * Use the returning expression of the first resultRelInfo as a
+			 * reference to compute attnos for the returning expression of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList,
+											firstVarno,
+											partrel, firstResultRel, NULL);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
@@ -2361,7 +2695,7 @@ ExecEndModifyTable(ModifyTableState *node)
 
 	/* Close all the partitioned tables, leaf partitions, and their indices */
 	if (node->mt_partition_tuple_routing)
-		ExecCleanupTupleRouting(node->mt_partition_tuple_routing);
+		ExecCleanupTupleRouting(node, node->mt_partition_tuple_routing);
 
 	/*
 	 * Free the exprcontext
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index ddbbc79..5e27d8c 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -204,6 +204,7 @@ _copyModifyTable(const ModifyTable *from)
 	COPY_SCALAR_FIELD(canSetTag);
 	COPY_SCALAR_FIELD(nominalRelation);
 	COPY_NODE_FIELD(partitioned_rels);
+	COPY_SCALAR_FIELD(partKeyUpdated);
 	COPY_NODE_FIELD(resultRelations);
 	COPY_SCALAR_FIELD(resultRelIndex);
 	COPY_SCALAR_FIELD(rootResultRelIndex);
@@ -2263,6 +2264,7 @@ _copyPartitionedChildRelInfo(const PartitionedChildRelInfo *from)
 
 	COPY_SCALAR_FIELD(parent_relid);
 	COPY_NODE_FIELD(child_rels);
+	COPY_SCALAR_FIELD(is_partition_key_update);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 30ccc9c..9461bb7 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -908,6 +908,7 @@ _equalPartitionedChildRelInfo(const PartitionedChildRelInfo *a, const Partitione
 {
 	COMPARE_SCALAR_FIELD(parent_relid);
 	COMPARE_NODE_FIELD(child_rels);
+	COMPARE_SCALAR_FIELD(is_partition_key_update);
 
 	return true;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 5e72df1..a067ba5 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -372,6 +372,7 @@ _outModifyTable(StringInfo str, const ModifyTable *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(partKeyUpdated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_INT_FIELD(resultRelIndex);
 	WRITE_INT_FIELD(rootResultRelIndex);
@@ -2105,6 +2106,7 @@ _outModifyTablePath(StringInfo str, const ModifyTablePath *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(partKeyUpdated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_NODE_FIELD(subroots);
@@ -2527,6 +2529,7 @@ _outPartitionedChildRelInfo(StringInfo str, const PartitionedChildRelInfo *node)
 
 	WRITE_UINT_FIELD(parent_relid);
 	WRITE_NODE_FIELD(child_rels);
+	WRITE_BOOL_FIELD(is_partition_key_update);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 9925866..ea383cc 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1568,6 +1568,7 @@ _readModifyTable(void)
 	READ_BOOL_FIELD(canSetTag);
 	READ_UINT_FIELD(nominalRelation);
 	READ_NODE_FIELD(partitioned_rels);
+	READ_BOOL_FIELD(partKeyUpdated);
 	READ_NODE_FIELD(resultRelations);
 	READ_INT_FIELD(resultRelIndex);
 	READ_INT_FIELD(rootResultRelIndex);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 12a6ee4..f509359 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1364,7 +1364,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 			case RTE_RELATION:
 				if (rte->relkind == RELKIND_PARTITIONED_TABLE)
 					partitioned_rels =
-						get_partitioned_child_rels(root, rel->relid);
+						get_partitioned_child_rels(root, rel->relid, NULL);
 				break;
 			case RTE_SUBQUERY:
 				build_partitioned_rels = true;
@@ -1403,7 +1403,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		{
 			List	   *cprels;
 
-			cprels = get_partitioned_child_rels(root, childrel->relid);
+			cprels = get_partitioned_child_rels(root, childrel->relid, NULL);
 			partitioned_rels = list_concat(partitioned_rels,
 										   list_copy(cprels));
 		}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e599283..4ceaf17 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -279,6 +279,7 @@ static ProjectSet *make_project_set(List *tlist, Plan *subplan);
 static ModifyTable *make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool partKeyUpdated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
@@ -2373,6 +2374,7 @@ create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 							best_path->canSetTag,
 							best_path->nominalRelation,
 							best_path->partitioned_rels,
+							best_path->partKeyUpdated,
 							best_path->resultRelations,
 							subplans,
 							best_path->withCheckOptionLists,
@@ -6442,6 +6444,7 @@ static ModifyTable *
 make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool partKeyUpdated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam)
@@ -6468,6 +6471,7 @@ make_modifytable(PlannerInfo *root,
 	node->canSetTag = canSetTag;
 	node->nominalRelation = nominalRelation;
 	node->partitioned_rels = partitioned_rels;
+	node->partKeyUpdated = partKeyUpdated;
 	node->resultRelations = resultRelations;
 	node->resultRelIndex = -1;	/* will be set correctly in setrefs.c */
 	node->rootResultRelIndex = -1;	/* will be set correctly in setrefs.c */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 7b52dad..66b8356 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1101,6 +1101,7 @@ inheritance_planner(PlannerInfo *root)
 	Query	   *parent_parse;
 	Bitmapset  *parent_relids = bms_make_singleton(top_parentRTindex);
 	PlannerInfo **parent_roots = NULL;
+	bool		partKeyUpdated = false;
 
 	Assert(parse->commandType != CMD_INSERT);
 
@@ -1172,7 +1173,8 @@ inheritance_planner(PlannerInfo *root)
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		nominalRelation = top_parentRTindex;
-		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex);
+		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex,
+													  &partKeyUpdated);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
 	}
@@ -1512,6 +1514,7 @@ inheritance_planner(PlannerInfo *root)
 									 parse->canSetTag,
 									 nominalRelation,
 									 partitioned_rels,
+									 partKeyUpdated,
 									 resultRelations,
 									 subpaths,
 									 subroots,
@@ -2123,6 +2126,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 										parse->canSetTag,
 										parse->resultRelation,
 										NIL,
+										false,
 										list_make1_int(parse->resultRelation),
 										list_make1(path),
 										list_make1(root),
@@ -6155,17 +6159,22 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 /*
  * get_partitioned_child_rels
  *		Returns a list of the RT indexes of the partitioned child relations
- *		with rti as the root parent RT index.
+ *		with rti as the root parent RT index. If non-NULL, *is_partition_key_update
+ *		is set to true iff any of the root rte's updated columns is a partition key.
  *
  * Note: This function might get called even for range table entries that
  * are not partitioned tables; in such a case, it will simply return NIL.
  */
 List *
-get_partitioned_child_rels(PlannerInfo *root, Index rti)
+get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   bool *is_partition_key_update)
 {
 	List	   *result = NIL;
 	ListCell   *l;
 
+	if (is_partition_key_update)
+		*is_partition_key_update = false;
+
 	foreach(l, root->pcinfo_list)
 	{
 		PartitionedChildRelInfo *pc = lfirst_node(PartitionedChildRelInfo, l);
@@ -6173,6 +6182,8 @@ get_partitioned_child_rels(PlannerInfo *root, Index rti)
 		if (pc->parent_relid == rti)
 		{
 			result = pc->child_rels;
+			if (is_partition_key_update)
+				*is_partition_key_update = pc->is_partition_key_update;
 			break;
 		}
 	}
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 95557d7..ef5ccf4 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -105,7 +105,8 @@ static void expand_partitioned_rtentry(PlannerInfo *root,
 						   RangeTblEntry *parentrte,
 						   Index parentRTindex, Relation parentrel,
 						   PlanRowMark *top_parentrc, LOCKMODE lockmode,
-						   List **appinfos, List **partitioned_child_rels);
+						   List **appinfos, List **partitioned_child_rels,
+						   bool *is_partition_key_update);
 static void expand_single_inheritance_child(PlannerInfo *root,
 								RangeTblEntry *parentrte,
 								Index parentRTindex, Relation parentrel,
@@ -1461,16 +1462,19 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	if (RelationGetPartitionDesc(oldrelation) != NULL)
 	{
 		List	   *partitioned_child_rels = NIL;
+		bool		is_partition_key_update = false;
 
 		Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
 
 		/*
 		 * If this table has partitions, recursively expand them in the order
-		 * in which they appear in the PartitionDesc.
+		 * in which they appear in the PartitionDesc.  While at it, also note
+		 * whether any partition key columns are being updated.
 		 */
 		expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
 								   lockmode, &root->append_rel_list,
-								   &partitioned_child_rels);
+								   &partitioned_child_rels,
+								   &is_partition_key_update);
 
 		/*
 		 * We keep a list of objects in root, each of which maps a root
@@ -1487,6 +1491,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 			pcinfo = makeNode(PartitionedChildRelInfo);
 			pcinfo->parent_relid = rti;
 			pcinfo->child_rels = partitioned_child_rels;
+			pcinfo->is_partition_key_update = is_partition_key_update;
 			root->pcinfo_list = lappend(root->pcinfo_list, pcinfo);
 		}
 	}
@@ -1563,7 +1568,8 @@ static void
 expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 						   Index parentRTindex, Relation parentrel,
 						   PlanRowMark *top_parentrc, LOCKMODE lockmode,
-						   List **appinfos, List **partitioned_child_rels)
+						   List **appinfos, List **partitioned_child_rels,
+						   bool *is_partition_key_update)
 {
 	int			i;
 	RangeTblEntry *childrte;
@@ -1578,6 +1584,17 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 
 	Assert(parentrte->inh);
 
+	/*
+	 * Note down whether any partition key cols are being updated. Though it's
+	 * the root partitioned table's updatedCols we are interested in, we
+	 * instead use parentrte to get the updatedCols. This is convenient because
+	 * parentrte already has the root partrel's updatedCols translated to match
+	 * the attribute ordering of parentrel.
+	 */
+	if (!*is_partition_key_update)
+		*is_partition_key_update =
+			has_partition_attrs(parentrel, parentrte->updatedCols, NULL);
+
 	/* First expand the partitioned table itself. */
 	expand_single_inheritance_child(root, parentrte, parentRTindex, parentrel,
 									top_parentrc, parentrel,
@@ -1617,7 +1634,8 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 		if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 			expand_partitioned_rtentry(root, childrte, childRTindex,
 									   childrel, top_parentrc, lockmode,
-									   appinfos, partitioned_child_rels);
+									   appinfos, partitioned_child_rels,
+									   is_partition_key_update);
 
 		/* Close child relation, but keep locks */
 		heap_close(childrel, NoLock);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 48b4db7..2df4a4c 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3274,6 +3274,8 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partitioned_rels' is an integer list of RT indexes of non-leaf tables in
  *		the partition tree, if this is an UPDATE/DELETE to a partitioned table.
  *		Otherwise NIL.
+ * 'partKeyUpdated' is true if any partitioning columns are being updated,
+ *		either from the target relation or a descendant partitioned table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
  * 'subpaths' is a list of Path(s) producing source data (one per rel)
  * 'subroots' is a list of PlannerInfo structs (one per rel)
@@ -3287,6 +3289,7 @@ ModifyTablePath *
 create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool partKeyUpdated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
@@ -3354,6 +3357,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->canSetTag = canSetTag;
 	pathnode->nominalRelation = nominalRelation;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
+	pathnode->partKeyUpdated = partKeyUpdated;
 	pathnode->resultRelations = resultRelations;
 	pathnode->subpaths = subpaths;
 	pathnode->subroots = subroots;
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index b5df357..aea71f5 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -67,6 +67,9 @@ typedef struct PartitionDispatchData *PartitionDispatch;
  *								convert input tuple based on the root table's
  *								rowtype to a leaf partition's rowtype after
  *								tuple routing is done)
+ * subplan_partition_offsets	Array of indexes, one per UPDATE subplan. Each
+ *								element holds the index of the corresponding
+ *								partition in the 'partitions' array.
  * partition_tuple_slot			TupleTableSlot to be used to manipulate any
  *								given leaf partition's rowtype after that
  *								partition is chosen for insertion by
@@ -80,7 +83,9 @@ typedef struct PartitionTupleRouting
 	ResultRelInfo **partitions;
 	int			num_partitions;
 	TupleConversionMap **partition_tupconv_maps;
+	int		   *subplan_partition_offsets;
 	TupleTableSlot *partition_tuple_slot;
+	TupleTableSlot *root_tuple_slot;
 } PartitionTupleRouting;
 
 extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
@@ -90,6 +95,11 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
-extern void ExecCleanupTupleRouting(PartitionTupleRouting *proute);
+extern HeapTuple ConvertPartitionTupleSlot(TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot,
+						  TupleTableSlot **p_my_slot);
+extern void ExecCleanupTupleRouting(ModifyTableState *mtstate,
+						PartitionTupleRouting *proute);
 
 #endif							/* EXECPARTITION_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 4bb5cb1..8b5391d 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -991,8 +991,9 @@ typedef struct ModifyTableState
 	/* controls transition table population for specified operation */
 	struct TransitionCaptureState *mt_oc_transition_capture;
 	/* controls transition table population for INSERT...ON CONFLICT UPDATE */
-	TupleConversionMap **mt_transition_tupconv_maps;
-	/* Per plan/partition tuple conversion */
+	TupleConversionMap **mt_childparent_tupconv_maps;
+	/* Per plan/partition map for tuple conversion from child to root */
+	bool		mt_is_tupconv_perpart;	/* Is the above map per-partition? */
 } ModifyTableState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 74e9fb5..a9e6d45 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -219,6 +219,7 @@ typedef struct ModifyTable
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		partKeyUpdated;	/* some part key in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	int			resultRelIndex; /* index of first resultRel in plan's list */
 	int			rootResultRelIndex; /* index of the partitioned table root */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 71689b8..9b2fd5f 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1674,6 +1674,7 @@ typedef struct ModifyTablePath
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		partKeyUpdated;	/* some part key in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	List	   *subpaths;		/* Path(s) producing source data */
 	List	   *subroots;		/* per-target-table PlannerInfos */
@@ -2124,6 +2125,9 @@ typedef struct PartitionedChildRelInfo
 
 	Index		parent_relid;
 	List	   *child_rels;
+	bool		is_partition_key_update;	/* is the partition key of any of
+											 * the partitioned tables
+											 * updated? */
 } PartitionedChildRelInfo;
 
 /*
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 725694f..ef7173f 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -242,6 +242,7 @@ extern ModifyTablePath *create_modifytable_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool partKeyUpdated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index 997b91f..4445878 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -57,7 +57,8 @@ extern Expr *preprocess_phv_expression(PlannerInfo *root, Expr *expr);
 
 extern bool plan_cluster_use_sort(Oid tableOid, Oid indexOid);
 
-extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti);
+extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   bool *is_partition_key_update);
 extern List *get_partitioned_child_rels_for_join(PlannerInfo *root,
 									Relids join_relids);
 
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index b69ceaa..ee7a75a 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,36 +198,462 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+---------------------------
+-- UPDATE with row movement
+---------------------------
+-- When a partitioned table receives an UPDATE to the partition key and the
+-- new values no longer meet the partition's bound, the row must be moved to
+-- the correct partition for the new partition key (if one exists). We must
+-- also ensure that updatable views on partitioned tables properly enforce any
+-- WITH CHECK OPTION that is defined. The situation with triggers in this case
+-- also requires thorough testing, since partition key updates that cause row
+-- movement are converted into a DELETE plus an INSERT.
+CREATE TABLE range_parted (
 	a text,
-	b int
-) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
-create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
+) PARTITION BY RANGE (a, b);
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+CREATE TABLE part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+ALTER TABLE range_parted ATTACH PARTITION part_b_20_b_30 FOR VALUES FROM ('b', 20) TO ('b', 30);
+CREATE TABLE part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) PARTITION BY RANGE (c);
+CREATE TABLE part_b_1_b_10 PARTITION OF range_parted FOR VALUES FROM ('b', 1) TO ('b', 10);
+ALTER TABLE range_parted ATTACH PARTITION part_b_10_b_20 FOR VALUES FROM ('b', 10) TO ('b', 20);
+CREATE TABLE part_a_10_a_20 PARTITION OF range_parted FOR VALUES FROM ('a', 10) TO ('a', 20);
+CREATE TABLE part_a_1_a_10 PARTITION OF range_parted FOR VALUES FROM ('a', 1) TO ('a', 10);
+-- Check that partition-key UPDATE works sanely on a partitioned table that
+-- does not have any child partitions.
+UPDATE part_b_10_b_20 set b = b - 6;
+-- Create some more partitions following the above pattern of descending bound
+-- order, but let's make the situation a bit more complex by having the
+-- attribute numbers of the columns differ from those of their parent.
+CREATE TABLE part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) PARTITION BY range (abs(d));
+ALTER TABLE part_c_100_200 DROP COLUMN e, DROP COLUMN c, DROP COLUMN a;
+ALTER TABLE part_c_100_200 ADD COLUMN c numeric, ADD COLUMN e varchar, ADD COLUMN a text;
+ALTER TABLE part_c_100_200 DROP COLUMN b;
+ALTER TABLE part_c_100_200 ADD COLUMN b bigint;
+CREATE TABLE part_d_1_15 PARTITION OF part_c_100_200 FOR VALUES FROM (1) TO (15);
+CREATE TABLE part_d_15_20 PARTITION OF part_c_100_200 FOR VALUES FROM (15) TO (20);
+ALTER TABLE part_b_10_b_20 ATTACH PARTITION part_c_100_200 FOR VALUES FROM (100) TO (200);
+CREATE TABLE part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+ALTER TABLE part_b_10_b_20 ATTACH PARTITION part_c_1_100 FOR VALUES FROM (1) TO (100);
+\set init_range_parted 'truncate range_parted; insert into range_parted VALUES (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted ORDER BY 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+(6 rows)
+
+-- The order of subplans should be in bound order
+EXPLAIN (costs off) UPDATE range_parted set c = c - 50 WHERE c > 97;
+             QUERY PLAN              
+-------------------------------------
+ Update on range_parted
+   Update on part_a_1_a_10
+   Update on part_a_10_a_20
+   Update on part_b_1_b_10
+   Update on part_c_1_100
+   Update on part_d_1_15
+   Update on part_d_15_20
+   Update on part_b_20_b_30
+   ->  Seq Scan on part_a_1_a_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_a_10_a_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_1_b_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_1_100
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_1_15
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_15_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_20_b_30
+         Filter: (c > '97'::numeric)
+(22 rows)
+
+-- fail, row movement happens only within the partition subtree.
+UPDATE part_c_100_200 set c = c - 20, d = c WHERE c = 105;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (105, 85, null, b, 15).
+-- fail, no partition key update, so no attempt to move tuple,
+-- but "a = 'a'" violates the partition constraint enforced by the root partition
+UPDATE part_b_10_b_20 set a = 'a';
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (null, 1, 96, 12, a).
+-- ok, partition key update, no constraint violation
+UPDATE range_parted set d = d - 10 WHERE d > 10;
+-- ok, no partition key update, no constraint violation
+UPDATE range_parted set e = d;
+-- No row found
+UPDATE part_c_1_100 set c = c + 20 WHERE c = 98;
+-- ok, row movement
+UPDATE part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d | e 
+----------------+---+----+-----+---+---
+ part_a_10_a_20 | a | 10 | 200 | 1 | 1
+ part_a_1_a_10  | a |  1 |   1 | 1 | 1
+ part_d_1_15    | b | 12 | 116 | 1 | 1
+ part_d_1_15    | b | 13 | 117 | 2 | 2
+ part_d_1_15    | b | 15 | 125 | 6 | 6
+ part_d_1_15    | b | 17 | 125 | 9 | 9
+(6 rows)
+
+-- fail, row movement happens only within the partition subtree.
+UPDATE part_b_10_b_20 set b = b - 6 WHERE c > 116 returning *;
+ERROR:  new row for relation "part_d_1_15" violates partition constraint
+DETAIL:  Failing row contains (2, 117, 2, b, 7).
+-- ok, row movement, with subset of rows moved into different partition.
+UPDATE range_parted set b = b - 6 WHERE c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_a_1_a_10 | a |  4 | 200 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+(6 rows)
+
+-- Common table needed for multiple test scenarios.
+CREATE TABLE mintab(c1 int);
+INSERT into mintab VALUES (120);
+-- update partition key using updatable view.
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 FROM mintab) WITH CHECK OPTION;
 -- ok
-update range_parted set b = b + 1 where b = 10;
+UPDATE upview set c = 199 WHERE b = 4;
+-- fail, check option violation
+UPDATE upview set c = 120 WHERE b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120, 1, 1).
+-- fail, row movement with check option violation
+UPDATE upview set a = 'b', b = 15, c = 120 WHERE b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120, 1, 1).
+-- ok, row movement, check option passes
+UPDATE upview set a = 'b', b = 15 WHERE b = 4;
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+ part_d_1_15   | b | 15 | 199 | 1 | 1
+(6 rows)
+
+-- cleanup
+DROP VIEW upview;
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+UPDATE range_parted set c = 95 WHERE a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+ range_parted  | a | b  | c  | d  | e 
+---------------+---+----+----+----+---
+ (b,15,95,16,) | b | 15 | 95 | 16 | 
+ (b,17,95,19,) | b | 17 | 95 | 19 | 
+(2 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  95 | 16 | 
+ part_c_1_100   | b | 17 |  95 | 19 | 
+(6 rows)
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+CREATE FUNCTION trans_updatetrigfunc() RETURNS trigger LANGUAGE plpgsql AS
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' ORDER BY a) FROM old_table),
+                 (select string_agg(new_table::text, ', ' ORDER BY a) FROM new_table);
+    return null;
+  end;
+$$;
+CREATE TRIGGER trans_updatetrig
+  AFTER UPDATE ON range_parted REFERENCING OLD TABLE AS old_table NEW TABLE AS new_table
+  FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+UPDATE range_parted set c = (case when c = 96 then 110 else c + 1 end ) WHERE a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,110,1,), (b,13,98,2,), (b,15,106,16,), (b,17,106,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 13 |  98 |  2 | 
+ part_d_15_20   | b | 15 | 106 | 16 | 
+ part_d_15_20   | b | 17 | 106 | 19 | 
+ part_d_1_15    | b | 12 | 110 |  1 | 
+(6 rows)
+
+:init_range_parted;
+-- Enabling OLD TABLE capture for both DELETE and UPDATE statement triggers
+-- should not cause deleted rows to be captured twice. The same applies to
+-- INSERT triggers and inserted rows.
+CREATE TRIGGER trans_deletetrig
+  AFTER DELETE ON range_parted REFERENCING OLD TABLE AS old_table
+  FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+CREATE TRIGGER trans_inserttrig
+  AFTER INSERT ON range_parted REFERENCING NEW TABLE AS new_table
+  FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+UPDATE range_parted set c = c + 50 WHERE a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,146,1,), (b,13,147,2,), (b,15,155,16,), (b,17,155,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_15_20   | b | 15 | 155 | 16 | 
+ part_d_15_20   | b | 17 | 155 | 19 | 
+ part_d_1_15    | b | 12 | 146 |  1 | 
+ part_d_1_15    | b | 13 | 147 |  2 | 
+(6 rows)
+
+DROP TRIGGER trans_deletetrig ON range_parted;
+DROP TRIGGER trans_inserttrig ON range_parted;
+-- Don't drop trans_updatetrig yet. It is required below.
+-- Test with transition tuple conversion happening for rows moved into the
+-- new partition. This requires a trigger that references a transition table
+-- (we already have trans_updatetrig). For inserted rows, the conversion is
+-- usually not needed, because the original tuple is already compatible with
+-- the desired transition tuple format. But conversion does happen when there
+-- is a BR trigger, because the trigger can change the inserted row. So we
+-- need to install BR triggers on those child partitions to which rows are
+-- moved as part of update-row-movement.
+CREATE FUNCTION func_parted_mod_b() RETURNS trigger AS $$
+BEGIN
+   NEW.b = NEW.b + 1;
+   return NEW;
+END $$ language plpgsql;
+CREATE TRIGGER trig_c1_100 BEFORE UPDATE OR INSERT ON part_c_1_100
+   FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+CREATE TRIGGER trig_d1_15 BEFORE UPDATE OR INSERT ON part_d_1_15
+   FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+CREATE TRIGGER trig_d15_20 BEFORE UPDATE OR INSERT ON part_d_15_20
+   FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+:init_range_parted;
+UPDATE range_parted set c = (case when c = 96 then 110 else c + 1 end) WHERE a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,13,96,1,), (b,14,97,2,), (b,16,105,16,), (b,18,105,19,), new table = (b,15,110,1,), (b,15,98,2,), (b,17,106,16,), (b,19,106,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 15 |  98 |  2 | 
+ part_d_15_20   | b | 17 | 106 | 16 | 
+ part_d_15_20   | b | 19 | 106 | 19 | 
+ part_d_1_15    | b | 15 | 110 |  1 | 
+(6 rows)
+
+:init_range_parted;
+UPDATE range_parted set c = c + 50 WHERE a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,13,96,1,), (b,14,97,2,), (b,16,105,16,), (b,18,105,19,), new table = (b,15,146,1,), (b,16,147,2,), (b,17,155,16,), (b,19,155,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_15_20   | b | 17 | 155 | 16 | 
+ part_d_15_20   | b | 19 | 155 | 19 | 
+ part_d_1_15    | b | 15 | 146 |  1 | 
+ part_d_1_15    | b | 16 | 147 |  2 | 
+(6 rows)
+
+DROP TRIGGER trans_updatetrig ON range_parted;
+DROP TRIGGER trig_c1_100 ON part_c_1_100;
+DROP TRIGGER trig_d1_15 ON part_d_1_15;
+DROP TRIGGER trig_d15_20 ON part_d_15_20;
+DROP FUNCTION func_parted_mod_b();
+-- RLS policies with update-row-movement
+-----------------------------------------
+ALTER TABLE range_parted ENABLE ROW LEVEL SECURITY;
+CREATE USER regress_range_parted_user;
+GRANT ALL ON range_parted, mintab TO regress_range_parted_user;
+CREATE POLICY seeall ON range_parted AS PERMISSIVE FOR SELECT USING (true);
+CREATE POLICY policy_range_parted ON range_parted for UPDATE USING (true) WITH CHECK (c % 2 = 0);
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- This should fail with RLS violation error while moving row from
+-- part_a_10_a_20 to part_d_1_15, because we are setting 'c' to an odd number.
+UPDATE range_parted set a = 'b', c = 151 WHERE a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy for table "range_parted"
+RESET SESSION AUTHORIZATION;
+-- Create a trigger on part_d_1_15
+CREATE FUNCTION func_d_1_15() RETURNS trigger AS $$
+BEGIN
+   NEW.c = NEW.c + 1; -- Make even numbers odd, or vice versa
+   return NEW;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_d_1_15 BEFORE INSERT ON part_d_1_15
+   FOR EACH ROW EXECUTE PROCEDURE func_d_1_15();
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- Here, RLS checks should succeed while moving row from part_a_10_a_20 to
+-- part_d_1_15. Even though the UPDATE is setting 'c' to an odd number, the
+-- trigger at the destination partition again makes it an even number.
+UPDATE range_parted set a = 'b', c = 151 WHERE a = 'a' and c = 200;
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- This should fail with an RLS violation error. Even though the UPDATE is
+-- setting 'c' to an even number, the trigger at the destination partition
+-- again makes it an odd number.
+UPDATE range_parted set a = 'b', c = 150 WHERE a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy for table "range_parted"
+-- Cleanup
+RESET SESSION AUTHORIZATION;
+DROP TRIGGER trig_d_1_15 ON part_d_1_15;
+DROP FUNCTION func_d_1_15();
+-- Policy expression contains SubPlan
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+CREATE POLICY policy_range_parted_subplan on range_parted AS RESTRICTIVE for UPDATE USING (true)
+    WITH CHECK ((SELECT range_parted.c <= c1 FROM mintab ));
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- fail, mintab has row with c1 = 120
+UPDATE range_parted set a = 'b', c = 122 WHERE a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy "policy_range_parted_subplan" for table "range_parted"
+-- ok
+UPDATE range_parted set a = 'b', c = 120 WHERE a = 'a' and c = 200;
+-- RLS policy expression contains whole row.
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+CREATE POLICY policy_range_parted_wholerow on range_parted AS RESTRICTIVE for UPDATE USING (true)
+   WITH CHECK (range_parted = row('b', 10, 112, 1, NULL)::range_parted);
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- ok, should pass the RLS check
+UPDATE range_parted set a = 'b', c = 112 WHERE a = 'a' and c = 200;
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- fail, the whole row RLS check should fail
+UPDATE range_parted set a = 'b', c = 116 WHERE a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy "policy_range_parted_wholerow" for table "range_parted"
+-- Cleanup
+RESET SESSION AUTHORIZATION;
+DROP POLICY policy_range_parted ON range_parted;
+DROP POLICY policy_range_parted_subplan ON range_parted;
+DROP POLICY policy_range_parted_wholerow ON range_parted;
+REVOKE ALL ON range_parted, mintab FROM regress_range_parted_user;
+DROP USER regress_range_parted_user;
+DROP TABLE mintab;
+-- statement triggers with update row movement
+---------------------------------------------------
+:init_range_parted;
+CREATE FUNCTION trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+CREATE TRIGGER parent_delete_trig
+  AFTER DELETE ON range_parted for each statement execute procedure trigfunc();
+CREATE TRIGGER parent_update_trig
+  AFTER UPDATE ON range_parted for each statement execute procedure trigfunc();
+CREATE TRIGGER parent_insert_trig
+  AFTER INSERT ON range_parted for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_1_100
+CREATE TRIGGER c1_delete_trig
+  AFTER DELETE ON part_c_1_100 for each statement execute procedure trigfunc();
+CREATE TRIGGER c1_update_trig
+  AFTER UPDATE ON part_c_1_100 for each statement execute procedure trigfunc();
+CREATE TRIGGER c1_insert_trig
+  AFTER INSERT ON part_c_1_100 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_1_15
+CREATE TRIGGER d1_delete_trig
+  AFTER DELETE ON part_d_1_15 for each statement execute procedure trigfunc();
+CREATE TRIGGER d1_update_trig
+  AFTER UPDATE ON part_d_1_15 for each statement execute procedure trigfunc();
+CREATE TRIGGER d1_insert_trig
+  AFTER INSERT ON part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+CREATE TRIGGER d15_delete_trig
+  AFTER DELETE ON part_d_15_20 for each statement execute procedure trigfunc();
+CREATE TRIGGER d15_update_trig
+  AFTER UPDATE ON part_d_15_20 for each statement execute procedure trigfunc();
+CREATE TRIGGER d15_insert_trig
+  AFTER INSERT ON part_d_15_20 for each statement execute procedure trigfunc();
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or
+-- insert statement triggers should be fired.
+UPDATE range_parted set c = c - 50 WHERE c > 97;
+NOTICE:  trigger = parent_update_trig fired on table range_parted during UPDATE
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 150 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  55 | 16 | 
+ part_c_1_100   | b | 17 |  55 | 19 | 
+(6 rows)
+
+DROP TRIGGER parent_delete_trig ON range_parted;
+DROP TRIGGER parent_update_trig ON range_parted;
+DROP TRIGGER parent_insert_trig ON range_parted;
+DROP TRIGGER c1_delete_trig ON part_c_1_100;
+DROP TRIGGER c1_update_trig ON part_c_1_100;
+DROP TRIGGER c1_insert_trig ON part_c_1_100;
+DROP TRIGGER d1_delete_trig ON part_d_1_15;
+DROP TRIGGER d1_update_trig ON part_d_1_15;
+DROP TRIGGER d1_insert_trig ON part_d_1_15;
+DROP TRIGGER d15_delete_trig ON part_d_15_20;
+DROP TRIGGER d15_update_trig ON part_d_15_20;
+DROP TRIGGER d15_insert_trig ON part_d_15_20;
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
-                                  Table "public.part_def"
- Column |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description 
---------+---------+-----------+----------+---------+----------+--------------+-------------
- a      | text    |           |          |         | extended |              | 
- b      | integer |           |          |         | plain    |              | 
+                                       Table "public.part_def"
+ Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description 
+--------+-------------------+-----------+----------+---------+----------+--------------+-------------
+ a      | text              |           |          |         | extended |              | 
+ b      | bigint            |           |          |         | plain    |              | 
+ c      | numeric           |           |          |         | main     |              | 
+ d      | integer           |           |          |         | plain    |              | 
+ e      | character varying |           |          |         | extended |              | 
 Partition of: range_parted DEFAULT
-Partition constraint: (NOT ((a IS NOT NULL) AND (b IS NOT NULL) AND (((a = 'a'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'a'::text) AND (b >= 10) AND (b < 20)) OR ((a = 'b'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'b'::text) AND (b >= 10) AND (b < 20)))))
+Partition constraint: (NOT ((a IS NOT NULL) AND (b IS NOT NULL) AND (((a = 'a'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'a'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'b'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '20'::bigint) AND (b < '30'::bigint)))))
 
 insert into range_parted values ('c', 9);
 -- ok
@@ -235,21 +661,192 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 ERROR:  new row for relation "part_def" violates partition constraint
-DETAIL:  Failing row contains (a, 9).
-create table list_parted (
+DETAIL:  Failing row contains (a, 9, null, null, null).
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from non-default to default partition.
+-- fail, default partition is not under part_a_10_a_20.
+UPDATE part_a_10_a_20 set a = 'ad' WHERE a = 'a';
+ERROR:  new row for relation "part_a_10_a_20" violates partition constraint
+DETAIL:  Failing row contains (ad, 10, 200, 1, null).
+-- ok
+UPDATE range_parted set a = 'ad' WHERE a = 'a';
+UPDATE range_parted set a = 'bd' WHERE a = 'b';
+:show_data;
+ partname | a  | b  |  c  | d  | e 
+----------+----+----+-----+----+---
+ part_def | ad |  1 |   1 |  1 | 
+ part_def | ad | 10 | 200 |  1 | 
+ part_def | bd | 12 |  96 |  1 | 
+ part_def | bd | 13 |  97 |  2 | 
+ part_def | bd | 15 | 105 | 16 | 
+ part_def | bd | 17 | 105 | 19 | 
+ part_def | d  |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from default to non-default partitions.
+-- ok
+UPDATE range_parted set a = 'a' WHERE a = 'ad';
+UPDATE range_parted set a = 'b' WHERE a = 'bd';
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
+-- Cleanup: range_parted no longer needed.
+DROP TABLE range_parted;
+CREATE TABLE list_parted (
 	a text,
 	b int
-) partition by list (a);
-create table list_part1  partition of list_parted for values in ('a', 'b');
-create table list_default partition of list_parted default;
-insert into list_part1 values ('a', 1);
-insert into list_default values ('d', 10);
+) PARTITION BY list (a);
+CREATE TABLE list_part1  PARTITION OF list_parted for VALUES in ('a', 'b');
+CREATE TABLE list_default PARTITION OF list_parted default;
+INSERT into list_part1 VALUES ('a', 1);
+INSERT into list_default VALUES ('d', 10);
 -- fail
-update list_default set a = 'a' where a = 'd';
+UPDATE list_default set a = 'a' WHERE a = 'd';
 ERROR:  new row for relation "list_default" violates partition constraint
 DETAIL:  Failing row contains (a, 10).
 -- ok
-update list_default set a = 'x' where a = 'd';
+UPDATE list_default set a = 'x' WHERE a = 'd';
+DROP TABLE list_parted;
+--------------
+-- Some more update-partition-key test scenarios below. This time use list
+-- partitions.
+--------------
+-- Setup for list partitions
+CREATE TABLE list_parted (a numeric, b int, c int8) PARTITION BY list (a);
+CREATE TABLE sub_parted PARTITION OF list_parted for VALUES in (1) PARTITION BY list (b);
+CREATE TABLE sub_part1(b int, c int8, a numeric);
+ALTER TABLE sub_parted ATTACH PARTITION sub_part1 for VALUES in (1);
+CREATE TABLE sub_part2(b int, c int8, a numeric);
+ALTER TABLE sub_parted ATTACH PARTITION sub_part2 for VALUES in (2);
+CREATE TABLE list_part1(a numeric, b int, c int8);
+ALTER TABLE list_parted ATTACH PARTITION list_part1 for VALUES in (2,3);
+INSERT into list_parted VALUES (2,5,50);
+INSERT into list_parted VALUES (3,6,60);
+INSERT into sub_parted VALUES (1,1,60);
+INSERT into sub_parted VALUES (1,2,10);
+-- Test partition constraint violation when an intermediate ancestor is used
+-- and the constraint is inherited from the upper root.
+UPDATE sub_parted set a = 2 WHERE c = 10;
+ERROR:  new row for relation "sub_part2" violates partition constraint
+DETAIL:  Failing row contains (2, 10, 2).
+-- Test update-partition-key where the unpruned partitions do not have their
+-- partition keys updated.
+SELECT tableoid::regclass::text , * FROM list_parted WHERE a = 2 ORDER BY 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+UPDATE list_parted set b = c + a WHERE a = 2;
+SELECT tableoid::regclass::text , * FROM list_parted WHERE a = 2 ORDER BY 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Tests for BR UPDATE triggers changing the partition key.
+-----------
+CREATE FUNCTION func_parted_mod_b() returns trigger as $$
+BEGIN
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+-- This should perform tuple routing even though there is no explicit
+-- partition-key update, because the trigger on sub_part1 modifies the
+-- partition key.
+UPDATE list_parted set c = 70 WHERE b  = 1;
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+DROP TRIGGER parted_mod_b ON sub_part1;
+-- If a BR DELETE trigger prevents the DELETE from happening, we should also
+-- skip the INSERT when that DELETE is part of UPDATE => DELETE+INSERT.
+CREATE OR REPLACE FUNCTION func_parted_mod_b() returns trigger as $$
+BEGIN
+   raise notice 'Trigger: Got OLD row %, but returning NULL', OLD;
+   return NULL;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_skip_delete before delete on sub_part2
+   for each row execute procedure func_parted_mod_b();
+UPDATE list_parted set b = 1 WHERE c = 70;
+NOTICE:  Trigger: Got OLD row (2,70,1), but returning NULL
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+-- Drop the trigger. Now the row should be moved.
+DROP TRIGGER trig_skip_delete ON sub_part2;
+UPDATE list_parted set b = 1 WHERE c = 70;
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+DROP FUNCTION func_parted_mod_b();
+-- UPDATE of the partition key with a FROM clause. If the join produces
+-- multiple output rows for the same row to be modified, we should tuple-route
+-- the row only once, and no extra rows should be inserted.
+CREATE TABLE non_parted (id int);
+INSERT into non_parted VALUES (1), (1), (1), (2), (2), (2), (3), (3), (3);
+UPDATE list_parted t1 set a = 2 FROM non_parted t2 WHERE t1.a = t2.id and a = 1;
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 |  1 | 70
+ list_part1 | 2 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+DROP TABLE non_parted;
+-- Cleanup: list_parted no longer needed.
+DROP TABLE list_parted;
 -- create custom operator class and hash function, for the same reason
 -- explained in alter_table.sql
 create or replace function dummy_hashint4(a int4, seed int8) returns int8 as
@@ -271,14 +868,11 @@ insert into hpart4 values (3, 4);
 update hpart1 set a = 3, b=4 where a = 1;
 ERROR:  new row for relation "hpart1" violates partition constraint
 DETAIL:  Failing row contains (3, 4).
+-- ok, row movement
 update hash_parted set b = b - 1 where b = 1;
-ERROR:  new row for relation "hpart1" violates partition constraint
-DETAIL:  Failing row contains (1, 0).
 -- ok
 update hash_parted set b = b + 8 where b = 1;
 -- cleanup
-drop table range_parted;
-drop table list_parted;
 drop table hash_parted;
 drop operator class custom_opclass using hash;
 drop function dummy_hashint4(a int4, seed int8);
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 0c70d64..f316446 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,25 +107,330 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+
+---------------------------
+-- UPDATE with row movement
+---------------------------
+
+-- When a partitioned table receives an UPDATE to its partition key and the
+-- new values no longer satisfy the partition's bound, the row must be moved
+-- to the correct partition for the new partition key (if one exists). We
+-- must also ensure that updatable views on partitioned tables properly
+-- enforce any WITH CHECK OPTION that is defined. Triggers also require
+-- thorough testing here, since partition-key updates that cause row movement
+-- convert UPDATEs into DELETE+INSERT.
+
+CREATE TABLE range_parted (
 	a text,
-	b int
-) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
-create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
+) PARTITION BY RANGE (a, b);
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
+-- Create the partitions intentionally in descending bound order, to verify
+-- that update row movement works even when the leaf partitions are not in
+-- bound order.
+CREATE TABLE part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+ALTER TABLE range_parted ATTACH PARTITION part_b_20_b_30 FOR VALUES FROM ('b', 20) TO ('b', 30);
+CREATE TABLE part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) PARTITION BY RANGE (c);
+CREATE TABLE part_b_1_b_10 PARTITION OF range_parted FOR VALUES FROM ('b', 1) TO ('b', 10);
+ALTER TABLE range_parted ATTACH PARTITION part_b_10_b_20 FOR VALUES FROM ('b', 10) TO ('b', 20);
+CREATE TABLE part_a_10_a_20 PARTITION OF range_parted FOR VALUES FROM ('a', 10) TO ('a', 20);
+CREATE TABLE part_a_1_a_10 PARTITION OF range_parted FOR VALUES FROM ('a', 1) TO ('a', 10);
+
+-- Check that partition-key UPDATE works sanely on a partitioned table that
+-- does not have any child partitions.
+UPDATE part_b_10_b_20 set b = b - 6;
+
+-- Create some more partitions, following the above pattern of descending
+-- bound order, but make the situation a bit more complex by having the
+-- attribute numbers of the columns differ from those of their parent.
+CREATE TABLE part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) PARTITION BY range (abs(d));
+ALTER TABLE part_c_100_200 DROP COLUMN e, DROP COLUMN c, DROP COLUMN a;
+ALTER TABLE part_c_100_200 ADD COLUMN c numeric, ADD COLUMN e varchar, ADD COLUMN a text;
+ALTER TABLE part_c_100_200 DROP COLUMN b;
+ALTER TABLE part_c_100_200 ADD COLUMN b bigint;
+CREATE TABLE part_d_1_15 PARTITION OF part_c_100_200 FOR VALUES FROM (1) TO (15);
+CREATE TABLE part_d_15_20 PARTITION OF part_c_100_200 FOR VALUES FROM (15) TO (20);
+
+ALTER TABLE part_b_10_b_20 ATTACH PARTITION part_c_100_200 FOR VALUES FROM (100) TO (200);
+
+CREATE TABLE part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+ALTER TABLE part_b_10_b_20 ATTACH PARTITION part_c_1_100 FOR VALUES FROM (1) TO (100);
+
+\set init_range_parted 'truncate range_parted; insert into range_parted VALUES (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted ORDER BY 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+
+-- The subplans should appear in partition bound order
+EXPLAIN (costs off) UPDATE range_parted set c = c - 50 WHERE c > 97;
+
+-- fail, row movement happens only within the partition subtree.
+UPDATE part_c_100_200 set c = c - 20, d = c WHERE c = 105;
+-- fail, no partition key update, so no attempt to move tuple,
+-- but "a = 'a'" violates the partition constraint enforced by the root partition
+UPDATE part_b_10_b_20 set a = 'a';
+-- ok, partition key update, no constraint violation
+UPDATE range_parted set d = d - 10 WHERE d > 10;
+-- ok, no partition key update, no constraint violation
+UPDATE range_parted set e = d;
+-- No row found
+UPDATE part_c_1_100 set c = c + 20 WHERE c = 98;
+-- ok, row movement
+UPDATE part_b_10_b_20 set c = c + 20 returning c, b, a;
+:show_data;
+
+-- fail, row movement happens only within the partition subtree.
+UPDATE part_b_10_b_20 set b = b - 6 WHERE c > 116 returning *;
+-- ok, row movement, with subset of rows moved into different partition.
+UPDATE range_parted set b = b - 6 WHERE c > 116 returning a, b + c;
+
+:show_data;
+
+-- Common table needed for multiple test scenarios.
+CREATE TABLE mintab(c1 int);
+INSERT into mintab VALUES (120);
+
+-- update partition key using updatable view.
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 FROM mintab) WITH CHECK OPTION;
+-- ok
+UPDATE upview set c = 199 WHERE b = 4;
+-- fail, check option violation
+UPDATE upview set c = 120 WHERE b = 4;
+-- fail, row movement with check option violation
+UPDATE upview set a = 'b', b = 15, c = 120 WHERE b = 4;
+-- ok, row movement, check option passes
+UPDATE upview set a = 'b', b = 15 WHERE b = 4;
+
+:show_data;
+
+-- cleanup
+DROP VIEW upview;
+
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+UPDATE range_parted set c = 95 WHERE a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+:show_data;
+
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+
+CREATE FUNCTION trans_updatetrigfunc() RETURNS trigger LANGUAGE plpgsql AS
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' ORDER BY a) FROM old_table),
+                 (select string_agg(new_table::text, ', ' ORDER BY a) FROM new_table);
+    return null;
+  end;
+$$;
+
+CREATE TRIGGER trans_updatetrig
+  AFTER UPDATE ON range_parted REFERENCING OLD TABLE AS old_table NEW TABLE AS new_table
+  FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+
+UPDATE range_parted set c = (case when c = 96 then 110 else c + 1 end ) WHERE a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+
+-- Enabling OLD TABLE capture for both DELETE and UPDATE statement triggers
+-- should not cause DELETEd rows to be captured twice. The same applies to
+-- INSERT triggers and inserted rows.
+CREATE TRIGGER trans_deletetrig
+  AFTER DELETE ON range_parted REFERENCING OLD TABLE AS old_table
+  FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+CREATE TRIGGER trans_inserttrig
+  AFTER INSERT ON range_parted REFERENCING NEW TABLE AS new_table
+  FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+UPDATE range_parted set c = c + 50 WHERE a = 'b' and b > 10 and c >= 96;
+:show_data;
+DROP TRIGGER trans_deletetrig ON range_parted;
+DROP TRIGGER trans_inserttrig ON range_parted;
+-- Don't drop trans_updatetrig yet. It is required below.
+
+-- Test transition tuple conversion for rows moved into a new partition. This
+-- requires a trigger that references a transition table (we already have
+-- trans_updatetrig). Usually, no conversion is needed for inserted rows,
+-- because the original tuple is already compatible with the desired
+-- transition tuple format. But conversion does happen when there is a BR
+-- trigger, because the trigger can change the inserted row. So we install BR
+-- triggers on the child partitions into which rows are moved as part of
+-- update row movement.
+CREATE FUNCTION func_parted_mod_b() RETURNS trigger AS $$
+BEGIN
+   NEW.b = NEW.b + 1;
+   return NEW;
+END $$ language plpgsql;
+CREATE TRIGGER trig_c1_100 BEFORE UPDATE OR INSERT ON part_c_1_100
+   FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+CREATE TRIGGER trig_d1_15 BEFORE UPDATE OR INSERT ON part_d_1_15
+   FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+CREATE TRIGGER trig_d15_20 BEFORE UPDATE OR INSERT ON part_d_15_20
+   FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+:init_range_parted;
+UPDATE range_parted set c = (case when c = 96 then 110 else c + 1 end) WHERE a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+UPDATE range_parted set c = c + 50 WHERE a = 'b' and b > 10 and c >= 96;
+:show_data;
+DROP TRIGGER trans_updatetrig ON range_parted;
+DROP TRIGGER trig_c1_100 ON part_c_1_100;
+DROP TRIGGER trig_d1_15 ON part_d_1_15;
+DROP TRIGGER trig_d15_20 ON part_d_15_20;
+DROP FUNCTION func_parted_mod_b();
+
+-- RLS policies with update-row-movement
+-----------------------------------------
+
+ALTER TABLE range_parted ENABLE ROW LEVEL SECURITY;
+CREATE USER regress_range_parted_user;
+GRANT ALL ON range_parted, mintab TO regress_range_parted_user;
+CREATE POLICY seeall ON range_parted AS PERMISSIVE FOR SELECT USING (true);
+CREATE POLICY policy_range_parted ON range_parted for UPDATE USING (true) WITH CHECK (c % 2 = 0);
+
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- This should fail with an RLS violation error while moving the row from
+-- part_a_10_a_20 to part_d_1_15, because we are setting 'c' to an odd number.
+UPDATE range_parted set a = 'b', c = 151 WHERE a = 'a' and c = 200;
+
+RESET SESSION AUTHORIZATION;
+-- Create a trigger on part_d_1_15
+CREATE FUNCTION func_d_1_15() RETURNS trigger AS $$
+BEGIN
+   NEW.c = NEW.c + 1; -- Make even numbers odd, or vice versa
+   return NEW;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_d_1_15 BEFORE INSERT ON part_d_1_15
+   FOR EACH ROW EXECUTE PROCEDURE func_d_1_15();
+
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+
+-- Here, RLS checks should succeed while moving row from part_a_10_a_20 to
+-- part_d_1_15. Even though the UPDATE is setting 'c' to an odd number, the
+-- trigger at the destination partition again makes it an even number.
+UPDATE range_parted set a = 'b', c = 151 WHERE a = 'a' and c = 200;
+
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- This should fail with an RLS violation error. Even though the UPDATE is
+-- setting 'c' to an even number, the trigger at the destination partition
+-- again makes it an odd number.
+UPDATE range_parted set a = 'b', c = 150 WHERE a = 'a' and c = 200;
+
+-- Cleanup
+RESET SESSION AUTHORIZATION;
+DROP TRIGGER trig_d_1_15 ON part_d_1_15;
+DROP FUNCTION func_d_1_15();
+
+-- Policy expression contains SubPlan
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+CREATE POLICY policy_range_parted_subplan on range_parted AS RESTRICTIVE for UPDATE USING (true)
+    WITH CHECK ((SELECT range_parted.c <= c1 FROM mintab ));
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- fail, mintab has row with c1 = 120
+UPDATE range_parted set a = 'b', c = 122 WHERE a = 'a' and c = 200;
 -- ok
-update range_parted set b = b + 1 where b = 10;
+UPDATE range_parted set a = 'b', c = 120 WHERE a = 'a' and c = 200;
+
+-- RLS policy expression contains whole row.
+
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+CREATE POLICY policy_range_parted_wholerow on range_parted AS RESTRICTIVE for UPDATE USING (true)
+   WITH CHECK (range_parted = row('b', 10, 112, 1, NULL)::range_parted);
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- ok, should pass the RLS check
+UPDATE range_parted set a = 'b', c = 112 WHERE a = 'a' and c = 200;
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- fail, the whole row RLS check should fail
+UPDATE range_parted set a = 'b', c = 116 WHERE a = 'a' and c = 200;
+
+-- Cleanup
+RESET SESSION AUTHORIZATION;
+DROP POLICY policy_range_parted ON range_parted;
+DROP POLICY policy_range_parted_subplan ON range_parted;
+DROP POLICY policy_range_parted_wholerow ON range_parted;
+REVOKE ALL ON range_parted, mintab FROM regress_range_parted_user;
+DROP USER regress_range_parted_user;
+DROP TABLE mintab;
+
+
+-- statement triggers with update row movement
+---------------------------------------------------
+
+:init_range_parted;
+
+CREATE FUNCTION trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+CREATE TRIGGER parent_delete_trig
+  AFTER DELETE ON range_parted for each statement execute procedure trigfunc();
+CREATE TRIGGER parent_update_trig
+  AFTER UPDATE ON range_parted for each statement execute procedure trigfunc();
+CREATE TRIGGER parent_insert_trig
+  AFTER INSERT ON range_parted for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_1_100
+CREATE TRIGGER c1_delete_trig
+  AFTER DELETE ON part_c_1_100 for each statement execute procedure trigfunc();
+CREATE TRIGGER c1_update_trig
+  AFTER UPDATE ON part_c_1_100 for each statement execute procedure trigfunc();
+CREATE TRIGGER c1_insert_trig
+  AFTER INSERT ON part_c_1_100 for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_d_1_15
+CREATE TRIGGER d1_delete_trig
+  AFTER DELETE ON part_d_1_15 for each statement execute procedure trigfunc();
+CREATE TRIGGER d1_update_trig
+  AFTER UPDATE ON part_d_1_15 for each statement execute procedure trigfunc();
+CREATE TRIGGER d1_insert_trig
+  AFTER INSERT ON part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+CREATE TRIGGER d15_delete_trig
+  AFTER DELETE ON part_d_15_20 for each statement execute procedure trigfunc();
+CREATE TRIGGER d15_update_trig
+  AFTER UPDATE ON part_d_15_20 for each statement execute procedure trigfunc();
+CREATE TRIGGER d15_insert_trig
+  AFTER INSERT ON part_d_15_20 for each statement execute procedure trigfunc();
+
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or
+-- insert statement triggers should be fired.
+UPDATE range_parted set c = c - 50 WHERE c > 97;
+:show_data;
+
+DROP TRIGGER parent_delete_trig ON range_parted;
+DROP TRIGGER parent_update_trig ON range_parted;
+DROP TRIGGER parent_insert_trig ON range_parted;
+DROP TRIGGER c1_delete_trig ON part_c_1_100;
+DROP TRIGGER c1_update_trig ON part_c_1_100;
+DROP TRIGGER c1_insert_trig ON part_c_1_100;
+DROP TRIGGER d1_delete_trig ON part_d_1_15;
+DROP TRIGGER d1_update_trig ON part_d_1_15;
+DROP TRIGGER d1_insert_trig ON part_d_1_15;
+DROP TRIGGER d15_delete_trig ON part_d_15_20;
+DROP TRIGGER d15_update_trig ON part_d_15_20;
+DROP TRIGGER d15_insert_trig ON part_d_15_20;
+
 
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
 insert into range_parted values ('c', 9);
@@ -134,19 +439,121 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 
-create table list_parted (
+:show_data;
+
+-- Update row movement from non-default to default partition.
+-- fail, default partition is not under part_a_10_a_20.
+UPDATE part_a_10_a_20 set a = 'ad' WHERE a = 'a';
+-- ok
+UPDATE range_parted set a = 'ad' WHERE a = 'a';
+UPDATE range_parted set a = 'bd' WHERE a = 'b';
+:show_data;
+-- Update row movement from default to non-default partitions.
+-- ok
+UPDATE range_parted set a = 'a' WHERE a = 'ad';
+UPDATE range_parted set a = 'b' WHERE a = 'bd';
+:show_data;
+
+-- Cleanup: range_parted no longer needed.
+DROP TABLE range_parted;
+
+CREATE TABLE list_parted (
 	a text,
 	b int
-) partition by list (a);
-create table list_part1  partition of list_parted for values in ('a', 'b');
-create table list_default partition of list_parted default;
-insert into list_part1 values ('a', 1);
-insert into list_default values ('d', 10);
+) PARTITION BY list (a);
+CREATE TABLE list_part1  PARTITION OF list_parted for VALUES in ('a', 'b');
+CREATE TABLE list_default PARTITION OF list_parted default;
+INSERT into list_part1 VALUES ('a', 1);
+INSERT into list_default VALUES ('d', 10);
 
 -- fail
-update list_default set a = 'a' where a = 'd';
+UPDATE list_default set a = 'a' WHERE a = 'd';
 -- ok
-update list_default set a = 'x' where a = 'd';
+UPDATE list_default set a = 'x' WHERE a = 'd';
+
+DROP TABLE list_parted;
+
+--------------
+-- Some more update-partition-key test scenarios below. This time use list
+-- partitions.
+--------------
+
+-- Setup for list partitions
+CREATE TABLE list_parted (a numeric, b int, c int8) PARTITION BY list (a);
+CREATE TABLE sub_parted PARTITION OF list_parted for VALUES in (1) PARTITION BY list (b);
+
+CREATE TABLE sub_part1(b int, c int8, a numeric);
+ALTER TABLE sub_parted ATTACH PARTITION sub_part1 for VALUES in (1);
+CREATE TABLE sub_part2(b int, c int8, a numeric);
+ALTER TABLE sub_parted ATTACH PARTITION sub_part2 for VALUES in (2);
+
+CREATE TABLE list_part1(a numeric, b int, c int8);
+ALTER TABLE list_parted ATTACH PARTITION list_part1 for VALUES in (2,3);
+
+INSERT into list_parted VALUES (2,5,50);
+INSERT into list_parted VALUES (3,6,60);
+INSERT into sub_parted VALUES (1,1,60);
+INSERT into sub_parted VALUES (1,2,10);
+
+-- Test partition constraint violation when an intermediate ancestor is used
+-- and the constraint is inherited from the upper root.
+UPDATE sub_parted set a = 2 WHERE c = 10;
+
+-- Test update-partition-key, where the unpruned partitions do not have their
+-- partition keys updated.
+SELECT tableoid::regclass::text , * FROM list_parted WHERE a = 2 ORDER BY 1;
+UPDATE list_parted set b = c + a WHERE a = 2;
+SELECT tableoid::regclass::text , * FROM list_parted WHERE a = 2 ORDER BY 1;
+
+
+-----------
+-- Tests for BR UPDATE triggers changing the partition key.
+-----------
+CREATE FUNCTION func_parted_mod_b() returns trigger as $$
+BEGIN
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1.
+UPDATE list_parted set c = 70 WHERE b  = 1;
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+
+DROP TRIGGER parted_mod_b ON sub_part1;
+
+-- If a BR DELETE trigger prevents the DELETE from happening, we should also
+-- skip the INSERT when that DELETE is part of an UPDATE => DELETE+INSERT.
+CREATE OR REPLACE FUNCTION func_parted_mod_b() returns trigger as $$
+BEGIN
+   raise notice 'Trigger: Got OLD row %, but returning NULL', OLD;
+   return NULL;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_skip_delete before delete on sub_part2
+   for each row execute procedure func_parted_mod_b();
+UPDATE list_parted set b = 1 WHERE c = 70;
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+-- Drop the trigger. Now the row should be moved.
+DROP TRIGGER trig_skip_delete ON sub_part2;
+UPDATE list_parted set b = 1 WHERE c = 70;
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+DROP FUNCTION func_parted_mod_b();
+
+-- UPDATE the partition key with a FROM clause. If the join produces multiple
+-- output rows for the same row to be modified, we should tuple-route the row
+-- only once; no extra rows should be inserted.
+CREATE TABLE non_parted (id int);
+INSERT into non_parted VALUES (1), (1), (1), (2), (2), (2), (3), (3), (3);
+UPDATE list_parted t1 set a = 2 FROM non_parted t2 WHERE t1.a = t2.id and a = 1;
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+DROP TABLE non_parted;
+
+-- Cleanup: list_parted no longer needed.
+DROP TABLE list_parted;
 
 -- create custom operator class and hash function, for the same reason
 -- explained in alter_table.sql
@@ -169,13 +576,12 @@ insert into hpart4 values (3, 4);
 
 -- fail
 update hpart1 set a = 3, b=4 where a = 1;
+-- ok, row movement
 update hash_parted set b = b - 1 where b = 1;
 -- ok
 update hash_parted set b = b + 8 where b = 1;
 
 -- cleanup
-drop table range_parted;
-drop table list_parted;
 drop table hash_parted;
 drop operator class custom_opclass using hash;
 drop function dummy_hashint4(a int4, seed int8);
#231David Rowley
david.rowley@2ndquadrant.com
In reply to: Amit Khandekar (#230)
Re: [HACKERS] UPDATE of partition key

Thanks for making those changes.

On 11 January 2018 at 04:00, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Yes, I understand that there won't be any update scan plans. But, with
the modifications done in ExecInitModifyTable(), I wanted to run that
code with this scenario where there are no partitions, to make sure it
does not behave weirdly or crash. Any suggestions for comments, given
this perspective ? For now, I have made the comment this way:

-- Check that partition-key UPDATE works sanely on a partitioned table
that does not have any child partitions.

Sounds good.

18. Why two RESET SESSION AUTHORIZATIONs?

reset session authorization;
drop trigger trig_d_1_15 ON part_d_1_15;
drop function func_d_1_15();
-- Policy expression contains SubPlan
reset session authorization;

The second reset is actually in a different paragraph. The reason it's
there is to ensure we have reset it regardless of the earlier cleanup.

hmm, I was reviewing the .out file, which does not have the empty
lines. Still seems a bit surplus.

Attached v35 patch. Thanks.

Thanks. I'll try to look at it soon.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#232Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: David Rowley (#231)
Re: [HACKERS] UPDATE of partition key

On 11 January 2018 at 10:44, David Rowley <david.rowley@2ndquadrant.com> wrote:

18. Why two RESET SESSION AUTHORIZATIONs?

reset session authorization;
drop trigger trig_d_1_15 ON part_d_1_15;
drop function func_d_1_15();
-- Policy expression contains SubPlan
reset session authorization;

The second reset is actually in a different paragraph. The reason it's
there is to ensure we have reset it regardless of the earlier cleanup.

hmm, I was reviewing the .out file, which does not have the empty
lines. Still seems a bit surplus.

I believe the output file does not have the blank lines present in the
.sql file. I was referring to the paragraph in the *.sql* file.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

#233Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#228)
Re: [HACKERS] UPDATE of partition key

On 9 January 2018 at 23:07, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 4, 2018 at 1:18 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

------------------
1. ExecUpdate() needs to revert back tcs_map value changed by ExecInsert()
------------------

It seems like the ON CONFLICT stuff handled that by adding a
second TransitionCaptureState pointer to ModifyTable, thus
mt_transition_capture and mt_oc_transition_capture. By that
precedent, we could add mt_utr_transition_capture or similar, and
maybe that's the way to go. It seems a bit unsatisfying, but so does
what you have now.

In case of ON CONFLICT, if there are both INSERT and UPDATE statement
triggers referencing transition tables, both of the triggers need to
independently populate their own transition tables, and hence the need
for two separate transition states : mt_transition_capture and
mt_oc_transition_capture. But in case of update-tuple-routing, the
INSERT statement trigger won't come into picture. So the same
mt_transition_capture can serve the purpose of populating the
transition table with OLD and NEW rows. So I think it would be too
redundant, if not incorrect, to have a whole new transition state for
update tuple routing.

I will see if it turns out better to have two tcs_maps in
TransitionCaptureState, one for update and one for insert. But this,
on first look, does not look good.

Suppose TransitionCaptureState has separate maps, upd_del_tcs_maps and
insert_tcs_maps for UPDATE/DELETE and INSERT events respectively.

That's not what I suggested. If you look at what I wrote, I floated
the idea of having two TransitionCaptureStates, not two separate maps
within the same TransitionCaptureState.

In the first paragraph of my explanation, I was explaining why two
transition capture states do not look like a good idea to me:

In case of ON CONFLICT, if there are both INSERT and UPDATE statement
triggers referencing transition tables, both of the triggers need to
independently populate their own transition tables, and hence the need
for two separate transition states : mt_transition_capture and
mt_oc_transition_capture. But in case of update-tuple-routing, the
INSERT statement trigger won't come into picture. So the same
mt_transition_capture can serve the purpose of populating the
transition table with OLD and NEW rows. So I think it would be too
redundant, if not incorrect, to have a whole new transition state for
update tuple routing.

And in the next para, I explained about the other alternative of
having two separate maps as against transition states.

------------------
2. mt_childparent_tupconv_maps is indexed by subplan or partition leaf index.
------------------
------------------
3. Renaming of mt_transition_tupconv_maps to mt_childparent_tupconv_maps
------------------

We need to change its name because now this map is not only used for
transition capture, but also for update-tuple-routing. Does it look ok
to you if, for readability, we keep the childparent tag ? Or else, we
can just make it "mt_tupconv_maps", but "mt_childparent_tupconv_maps"
looks more informative.

I see your point: the array is being renamed because it now has more
than one purpose. But that's also what I'm complaining about with
regard to point #2: the same array is being used for more than one
purpose. That's generally bad style. If you have two loops in a
function, it's best to declare two separate loop variables rather than
reusing the same variable. This lets the compiler detect, for
example, an error where the second loop variable is used before it's
initialized, which would be undetectable if you reused the same
variable in both places. Although that particular benefit doesn't
pertain in this case, I maintain that having a single structure member
that is indexed one of two different ways is a bad idea.

If I understand correctly, the way we got here is that, in earlier
patch versions, you had two arrays of maps, but it wasn't clear why we
needed both of them, and David suggested replacing one of them with an
array of indexes instead, in the hopes of reducing confusion.

Slight correction: it was suggested by Amit Langote, not by David.

However, it looks to me like that didn't really work out. If we
always needed both maps, or even if we always needed the per-leaf map,
it would have been a good idea, but it seems here that we can need
either the per-leaf map or the per-subplan map or both or neither, and
we want to avoid computing all of the per-leaf conversion maps if we
only need per-subplan access.

I was ok with either mine or Amit Langote's approach. His approach
uses an array of offsets into the leaf-partition array, which sounded
to me like it might be reusable for some similar purpose later.

I think one way to fix this might be to build the per-leaf maps on
demand. Just because we're doing UPDATE tuple routing doesn't
necessarily mean we'll actually need a TupleConversionMap for every
child. So we could allocate an array with one byte per leaf, where 0
means we don't know whether tuple conversion is necessary, 1 means it
is not, and 2 means it is, or something like that. Then we have a
second array with conversion maps. We provide a function
tupconv_map_for_leaf() or similar that checks the array; if it finds
1, it returns NULL; if it finds 2, it returns the conversion map
previously calculated. If it finds 0, it calls convert_tuples_by_name,
caches the result for later, updates the one-byte-per-leaf array with
the appropriate value, and returns the just-computed conversion map.
(The reason I'm suggesting 0/1/2 instead of just true/false is to
reduce cache misses; if we find a 1 in the first array we don't need
to access the second array at all.)

If that doesn't seem like a good idea for some reason, then my second
choice would be to leave mt_transition_tupconv_maps named the way it
is currently and have a separate mt_update_tupconv_maps, with the two
pointing, if both are initialized and as far as possible, to the same
TupleConversionMap objects.

So there are two independent optimizations we are talking about :

1. Create the map only when needed. We may not require a map for a
leaf partition if there is no insert happening to that partition. And,
the insert may be part of update-tuple-routing or a plain INSERT
tuple-routing. Also, we may not require map for *every* subplan. It
may happen that many of the update subplans do not return any tuples,
in which case we don't require the maps for the partitions
corresponding to those subplans. This optimization was also suggested
by Thomas Munro initially.

2. In case of UPDATE, for partitions that take part in update scans,
there should be a single map; there should not be two separate maps,
one for accessing per-subplan and the other for accessing per-leaf. My
approach for this was to have a per-leaf array and a per-subplan
array, but they should share the maps wherever possible. I think this
is what you are suggesting in your second choice. The other approach
is as suggested by Amit Langote (which is present in the latest
versions of the patch), where we have an array of maps, and a
subplan-offsets array.
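A minimal, hypothetical C sketch of this sharing (stand-alone stubs, not the
executor's actual data structures): the per-leaf view simply reuses the map
objects already built for the subplans, located through a subplan-to-leaf
offsets array.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct TupleConversionMap { int id; } TupleConversionMap;

/* Given one conversion map per update subplan and each subplan's offset
 * into the leaf-partition array, build a per-leaf view that shares the
 * same TupleConversionMap objects instead of duplicating them. */
TupleConversionMap **
build_per_leaf_maps(TupleConversionMap **per_subplan,
                    const int *subplan_leaf_index,
                    int nsubplans, int nleaves)
{
    TupleConversionMap **per_leaf =
        calloc(nleaves, sizeof(TupleConversionMap *));
    int     i;

    for (i = 0; i < nsubplans; i++)
        per_leaf[subplan_leaf_index[i]] = per_subplan[i];
    return per_leaf;
}
```

Leaves with no corresponding subplan stay NULL, which is where the on-demand
construction from optimization #1 would kick in.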

So your preference is for #1. But I think this optimization is not
specific to update-tuple-routing. This was applicable to inserts
also, from the beginning. And we can do this on-demand stuff for
subplan maps also.

Both optimizations are good, and they are independently required. But
I think optimization#2 is purely relevant to update-tuple-routing, so
we should do it now. We can do optimization #1 as a general
optimization, over and above optimization #2. So my opinion is, we do
#1 not as part of update-tuple-routing patch.

For optimization#2 (i.e. your second choice), I can revert back to the
way I had earlier used two different arrays, with per-leaf array
re-using the per-subplan maps.

Let me know if you are ok with this plan.

Then later once we do optimization #1, the maps will not be just
shared between per-subplan and per-leaf arrays, they will also be
created only when required.

Regarding the array names ...

Regardless of the approach, we are going to require two map arrays:
one per-subplan, and the other per-leaf. Now, for transition
capture, we would require both of these maps: per-subplan for
capturing updated rows, and per-leaf for routed rows. And during
update-tuple-routing, for converting the tuple from the source partition
to the root partition, we require only the per-subplan map.

So if we name the per-subplan map mt_transition_tupconv_maps, it
implies the per-leaf map is not used for transition capture, which is
incorrect. The same problem arises if we name the per-leaf map
mt_transition_tupconv_maps.

Update-tuple-routing uses only the per-subplan map, so the per-subplan
map can be named mt_update_tupconv_maps. But again, what do we name
the per-leaf map?

Noting all this, I feel we can go with names according to the
structure of the maps, something like mt_perleaf_tupconv_maps and
mt_persubplan_tupconv_maps. Other suggestions welcome.

-------------------
4. Explicit signaling for "we are only here for transition tables"
-------------------

I had given this some thought earlier. I felt that even the pre-existing
conditions like "!trigdesc->trig_update_after_row" are indirect
ways to determine that this function is called only to capture
transition tables, and that it may have been better to have a
separate parameter transition_table_only.

I see your point. I guess it's not really this patch's job to solve
this problem, although I think this is going to need some refactoring
in the not-too-distant future. So I think the way you did it is
probably OK.

Instead of adding another parameter to AfterTriggerSaveEvent(), I had
also considered another approach: put the transition-tuples-capture
logic of AfterTriggerSaveEvent() into a helper function
CaptureTransitionTables(). In ExecInsert() and ExecDelete(), instead
of calling ExecARUpdateTriggers(), call this function
CaptureTransitionTables(). I then dropped this idea, and thought it
better to call ExecARUpdateTriggers(), which neatly does the required
checks and other things like locking the old tuple via GetTupleForTrigger().
So if we go with CaptureTransitionTables(), we would need to do what
ExecARUpdateTriggers() does before calling CaptureTransitionTables().
This is doable. If you think this is worth doing so as to get rid of
the "(oldtup == NULL) ^ (newtup == NULL)" condition, we can do that.

Duplicating logic elsewhere to avoid this problem here doesn't seem
like a good plan.

Yeah, ok.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

#234Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#233)
Re: [HACKERS] UPDATE of partition key

On Thu, Jan 11, 2018 at 6:07 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

In the first paragraph of my explanation, I was explaining why two
transition capture states do not look like a good idea to me:

Oh, sorry. I didn't read what you wrote carefully enough, I guess.

I see your points. I think that there is probably a general need for
some refactoring here. AfterTriggerSaveEvent() got significantly more
complicated and harder to understand with the arrival of transition
tables, and this patch is adding more complexity still. It's also
adding complexity in other places to make ExecInsert() and
ExecDelete() usable for the semi-internal DELETE/INSERT operations
being produced when we split a partition key update into a DELETE and
INSERT pair. It would be awfully nice to have some better way to
separate out each of the different things we might or might not want
to do depending on the situation: capture old tuple, capture new
tuple, fire before triggers, fire after triggers, count processed
rows, set command tag, perform actual heap operation, update indexes,
etc. However, I don't have a specific idea how to do it better, so
maybe we should just get this committed for now and perhaps, with more
eyes on the code, someone will have a good idea.

Slight correction: it was suggested by Amit Langote, not by David.

Oh, OK, sorry.

So there are two independent optimizations we are talking about :

1. Create the map only when needed.
2. In case of UPDATE, for partitions that take part in update scans,
there should be a single map; there should not be two separate maps,
one for accessing per-subplan and the other for accessing per-leaf.

These optimizations aren't completely independent. Optimization #2
can be implemented in several different ways. The way you've chosen
to do it is to index the same array in two different ways depending on
whether per-leaf indexing is needed, which I think is
unacceptable. Another approach, which I proposed upthread, is to
always build the per-leaf mapping, but you pointed out that this could
involve doing a lot of unnecessary work in the case where most leaves
were pruned. However, if you also implement #1, then that problem
goes away. In other words, depending on the design you choose for #2,
you may or may not need to also implement optimization #1 to get good
performance.

To put that another way, I think Amit's idea of keeping a
subplan-offsets array is a pretty good one. From your comments, you
do too. But if we want to keep that, then we need a way to avoid the
expense of populating it for leaves that got pruned, except when we
are doing update row movement. Otherwise, I don't see much choice but
to jettison the subplan-offsets array and just maintain two separate
arrays of mappings.

Regarding the array names ...

Noting all this, I feel we can go with names according to the
structure of the maps, something like mt_perleaf_tupconv_maps and
mt_persubplan_tupconv_maps. Other suggestions welcome.

I'd probably do mt_per_leaf_tupconv_maps, since inserting an
underscore between some but not all words seems strange. But OK
otherwise.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#235Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#234)
Re: [HACKERS] UPDATE of partition key

On 12 January 2018 at 01:18, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 11, 2018 at 6:07 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

In the first paragraph of my explanation, I was explaining why two
transition capture states do not look like a good idea to me:

Oh, sorry. I didn't read what you wrote carefully enough, I guess.

I see your points. I think that there is probably a general need for
some refactoring here. AfterTriggerSaveEvent() got significantly more
complicated and harder to understand with the arrival of transition
tables, and this patch is adding more complexity still. It's also
adding complexity in other places to make ExecInsert() and
ExecDelete() usable for the semi-internal DELETE/INSERT operations
being produced when we split a partition key update into a DELETE and
INSERT pair. It would be awfully nice to have some better way to
separate out each of the different things we might or might not want
to do depending on the situation: capture old tuple, capture new
tuple, fire before triggers, fire after triggers, count processed
rows, set command tag, perform actual heap operation, update indexes,
etc. However, I don't have a specific idea how to do it better, so
maybe we should just get this committed for now and perhaps, with more
eyes on the code, someone will have a good idea.

Slight correction: it was suggested by Amit Langote, not by David.

Oh, OK, sorry.

So there are two independent optimizations we are talking about :

1. Create the map only when needed.
2. In case of UPDATE, for partitions that take part in update scans,
there should be a single map; there should not be two separate maps,
one for accessing per-subplan and the other for accessing per-leaf.

These optimizations aren't completely independent. Optimization #2
can be implemented in several different ways. The way you've chosen
to do it is to index the same array in two different ways depending on
whether per-leaf indexing is needed, which I think is
unacceptable. Another approach, which I proposed upthread, is to
always build the per-leaf mapping, but you pointed out that this could
involve doing a lot of unnecessary work in the case where most leaves
were pruned. However, if you also implement #1, then that problem
goes away. In other words, depending on the design you choose for #2,
you may or may not need to also implement optimization #1 to get good
performance.

To put that another way, I think Amit's idea of keeping a
subplan-offsets array is a pretty good one. From your comments, you
do too. But if we want to keep that, then we need a way to avoid the
expense of populating it for leaves that got pruned, except when we
are doing update row movement. Otherwise, I don't see much choice but
to jettison the subplan-offsets array and just maintain two separate
arrays of mappings.

Ok. So giving more thought on our both's points, here's what I feel we
can do ...

With the two arrays mt_per_leaf_tupconv_maps and
mt_per_subplan_tupconv_maps, we want the following things :
1. Create the map on-demand.
2. If possible, try to share the maps between the per-subplan and
per-leaf arrays.

For this, option 1 is :

-------

Both arrays' elements are made of this structure:

typedef struct TupleConversionMapInfo
{
    uint8       map_required;   /* 0: not known if map is required,
                                 * 1: map is created/required,
                                 * 2: map is not necessary */
    TupleConversionMap *map;
} TupleConversionMapInfo;

The arrays look like this:
TupleConversionMapInfo mt_per_subplan_tupconv_maps[];
TupleConversionMapInfo mt_per_leaf_tupconv_maps[];

When a per-subplan array is to be accessed at index i, a macro
get_tupconv_map(mt_per_subplan_tupconv_maps, i, forleaf=false) will be
called. This will create a new map if necessary, populate the array
element fields, and it will also copy this info into a corresponding
array element in the per-leaf array. To get to the per-leaf array
element, we need a subplan-offsets array. Whereas, if the per-leaf
array element is already populated, this info will be copied into the
subplan element in the opposite direction.

When a per-leaf array is to be accessed at index i,
get_tupconv_map(mt_per_leaf_tupconv_maps, i, forleaf=true) will be
called. Here, it will similarly update the per-leaf array element. But
it will not try to access the corresponding per-subplan array because
we don't have such offset array.

This is how the macro would look:

#define get_tupconv_map(mapinfo, i, perleaf) \
    ((mapinfo[i].map_required == 2) ? NULL : \
     ((mapinfo[i].map_required == 1) ? mapinfo[i].map : \
      create_new_map(mapinfo, i, perleaf)))

where create_new_map() will take care of populating the array element
on both the arrays, and then return the map if created, or NULL if not
required.

-------

Option 2 :

Elements of both arrays are pointers to a TupleConversionMapInfo structure.
The arrays look like this:
TupleConversionMapInfo *mt_per_subplan_tupconv_maps[];
TupleConversionMapInfo *mt_per_leaf_tupconv_maps[];

typedef struct TupleConversionMapInfo
{
    uint8       map_required;   /* 0 : map is not required, 1 : ... */
    TupleConversionMap *map;
} TupleConversionMapInfo;

So in ExecInitModifyTable(), for each of the array elements of both
arrays, we palloc TupleConversionMap structure, and wherever
applicable, a common palloc'ed structure is shared between the two
arrays. This way, subplan-offsets array is not required.

In this case, the macro get_tupconv_map() similarly populates the
structure, but it does not have to access the other map array, because
the structures are already shared in the two arrays.

The problem with this option is: since we have to share some of the
structures allocated for the array elements, we have to build the two
arrays together, but in the code the arrays are allocated as needed at
different points, such as when update tuple routing is required or when
transition tables are required. Also, beforehand we have to individually
palloc memory for a TupleConversionMapInfo for each of the array elements,
as against allocating memory for the whole array in a single palloc as
in option 1.

As of this writing, I am writing code relevant to adding the on-demand
logic, and I anticipate option 1 would turn out better than option 2.
But I would like to know if you are ok with both of these options.

------------

The reason why I am having map_required field inside a structure along
with the map, as against a separate array, is so that we can do the
on-demand allocation for both per-leaf array and per-subplan array.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

#236Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#235)
Re: [HACKERS] UPDATE of partition key

On Fri, Jan 12, 2018 at 5:12 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

The reason why I am having map_required field inside a structure along
with the map, as against a separate array, is so that we can do the
on-demand allocation for both per-leaf array and per-subplan array.

Putting the map_required field inside the structure with the map makes
it completely silly to do the 0/1/2 thing, because the whole structure
is going to be on the same cache line anyway. It won't save anything
to access the flag instead of a pointer in the same struct. Also,
the uint8 will be followed by 7 bytes of padding, because the pointer
that follows will need to begin on an 8-byte boundary (at least, on
64-bit machines), so this will use more memory.
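Assuming a typical LP64 ABI, the padding point can be checked with a small
illustrative struct (MapInfo below is a stand-in for the proposed
TupleConversionMapInfo, not actual patch code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: a one-byte flag followed by a pointer. On common
 * ABIs the pointer must start on an alignof(void *) boundary, so the
 * flag byte drags in sizeof(void *) - 1 bytes of padding. */
typedef struct MapInfo
{
    uint8_t map_required;       /* 1 byte, then padding */
    void   *map;                /* aligned to the pointer size */
} MapInfo;
```

With separate uint8 and pointer arrays, the flag bytes pack densely instead,
which is what the suggestion above buys.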

What I suggest is:

#define MT_CONVERSION_REQUIRED_UNKNOWN 0
#define MT_CONVERSION_REQUIRED_YES 1
#define MT_CONVERSION_REQUIRED_NO 2

In ModifyTableState:

uint8 *mt_per_leaf_tupconv_required;
TupleConversionMap **mt_per_leaf_tupconv_maps;

In PartitionTupleRouting:

int *subplan_partition_offsets;

When you initialize the ModifyTableState, do this:

mtstate->mt_per_leaf_tupconv_required = palloc0(sizeof(uint8) *
numResultRelInfos);
mtstate->mt_per_leaf_tupconv_maps = palloc0(sizeof(TupleConversionMap
*) * numResultRelInfos);

When somebody needs a map, then

(1) if they need it by subplan index, first use
subplan_partition_offsets to convert it to a per-leaf index

(2) then write a function that takes the per-leaf index and does this:

switch (mtstate->mt_per_leaf_tupconv_required[leaf_part_index])
{
    case MT_CONVERSION_REQUIRED_UNKNOWN:
        map = convert_tuples_by_name(...);
        if (map == NULL)
            mtstate->mt_per_leaf_tupconv_required[leaf_part_index] =
                MT_CONVERSION_REQUIRED_NO;
        else
        {
            mtstate->mt_per_leaf_tupconv_required[leaf_part_index] =
                MT_CONVERSION_REQUIRED_YES;
            mtstate->mt_per_leaf_tupconv_maps[leaf_part_index] = map;
        }
        return map;
    case MT_CONVERSION_REQUIRED_YES:
        return mtstate->mt_per_leaf_tupconv_maps[leaf_part_index];
    case MT_CONVERSION_REQUIRED_NO:
        return NULL;
}

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#237Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#236)
Re: [HACKERS] UPDATE of partition key

On 12 January 2018 at 20:24, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jan 12, 2018 at 5:12 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

The reason why I am having map_required field inside a structure along
with the map, as against a separate array, is so that we can do the
on-demand allocation for both per-leaf array and per-subplan array.

Putting the map_required field inside the structure with the map makes
it completely silly to do the 0/1/2 thing, because the whole structure
is going to be on the same cache line anyway. It won't save anything
to access the flag instead of a pointer in the same struct.

I see. Got it.

Also,
the uint8 will be followed by 7 bytes of padding, because the pointer
that follows will need to begin on an 8-byte boundary (at least, on
64-bit machines), so this will use more memory.

What I suggest is:

#define MT_CONVERSION_REQUIRED_UNKNOWN 0
#define MT_CONVERSION_REQUIRED_YES 1
#define MT_CONVERSION_REQUIRED_NO 2

In ModifyTableState:

uint8 *mt_per_leaf_tupconv_required;
TupleConversionMap **mt_per_leaf_tupconv_maps;

In PartitionTupleRouting:

int *subplan_partition_offsets;

When you initialize the ModifyTableState, do this:

mtstate->mt_per_leaf_tupconv_required = palloc0(sizeof(uint8) *
numResultRelInfos);
mtstate->mt_per_leaf_tupconv_maps = palloc0(sizeof(TupleConversionMap
*) * numResultRelInfos);

A few points below where I wanted to confirm that we are on the same page ...

When somebody needs a map, then

(1) if they need it by subplan index, first use
subplan_partition_offsets to convert it to a per-leaf index

Before that, we need to check if there *is* an offset array. If there
are no partitions, there is only going to be a per-subplan array;
there won't be an offsets array. But I guess you are saying: "do the
on-demand allocation only for leaf partitions; if there are no
partitions, the per-subplan maps will always be allocated for each of
the subplans from the beginning". So if there is no offset array,
just return mtstate->mt_per_subplan_tupconv_maps[subplan_index]
without any further checks.

(2) then write a function that takes the per-leaf index and does this:

switch (mtstate->mt_per_leaf_tupconv_required[leaf_part_index])
{
case MT_CONVERSION_REQUIRED_UNKNOWN:
map = convert_tuples_by_name(...);
if (map == NULL)
mtstate->mt_per_leaf_tupconv_required[leaf_part_index] =
MT_CONVERSION_REQUIRED_NO;
else
{
mtstate->mt_per_leaf_tupconv_required[leaf_part_index] =
MT_CONVERSION_REQUIRED_YES;
mtstate->mt_per_leaf_tupconv_maps[leaf_part_index] = map;
}
return map;
case MT_CONVERSION_REQUIRED_YES:
return mtstate->mt_per_leaf_tupconv_maps[leaf_part_index];
case MT_CONVERSION_REQUIRED_NO:
return NULL;
}

Yeah, right.

But after that, I am not sure why the mt_per_sub_plan_maps[] array is
needed. We are always going to convert the subplan index into a leaf
index, so the per-subplan map array will not come into the picture. Or
are you saying it will be allocated and used only when there are no
partitions? In one of your earlier replies, you did mention trying to
share the maps between the two arrays, which means you were
considering both arrays being used at the same time.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

#238Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#237)
Re: [HACKERS] UPDATE of partition key

On Fri, Jan 12, 2018 at 12:23 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

(1) if they need it by subplan index, first use
subplan_partition_offsets to convert it to a per-leaf index

Before that, we need to check if there *is* an offset array. If there
are no partitions, there is only going to be a per-subplan array;
there won't be an offsets array. But I guess you are saying: "do the
on-demand allocation only for leaf partitions; if there are no
partitions, the per-subplan maps will always be allocated for each of
the subplans from the beginning". So if there is no offset array,
just return mtstate->mt_per_subplan_tupconv_maps[subplan_index]
without any further checks.

Oops. I forgot that there might not be partitions. I was assuming
that mt_per_subplan_tupconv_maps wouldn't exist at all, and we'd
always use subplan_partition_offsets. But that won't work in the
inheritance case.

But after that, I am not sure why the mt_per_sub_plan_maps[] array is
needed. We are always going to convert the subplan index into a leaf
index, so the per-subplan map array will not come into the picture. Or
are you saying it will be allocated and used only when there are no
partitions? In one of your earlier replies, you did mention trying to
share the maps between the two arrays, which means you were
considering both arrays being used at the same time.

We'd use them both at the same time if we didn't have, or didn't use,
subplan_partition_offsets, but if we have subplan_partition_offsets
and can use it then we don't need mt_per_sub_plan_maps.

I guess I'm inclined to keep mt_per_sub_plan_maps for the case where
there are no partitions, but not use it when partitions are present.
What do you think about that?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#239Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#238)
Re: [HACKERS] UPDATE of partition key

On 13 January 2018 at 02:56, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jan 12, 2018 at 12:23 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

(1) if they need it by subplan index, first use
subplan_partition_offsets to convert it to a per-leaf index

Before that, we need to check if there *is* an offset array. If there
are no partitions, there is only going to be a per-subplan array;
there won't be an offsets array. But I guess you are saying: "do the
on-demand allocation only for leaf partitions; if there are no
partitions, the per-subplan maps will always be allocated for each of
the subplans from the beginning". So if there is no offset array,
just return mtstate->mt_per_subplan_tupconv_maps[subplan_index]
without any further checks.

Oops. I forgot that there might not be partitions. I was assuming
that mt_per_subplan_tupconv_maps wouldn't exist at all, and we'd
always use subplan_partition_offsets. But that won't work in the
inheritance case.

But after that, I am not sure why the mt_per_sub_plan_maps[] array is
needed. We are always going to convert the subplan index into a leaf
index, so the per-subplan map array will not come into the picture. Or
are you saying it will be allocated and used only when there are no
partitions? In one of your earlier replies, you did mention trying to
share the maps between the two arrays, which means you were
considering both arrays being used at the same time.

We'd use them both at the same time if we didn't have, or didn't use,
subplan_partition_offsets, but if we have subplan_partition_offsets
and can use it then we don't need mt_per_sub_plan_maps.

I guess I'm inclined to keep mt_per_sub_plan_maps for the case where
there are no partitions, but not use it when partitions are present.
What do you think about that?

Even where partitions are present, in the usual case where there are
no transition tables we won't require per-leaf map at all [1]. So I
think we should keep mt_per_sub_plan_maps only for the case where
per-leaf map is not allocated. And we will not allocate
mt_per_sub_plan_maps when mt_per_leaf_maps is needed. In other words,
exactly one of the two maps will be allocated.

This is turning out to be close to what's already there in the last
patch versions: use a single map array, and an offsets array. The
difference is: in the patch I am using the *same* variable for the
two maps, whereas now we are talking about two different array
variables for maps, but only allocating one of them.

Are you ok with this? I think the thing you were against was to have
a common *variable* for two purposes. But above, I am saying we have
two variables but assign a map array to only *one* of them and leave
the other unused.

---------

Regarding the on-demand map allocation ....
Where mt_per_sub_plan_maps is allocated, we won't have the on-demand
allocation: all the maps will be allocated initially. That is because
the map_is_required array is only per-leaf; otherwise, again, we would
need to keep another map_is_required array for the per-subplan maps.
Maybe we can support the on-demand stuff for subplan maps also, but
only as a separate change after we are done with update-partition-key.

---------

Regarding mt_per_leaf_tupconv_required, I am thinking we can make it a
bool array, and name it mt_per_leaf_map_not_required. When it is
true for a given index, it means we have already called
convert_tuples_by_name() and it returned NULL; i.e. we are sure
that the map is not required. A false value means we still need to
call convert_tuples_by_name() if the map is NULL, and then set
mt_per_leaf_map_not_required to (map == NULL).

Instead of a bool array, we could even make it a Bitmapset. But I think
access would become slower than with an array, particularly because
it is going to be a heavily used function.

---------

[1]:
- For update-tuple-routing, only per-subplan access is required;
- For transition tables, per-subplan access is required,
and additionally per-leaf access is required when tuples are
update-routed
- So if both update-tuple-routing and transition tables are
required, both of the maps are needed.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

#240Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#229)
1 attachment(s)
Re: [HACKERS] UPDATE of partition key

On 10 January 2018 at 02:30, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jan 5, 2018 at 3:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jan 5, 2018 at 7:12 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

The above patch is to be applied over the last remaining preparatory
patch, now named (and attached) :
0001-Refactor-CheckConstraint-related-code.patch

Committed that one, too.

Some more comments on the main patch:

I don't really like the fact that ExecCleanupTupleRouting() now takes
a ModifyTableState as an argument, particularly because of the way
that is using that argument. To figure out whether a ResultRelInfo
was pre-existing or one it created, it checks whether the pointer
address of the ResultRelInfo is >= mtstate->resultRelInfo and <
mtstate->resultRelInfo + mtstate->mt_nplans. However, that means that
ExecCleanupTupleRouting() ends up knowing about the memory allocation
pattern used by ExecInitModifyTable(), which seems like a slightly
dangerous amount of action at a distance. I think it would be better
for the PartitionTupleRouting structure to explicitly indicate which
ResultRelInfos should be closed, for example by storing a Bitmapset
*input_partitions. (Here, by "input", I mean "provided from the
mtstate rather than created by the PartitionTupleRouting structure";
other naming suggestions welcome.) When
ExecSetupPartitionTupleRouting latches onto a partition, it can do
proute->input_partitions = bms_add_member(proute->input_partitions, i).
In ExecCleanupTupleRouting, it can do if
(bms_is_member(proute->input_partitions, i)) continue.

Did the changes. But instead of a new Bitmapset, I used the offset
array for the purpose. As per our parallel discussion on
tuple-conversion maps, it is almost finalized that the subplan-partition
offset map is good to have. So I have used that offset array to
determine whether a partition is present among the subplan result
rels, relying on the assumption that the subplan and partition arrays
have their partitions in

We have a test, in the regression test suite for file_fdw, which
generates the message "cannot route inserted tuples to a foreign
table". I think we should have a similar test for the case where an
UPDATE tries to move a tuple from a regular partition to a foreign
table partition.

Added an UPDATE scenario in contrib/file_fdw/input/file_fdw.source.

I'm not sure if it should fail with the same error
or a different one, but I think we should have a test that it fails
cleanly and with a nice error message of some sort.

The update-tuple-routing goes through the same ExecInsert() code, so
it fails at the same place with the same error message.

The comment for get_partitioned_child_rels() claims that it sets
is_partition_key_update, but it really sets *is_partition_key_update.
And I think instead of "is a partition key" it should say "is used in
the partition key either of the relation whose RTI is specified or of
any child relation." I propose "used in" instead of "is" because
there can be partition expressions, and the rest is to clarify that
child partition keys matter.

Fixed.

create_modifytable_path uses partColsUpdated rather than
partKeyUpdated, which actually seems like better terminology. I
propose partKeyUpdated -> partColsUpdated everywhere. Also, why use
is_partition_key_update for basically the same thing in some other
places? I propose changing that to partColsUpdated as well.

Done.

The capitalization of the first comment hunk in execPartition.h is strange.

I think you are referring to:
* subplan_partition_offsets int Array ordered by UPDATE subplans. Each
Changed "Array" to "array". Didn't change "UPDATE".

Attached v36 patch.

Attachments:

update-partition-key_v36.patch (application/octet-stream)
diff --git a/contrib/file_fdw/input/file_fdw.source b/contrib/file_fdw/input/file_fdw.source
index e6821d6..88cb5f2 100644
--- a/contrib/file_fdw/input/file_fdw.source
+++ b/contrib/file_fdw/input/file_fdw.source
@@ -178,6 +178,7 @@ SELECT tableoid::regclass, * FROM p1;
 SELECT tableoid::regclass, * FROM p2;
 INSERT INTO pt VALUES (1, 'xyzzy'); -- ERROR
 INSERT INTO pt VALUES (2, 'xyzzy');
+UPDATE pt set a = 1 where a = 2; -- ERROR
 SELECT tableoid::regclass, * FROM pt;
 SELECT tableoid::regclass, * FROM p1;
 SELECT tableoid::regclass, * FROM p2;
diff --git a/contrib/file_fdw/output/file_fdw.source b/contrib/file_fdw/output/file_fdw.source
index 709c43e..e07bb24 100644
--- a/contrib/file_fdw/output/file_fdw.source
+++ b/contrib/file_fdw/output/file_fdw.source
@@ -344,6 +344,8 @@ SELECT tableoid::regclass, * FROM p2;
 INSERT INTO pt VALUES (1, 'xyzzy'); -- ERROR
 ERROR:  cannot route inserted tuples to a foreign table
 INSERT INTO pt VALUES (2, 'xyzzy');
+UPDATE pt set a = 1 where a = 2; -- ERROR
+ERROR:  cannot route inserted tuples to a foreign table
 SELECT tableoid::regclass, * FROM pt;
  tableoid | a |   b   
 ----------+---+-------
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index b1167a4..6d97f26 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3005,6 +3005,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition where this row satisfies its partition constraint.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3302,9 +3307,22 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</command> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</command> causes a row to move from one
+       partition to another, there is a chance that another concurrent
+       <command>UPDATE</command> or <command>DELETE</command> misses this row.
+       Suppose, session 1 is performing an <command>UPDATE</command> on a
+       partition key, and meanwhile a concurrent session 2 for which this row
+       is visible, performs an <command>UPDATE</command> or
+       <command>DELETE</command> operation on this row. Session 2 can silently
+       miss the row if the row is deleted from the partition due to session
+       1's activity.  In such case, session 2's
+       <command>UPDATE</command>/<command>DELETE</command>, being unaware of
+       the row movement, interprets that the row has just been deleted so there
+       is nothing to be done for this row. Whereas, in the usual case where the
+       table is not partitioned, or where there is no row movement, session 2
+       would have identified the newly updated row and carried out the
+       <command>UPDATE</command>/<command>DELETE</command> on this new row
+       version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index c0d0f71..296e301 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,16 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</command> and
+   <command>INSERT</command> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</command> or <command>DELETE</command> on the
+   same row may miss this row. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations"/>.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index bf5d3f9..8f83e6a 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -154,6 +154,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to move
+    to another partition, it will be performed as a <command>DELETE</command>
+    from the original partition followed by an <command>INSERT</command> into
+    the new partition. In this case, all row-level <literal>BEFORE</literal>
+    <command>UPDATE</command> triggers and all row-level
+    <literal>BEFORE</literal> <command>DELETE</command> triggers are fired on
+    the original partition. Then all row-level <literal>BEFORE</literal>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</literal>
+    triggers are concerned, <literal>AFTER</literal> <command>DELETE</command>
+    and <literal>AFTER</literal> <command>INSERT</command> triggers are
+    applied; but <literal>AFTER</literal> <command>UPDATE</command> triggers
+    are not applied because the <command>UPDATE</command> has been converted to
+    a <command>DELETE</command> and an <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 6bfca2a..d869ac5 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2587,7 +2587,6 @@ CopyFrom(CopyState cstate)
 		if (cstate->partition_tuple_routing)
 		{
 			int			leaf_part_index;
-			TupleConversionMap *map;
 			PartitionTupleRouting *proute = cstate->partition_tuple_routing;
 
 			/*
@@ -2668,23 +2667,10 @@ CopyFrom(CopyState cstate)
 			 * We might need to convert from the parent rowtype to the
 			 * partition rowtype.
 			 */
-			map = proute->partition_tupconv_maps[leaf_part_index];
-			if (map)
-			{
-				Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-				tuple = do_convert_tuple(tuple, map);
-
-				/*
-				 * We must use the partition's tuple descriptor from this
-				 * point on.  Use a dedicated slot from this point on until
-				 * we're finished dealing with the partition.
-				 */
-				slot = proute->partition_tuple_slot;
-				Assert(slot != NULL);
-				ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-				ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-			}
+			tuple = ConvertPartitionTupleSlot(proute->partition_tupconv_maps[leaf_part_index],
+											  tuple,
+											  proute->partition_tuple_slot,
+											  &slot);
 
 			tuple->t_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
 		}
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 1c488c3..e8af18e 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2854,8 +2854,13 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 	{
 		HeapTuple	trigtuple;
 
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
+		/*
+		 * Note: if the UPDATE is converted into a DELETE+INSERT as part of
+		 * update-partition-key operation, then this function is also called
+		 * separately for DELETE and INSERT to capture transition table rows.
+		 * In such case, either old tuple or new tuple can be NULL.
+		 */
+		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
 			trigtuple = GetTupleForTrigger(estate,
 										   NULL,
 										   relinfo,
@@ -5414,7 +5419,12 @@ AfterTriggerPendingOnRel(Oid relid)
  *	triggers actually need to be queued.  It is also called after each row,
  *	even if there are no triggers for that event, if there are any AFTER
  *	STATEMENT triggers for the statement which use transition tables, so that
- *	the transition tuplestores can be built.
+ *	the transition tuplestores can be built.  Furthermore, if the transition
+ *	capture is happening for UPDATEd rows being moved to another partition due
+ *	to the partition-key being changed, then this function is called once when
+ *	the row is deleted (to capture OLD row), and once when the row is inserted
+ *	into another partition (to capture NEW row).  This is done separately because
+ *	DELETE and INSERT happen on different tables.
  *
  *	Transition tuplestores are built now, rather than when events are pulled
  *	off of the queue because AFTER ROW triggers are allowed to select from the
@@ -5463,12 +5473,25 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		bool		update_new_table = transition_capture->tcs_update_new_table;
 		bool		insert_new_table = transition_capture->tcs_insert_new_table;;
 
-		if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_old_table))
+		/*
+		 * For INSERT events newtup should be non-NULL, for DELETE events
+		 * oldtup should be non-NULL, whereas for UPDATE events normally both
+		 * oldtup and newtup are non-NULL.  But for UPDATE events fired for
+		 * capturing transition tuples during UPDATE partition-key row
+		 * movement, oldtup is NULL when the event is for a row being inserted,
+		 * whereas newtup is NULL when the event is for a row being deleted.
+		 */
+		Assert(!(event == TRIGGER_EVENT_DELETE && delete_old_table &&
+				 oldtup == NULL));
+		Assert(!(event == TRIGGER_EVENT_INSERT && insert_new_table &&
+				 newtup == NULL));
+
+		if (oldtup != NULL &&
+			((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_old_table)))
 		{
 			Tuplestorestate *old_tuplestore;
 
-			Assert(oldtup != NULL);
 			old_tuplestore = transition_capture->tcs_private->old_tuplestore;
 
 			if (map != NULL)
@@ -5481,12 +5504,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			else
 				tuplestore_puttuple(old_tuplestore, oldtup);
 		}
-		if ((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_new_table))
+		if (newtup != NULL &&
+			((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
+			(event == TRIGGER_EVENT_UPDATE && update_new_table)))
 		{
 			Tuplestorestate *new_tuplestore;
 
-			Assert(newtup != NULL);
 			new_tuplestore = transition_capture->tcs_private->new_tuplestore;
 
 			if (original_insert_tuple != NULL)
@@ -5502,11 +5525,18 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				tuplestore_puttuple(new_tuplestore, newtup);
 		}
 
-		/* If transition tables are the only reason we're here, return. */
+		/*
+		 * If transition tables are the only reason we're here, return. As
+		 * mentioned above, we can also be here during update tuple routing in
+		 * presence of transition tables, in which case this function is called
+		 * separately for oldtup and newtup, so we expect exactly one of them
+		 * to be NULL.
+		 */
 		if (trigdesc == NULL ||
 			(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
 			(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+			(event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
 			return;
 	}
 
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 8c0d2df..5100d82 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -54,7 +54,11 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
+	ResultRelInfo *leaf_part_arr = NULL,
+				  *update_rri = NULL;
+	int			num_update_rri = 0,
+				update_rri_index = 0;
+	bool		is_update = false;
 	PartitionTupleRouting *proute;
 
 	/*
@@ -73,6 +77,52 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 		(TupleConversionMap **) palloc0(proute->num_partitions *
 										sizeof(TupleConversionMap *));
 
+	/* Initialization specific to update */
+	if (mtstate && mtstate->operation == CMD_UPDATE)
+	{
+		ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+
+		is_update = true;
+		update_rri = mtstate->resultRelInfo;
+		num_update_rri = list_length(node->plans);
+
+		/*
+		 * For updates, if the leaf partition is already present in the
+		 * per-subplan result rels, we re-use that rather than initialize a new
+		 * result rel. The per-subplan resultrels and the resultrels of the
+		 * leaf partitions are both in the same canonical order. So while going
+		 * through the leaf partition oids, we need to keep track of the next
+		 * per-subplan result rel to be looked for in the leaf partition
+		 * resultrels. So, update_rri_index should be set to the first
+		 * per-subplan result rel (i.e. 0), and then should be shifted as we
+		 * find them one by one while scanning the leaf partition oids. (It is
+		 * already set to 0 during initialization, above).
+		 */
+
+		/*
+		 * Prepare for generating the mapping from subplan result rels to leaf
+		 * partition position.
+		 */
+		proute->subplan_partition_offsets =
+			palloc(num_update_rri * sizeof(int));
+
+		/*
+		 * For UPDATEs, we require an additional tuple slot for storing
+		 * transient tuples that are converted to the root table descriptor.
+		 */
+		proute->root_tuple_slot = MakeTupleTableSlot();
+	}
+	else
+	{
+		/*
+		 * For inserts, we need to create all new result rels, so avoid
+		 * repeated pallocs by allocating memory for all the result rels in
+		 * bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(proute->num_partitions *
+												  sizeof(ResultRelInfo));
+	}
+
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
 	 * given partition's rowtype.  It is attached to the caller-specified node
@@ -81,20 +131,67 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 	 */
 	proute->partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = (ResultRelInfo *) palloc0(proute->num_partitions *
-											  sizeof(ResultRelInfo));
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (is_update)
+		{
+			/* Is this leaf partition present in the update resultrel? */
+			if (update_rri_index < num_update_rri &&
+				RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
+			{
+				leaf_part_rri = &update_rri[update_rri_index];
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required when we convert the partition's tuple to be
+				 * compatible with the root partitioned table's tuple
+				 * descriptor.  When generating the per-subplan UPDATE result
+				 * rels, this was not set.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+
+				/*
+				 * Save the position of this update rel in the leaf partitions
+				 * array
+				 */
+				proute->subplan_partition_offsets[update_rri_index] = i;
+
+				update_rri_index++;
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = &leaf_part_arr[i];
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in proute->partitions are
-		 * eventually closed by the caller.
+		 * If we didn't open the partition rel, it means we haven't initialized
+		 * the result rel either.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above including the leaf
+			 * partitions. Note that each of the newly opened relations in
+			 * proute->partitions are eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri,
+							  partrel,
+							  resultRTindex,
+							  rel,
+							  estate->es_instrument);
+		}
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
@@ -105,14 +202,10 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 			convert_tuples_by_name(tupDesc, part_tupdesc,
 								   gettext_noop("could not convert row type"));
 
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  resultRTindex,
-						  rel,
-						  estate->es_instrument);
-
 		/*
-		 * Verify result relation is a valid target for INSERT.
+		 * Verify result relation is a valid target for an INSERT.  An UPDATE
+		 * of a partition-key becomes a DELETE+INSERT operation, so this check
+		 * is still required when the operation is CMD_UPDATE.
 		 */
 		CheckValidResultRel(leaf_part_rri, CMD_INSERT);
 
@@ -132,10 +225,16 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		proute->partitions[i] = leaf_part_rri++;
+		proute->partitions[i] = leaf_part_rri;
 		i++;
 	}
 
+	/*
+	 * For UPDATE, we should have found all the per-subplan resultrels in the
+	 * leaf partitions.
+	 */
+	Assert(!is_update || update_rri_index == num_update_rri);
+
 	return proute;
 }
 
@@ -259,6 +358,37 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 }
 
 /*
+ * ConvertPartitionTupleSlot -- convenience function for tuple conversion using
+ * 'map'. The tuple, if converted, is stored in 'new_slot', and 'p_my_slot' is
+ * updated with the 'new_slot'. 'new_slot' typically should be one of the
+ * dedicated partition tuple slots. If map is NULL, keeps p_my_slot unchanged.
+ *
+ * Returns the converted tuple, unless map is NULL, in which case original
+ * tuple is returned unmodified.
+ */
+HeapTuple
+ConvertPartitionTupleSlot(TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot,
+						  TupleTableSlot **p_my_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor, as per converted tuple.
+	 */
+	*p_my_slot = new_slot;
+	Assert(new_slot != NULL);
+	ExecSetSlotDescriptor(new_slot, map->outdesc);
+	ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
+/*
  * ExecCleanupTupleRouting -- Clean up objects allocated for partition tuple
  * routing.
  *
@@ -268,6 +398,7 @@ void
 ExecCleanupTupleRouting(PartitionTupleRouting * proute)
 {
 	int			i;
+	int			subplan_index;
 
 	/*
 	 * Remember, proute->partition_dispatch_info[0] corresponds to the root
@@ -284,15 +415,34 @@ ExecCleanupTupleRouting(PartitionTupleRouting * proute)
 		ExecDropSingleTupleTableSlot(pd->tupslot);
 	}
 
-	for (i = 0; i < proute->num_partitions; i++)
+	for (subplan_index = i = 0; i < proute->num_partitions; i++)
 	{
 		ResultRelInfo *resultRelInfo = proute->partitions[i];
 
+		/*
+		 * If this result rel is one of the UPDATE subplan result rels, let
+		 * ExecEndPlan() close it. For INSERT or COPY,
+		 * proute->subplan_partition_offsets will always be NULL. Note that the
+		 * subplan_partition_offsets array lists its entries in the same order
+		 * as the partitions array, so while iterating over the partitions
+		 * array we advance through subplan_partition_offsets in lockstep to
+		 * determine which of the result rels belong to the UPDATE subplans.
+		 */
+		if (proute->subplan_partition_offsets &&
+			proute->subplan_partition_offsets[subplan_index] == i)
+		{
+			subplan_index++;
+			continue;
+		}
+
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
 
-	/* Release the standalone partition tuple descriptor, if any */
+	/* Release the standalone partition tuple descriptors, if any */
+	if (proute->root_tuple_slot)
+		ExecDropSingleTupleTableSlot(proute->root_tuple_slot);
 	if (proute->partition_tuple_slot)
 		ExecDropSingleTupleTableSlot(proute->partition_tuple_slot);
 }
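
To make the intent of the tuple-routing setup and cleanup above concrete, here is an illustrative SQL session (not part of the patch; table and column names are invented) showing the row movement this patch enables, assuming a simple range-partitioned table:

```sql
CREATE TABLE measurements (city_id int, logdate date)
    PARTITION BY RANGE (logdate);
CREATE TABLE measurements_2017 PARTITION OF measurements
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');
CREATE TABLE measurements_2018 PARTITION OF measurements
    FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');

INSERT INTO measurements VALUES (1, '2017-06-01');

-- Without this patch, the following fails with a partition constraint
-- violation.  With it, the row is deleted from measurements_2017 and
-- re-inserted into measurements_2018.
UPDATE measurements SET logdate = '2018-06-01' WHERE city_id = 1;
```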
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 55dff5b..5f1c51f 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -46,6 +46,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -63,7 +64,9 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 EState *estate,
 					 bool canSetTag,
 					 TupleTableSlot **returning);
-
+static void ExecSetupChildParentMap(ModifyTableState *mtstate, bool perleaf);
+static inline TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
+													int whichplan);
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
  * target relation's rowtype
@@ -266,6 +269,7 @@ ExecInsert(ModifyTableState *mtstate,
 	Oid			newId;
 	List	   *recheckIndexes = NIL;
 	TupleTableSlot *result = NULL;
+	TransitionCaptureState *ar_insert_trig_tcs;
 
 	/*
 	 * get the heap tuple out of the tuple table slot, making sure we have a
@@ -283,7 +287,6 @@ ExecInsert(ModifyTableState *mtstate,
 	{
 		int			leaf_part_index;
 		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-		TupleConversionMap *map;
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
@@ -332,8 +335,10 @@ ExecInsert(ModifyTableState *mtstate,
 				 * back to tuplestore format.
 				 */
 				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+
+				Assert(mtstate->mt_is_tupconv_perpart);
 				mtstate->mt_transition_capture->tcs_map =
-					mtstate->mt_transition_tupconv_maps[leaf_part_index];
+					mtstate->mt_childparent_tupconv_maps[leaf_part_index];
 			}
 			else
 			{
@@ -346,30 +351,20 @@ ExecInsert(ModifyTableState *mtstate,
 			}
 		}
 		if (mtstate->mt_oc_transition_capture != NULL)
+		{
+			Assert(mtstate->mt_is_tupconv_perpart);
 			mtstate->mt_oc_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[leaf_part_index];
+				mtstate->mt_childparent_tupconv_maps[leaf_part_index];
+		}
 
 		/*
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = proute->partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = proute->partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(proute->partition_tupconv_maps[leaf_part_index],
+										  tuple,
+										  proute->partition_tuple_slot,
+										  &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -450,6 +445,7 @@ ExecInsert(ModifyTableState *mtstate,
 	}
 	else
 	{
+		WCOKind		wco_kind;
 		/*
 		 * We always check the partition constraint, including when the tuple
 		 * got here via tuple-routing.  However we don't need to in the latter
@@ -467,14 +463,21 @@ ExecInsert(ModifyTableState *mtstate,
 		tuple->t_tableOid = RelationGetRelid(resultRelationDesc);
 
 		/*
-		 * Check any RLS INSERT WITH CHECK policies
+		 * Check any RLS WITH CHECK policies.
 		 *
+		 * Normally we should check INSERT policies.  But if the insert is
+		 * part of update-row-movement, we should instead check UPDATE
+		 * policies, because we are executing policies defined on the target
+		 * table rather than those defined on the child partitions.
+		 */
+		wco_kind = (mtstate->operation == CMD_UPDATE) ?
+					WCO_RLS_UPDATE_CHECK : WCO_RLS_INSERT_CHECK;
+		/*
 		 * ExecWithCheckOptions() will skip any WCOs which are not of the kind
 		 * we are looking for at this point.
 		 */
 		if (resultRelInfo->ri_WithCheckOptions != NIL)
-			ExecWithCheckOptions(WCO_RLS_INSERT_CHECK,
-								 resultRelInfo, slot, estate);
+			ExecWithCheckOptions(wco_kind, resultRelInfo, slot, estate);
 
 		/*
 		 * No need though if the tuple has been routed, and a BR trigger
@@ -623,9 +626,33 @@ ExecInsert(ModifyTableState *mtstate,
 		setLastTid(&(tuple->t_self));
 	}
 
+	/*
+	 * If this INSERT is part of a partition-key-UPDATE and we are capturing
+	 * transition tuples, put this row into the transition NEW TABLE.
+	 * (Similarly, we need to add the deleted row to OLD TABLE.)  We need to do
+	 * this separately for DELETE and INSERT because they happen on different
+	 * tables.
+	 */
+	ar_insert_trig_tcs = mtstate->mt_transition_capture;
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_new_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo, NULL,
+							 NULL,
+							 tuple,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * We've already captured the NEW TABLE row, so make sure any AR INSERT
+		 * trigger fired below doesn't capture it again.
+		 */
+		ar_insert_trig_tcs = NULL;
+	}
+
 	/* AFTER ROW INSERT Triggers */
 	ExecARInsertTriggers(estate, resultRelInfo, tuple, recheckIndexes,
-						 mtstate->mt_transition_capture);
+						 ar_insert_trig_tcs);
 
 	list_free(recheckIndexes);
 
@@ -679,6 +706,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool *tupleDeleted,
+		   bool processReturning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -686,6 +715,10 @@ ExecDelete(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
+	TransitionCaptureState *ar_delete_trig_tcs;
+
+	if (tupleDeleted)
+		*tupleDeleted = false;
 
 	/*
 	 * get information on the (current) result relation
@@ -850,12 +883,40 @@ ldelete:;
 	if (canSetTag)
 		(estate->es_processed)++;
 
+	/* Tell the caller that the delete actually happened. */
+	if (tupleDeleted)
+		*tupleDeleted = true;
+
+	/*
+	 * If this DELETE is part of update tuple routing and we are capturing
+	 * transition tuples, put this row into the transition OLD TABLE.
+	 * We need to do this separately for DELETE and INSERT because they happen
+	 * on different tables.
+	 */
+	ar_delete_trig_tcs = mtstate->mt_transition_capture;
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_old_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo,
+							 tupleid,
+							 oldtuple,
+							 NULL,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * We've already captured the OLD TABLE row, so make sure any AR DELETE
+		 * trigger fired below doesn't capture it again.
+		 */
+		ar_delete_trig_tcs = NULL;
+	}
+
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
-						 mtstate->mt_transition_capture);
+						 ar_delete_trig_tcs);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (processReturning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -948,6 +1009,7 @@ ExecUpdate(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	TupleConversionMap *saved_tcs_map = NULL;
 
 	/*
 	 * abort the operation if not running transactions
@@ -1019,6 +1081,7 @@ ExecUpdate(ModifyTableState *mtstate,
 	else
 	{
 		LockTupleMode lockmode;
+		bool		partition_constraint_failed;
 
 		/*
 		 * Constraints might reference the tableoid column, so initialize
@@ -1034,22 +1097,142 @@ ExecUpdate(ModifyTableState *mtstate,
 		 * (We don't need to redo triggers, however.  If there are any BEFORE
 		 * triggers then trigger.c will have done heap_lock_tuple to lock the
 		 * correct tuple, so there's no need to do them again.)
-		 *
-		 * ExecWithCheckOptions() will skip any WCOs which are not of the kind
-		 * we are looking for at this point.
 		 */
 lreplace:;
-		if (resultRelInfo->ri_WithCheckOptions != NIL)
+
+		/*
+		 * If the partition constraint fails, this row might get moved to
+		 * another partition, in which case we should check the RLS CHECK
+		 * policy just before inserting into the new partition rather than
+		 * here, because a trigger on that partition might change the row
+		 * again.  So skip the WCO checks if the partition constraint fails.
+		 */
+		partition_constraint_failed =
+			resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate);
+
+		if (!partition_constraint_failed &&
+			resultRelInfo->ri_WithCheckOptions != NIL)
+		{
+			/*
+			 * ExecWithCheckOptions() will skip any WCOs which are not of the
+			 * kind we are looking for at this point.
+			 */
 			ExecWithCheckOptions(WCO_RLS_UPDATE_CHECK,
 								 resultRelInfo, slot, estate);
+		}
+
+		/*
+		 * If a partition check failed, try to move the row into the right
+		 * partition.
+		 */
+		if (partition_constraint_failed)
+		{
+			bool		tuple_deleted;
+			TupleTableSlot *ret_slot;
+			PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+			int			map_index;
+			TupleConversionMap *tupconv_map;
+
+			/*
+			 * When an UPDATE is run directly on a leaf partition, we will not
+			 * have partition tuple routing set up.  In that case, fail with a
+			 * partition constraint violation error.
+			 */
+			if (proute == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for the DELETE; we want to return
+			 * rows from the INSERT instead.
+			 */
+			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
+					   &tuple_deleted, false, false);
+
+			/*
+			 * If the DELETE didn't happen for some reason (e.g. a trigger
+			 * prevented it, it was already deleted by self, or it was
+			 * concurrently deleted by another transaction), we should skip
+			 * the insert
+			 * as well; otherwise, an UPDATE could cause an increase in the
+			 * total number of rows across all partitions, which is clearly
+			 * wrong.
+			 *
+			 * For a normal UPDATE, the case where the tuple has been the
+			 * subject of a concurrent UPDATE or DELETE would be handled by the
+			 * EvalPlanQual machinery, but for an UPDATE that we've translated
+			 * into a DELETE from this partition and an INSERT into some other
+			 * partition, that's not available, because CTID chains can't span
+			 * relation boundaries.  We mimic the semantics to a limited extent
+			 * by skipping the INSERT if the DELETE fails to find a tuple. This
+			 * ensures that two concurrent attempts to UPDATE the same tuple at
+			 * the same time can't turn one tuple into two, and that an UPDATE
+			 * of a just-deleted tuple can't resurrect it.
+			 */
+			if (!tuple_deleted)
+				return NULL;
+
+			/*
+			 * Updates set the transition capture map only when a new subplan
+			 * is chosen.  But for inserts, it is set for each row. So after
+			 * INSERT, we need to revert to the map created for UPDATE;
+			 * otherwise the next UPDATE will incorrectly use the one created
+			 * for INSERT.  So first save the one created for UPDATE.
+			 */
+			if (mtstate->mt_transition_capture)
+				saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
+
+			/*
+			 * resultRelInfo is one of the per-subplan resultRelInfos.  So we
+			 * should convert the tuple into the root's tuple descriptor,
+			 * since ExecInsert() starts the search from the root.  The tuple
+			 * conversion map array is in the order of mtstate->resultRelInfo[],
+			 * so to retrieve the one for this resultRel, we need to know the
+			 * position of the resultRel in mtstate->resultRelInfo[].
+			 */
+			map_index = resultRelInfo - mtstate->resultRelInfo;
+			Assert(map_index >= 0 && map_index < mtstate->mt_nplans);
+			tupconv_map = tupconv_map_for_subplan(mtstate, map_index);
+			tuple = ConvertPartitionTupleSlot(tupconv_map,
+											  tuple,
+											  proute->root_tuple_slot,
+											  &slot);
+
+			/*
+			 * For ExecInsert(), make it look like we are inserting into the
+			 * root.
+			 */
+			Assert(mtstate->rootResultRelInfo != NULL);
+			estate->es_result_relation_info = mtstate->rootResultRelInfo;
+
+			ret_slot = ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+
+			/*
+			 * Restore the active result relation and the active transition
+			 * capture map that we changed above.
+			 */
+			estate->es_result_relation_info = resultRelInfo;
+			if (mtstate->mt_transition_capture)
+			{
+				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+				mtstate->mt_transition_capture->tcs_map = saved_tcs_map;
+			}
+			return ret_slot;
+		}
 
 		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We've already checked the partition constraint above; however, we
+		 * must still ensure the tuple passes all other constraints, so we will
+		 * call ExecConstraints() and have it validate all remaining checks.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate, true);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1477,7 +1660,6 @@ static void
 ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 {
 	ResultRelInfo *targetRelInfo = getASTriggerResultRelInfo(mtstate);
-	int			i;
 
 	/* Check for transition tables on the directly targeted relation. */
 	mtstate->mt_transition_capture =
@@ -1500,62 +1682,149 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	if (mtstate->mt_transition_capture != NULL ||
 		mtstate->mt_oc_transition_capture != NULL)
 	{
-		int			numResultRelInfos;
-		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+		ExecSetupChildParentMap(mtstate,
+								(mtstate->mt_partition_tuple_routing != NULL));
+
+		/*
+		 * Install the conversion map for the first plan for UPDATE and DELETE
+		 * operations.  It will be advanced each time we switch to the next
+		 * plan.  (INSERT operations set it every time, so we need not update
+		 * mtstate->mt_oc_transition_capture here.)
+		 */
+		if (mtstate->mt_transition_capture && mtstate->operation != CMD_INSERT)
+			mtstate->mt_transition_capture->tcs_map =
+				tupconv_map_for_subplan(mtstate, 0);
+	}
+}
 
-		numResultRelInfos = (proute != NULL ?
-							 proute->num_partitions :
-							 mtstate->mt_nplans);
+/*
+ * Initialize the child-to-root tuple conversion map array.
+ *
+ * This map array is required for two purposes:
+ * 1. For update-tuple-routing. We need to convert the tuple from the subplan
+ *    result rel to the root partitioned table descriptor.
+ * 2. For capturing transition tuples when the target table is a partitioned
+ *    table. For updates, we need to convert the tuple from the subplan result
+ *    rel to the target table descriptor, and for inserts, we need to convert
+ *    the inserted tuple from the leaf partition to the target table
+ *    descriptor.
+ *
+ * The caller can request either a per-subplan map or per-leaf-partition map.
+ */
+static void
+ExecSetupChildParentMap(ModifyTableState *mtstate, bool perleaf)
+{
+	ResultRelInfo *rootRelInfo = getASTriggerResultRelInfo(mtstate);
+	PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+	TupleDesc	outdesc;
+	int			numResultRelInfos;
+	int			i;
 
+	/* First check if there is already one */
+	if (mtstate->mt_childparent_tupconv_maps)
+	{
 		/*
-		 * Build array of conversion maps from each child's TupleDesc to the
-		 * one used in the tuplestore.  The map pointers may be NULL when no
-		 * conversion is necessary, which is hopefully a common case for
-		 * partitions.
+		 * If a per-leaf map is requested, any existing map must already be
+		 * per-leaf: a per-subplan map cannot be accessed leaf-partition-wise.
+		 * The reverse does work, however: a per-leaf map can be accessed
+		 * subplan-wise through the subplan_partition_offsets array, via
+		 * tupconv_map_for_subplan().  Hence, callers that might need to
+		 * access the maps both leaf-partition-wise and subplan-wise must
+		 * ensure that the first call of this function passes perleaf=true,
+		 * so that the map created is per-leaf, not per-subplan.
 		 */
-		mtstate->mt_transition_tupconv_maps = (TupleConversionMap **)
-			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+		Assert(!(perleaf && !mtstate->mt_is_tupconv_perpart));
+		return;
+	}
 
-		/* Choose the right set of partitions */
-		if (proute != NULL)
-		{
-			/*
-			 * For tuple routing among partitions, we need TupleDescs based on
-			 * the partition routing table.
-			 */
-			ResultRelInfo **resultRelInfos = proute->partitions;
+	/* If perleaf is true, partition tuple routing info has to be present */
+	Assert(!perleaf || proute != NULL);
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
-		}
-		else
-		{
-			/* Otherwise we need the ResultRelInfo for each subplan. */
-			ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+	numResultRelInfos = (perleaf ? proute->num_partitions :
+								   mtstate->mt_nplans);
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
+	/* Get tuple descriptor of the root partitioned table. */
+	outdesc = RelationGetDescr(rootRelInfo->ri_RelationDesc);
+
+	/*
+	 * Build array of conversion maps from each child's TupleDesc to the
+	 * one used in the tuplestore.  The map pointers may be NULL when no
+	 * conversion is necessary, which is hopefully a common case for
+	 * partitions.
+	 */
+	mtstate->mt_childparent_tupconv_maps = (TupleConversionMap **)
+		palloc(sizeof(TupleConversionMap *) * numResultRelInfos);
+
+	/* Choose the right set of partitions */
+	if (perleaf)
+	{
+		/*
+		 * For tuple routing among partitions, we need TupleDescs based on the
+		 * partition routing table.
+		 */
+		ResultRelInfo **resultRelInfos;
+
+		resultRelInfos = proute->partitions;
+
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_childparent_tupconv_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
+									   outdesc,
+									   gettext_noop("could not convert row type"));
 		}
 
 		/*
-		 * Install the conversion map for the first plan for UPDATE and DELETE
-		 * operations.  It will be advanced each time we switch to the next
-		 * plan.  (INSERT operations set it every time, so we need not update
-		 * mtstate->mt_oc_transition_capture here.)
+		 * Remember that the tuple conversion map is per-leaf, not
+		 * per-subplan.
 		 */
-		if (mtstate->mt_transition_capture)
-			mtstate->mt_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[0];
+		mtstate->mt_is_tupconv_perpart = true;
+	}
+	else
+	{
+		/* Otherwise we need the ResultRelInfo for each subplan. */
+		ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+
+		for (i = 0; i < numResultRelInfos; ++i)
+		{
+			mtstate->mt_childparent_tupconv_maps[i] =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
+									   outdesc,
+									   gettext_noop("could not convert row type"));
+		}
+	}
+}
+
+/*
+ * For a given subplan index, get the tuple conversion map.
+ */
+static inline TupleConversionMap *
+tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
+{
+	Assert(mtstate->mt_childparent_tupconv_maps != NULL);
+
+	/*
+	 * If the tuple conversion map array is per-partition, we need to first get
+	 * the index into the partition array.
+	 */
+	if (mtstate->mt_is_tupconv_perpart)
+	{
+		int leaf_index;
+		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+
+		Assert(proute && proute->subplan_partition_offsets != NULL);
+		leaf_index = proute->subplan_partition_offsets[whichplan];
+
+		Assert(leaf_index >= 0 && leaf_index < proute->num_partitions);
+		return mtstate->mt_childparent_tupconv_maps[leaf_index];
+	}
+	else
+	{
+		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+		return mtstate->mt_childparent_tupconv_maps[whichplan];
 	}
 }
 
@@ -1662,15 +1931,13 @@ ExecModifyTable(PlanState *pstate)
 				/* Prepare to convert transition tuples from this child. */
 				if (node->mt_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
 					node->mt_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						tupconv_map_for_subplan(node, node->mt_whichplan);
 				}
 				if (node->mt_oc_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
 					node->mt_oc_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						tupconv_map_for_subplan(node, node->mt_whichplan);
 				}
 				continue;
 			}
@@ -1787,7 +2054,8 @@ ExecModifyTable(PlanState *pstate)
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1831,9 +2099,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ResultRelInfo *saved_resultRelInfo;
 	ResultRelInfo *resultRelInfo;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	bool		update_tuple_routing_needed = node->partColsUpdated;
 	PartitionTupleRouting *proute = NULL;
 	int			num_partitions = 0;
 
@@ -1908,6 +2179,16 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(resultRelInfo, mtstate->mt_onconflict != ONCONFLICT_NONE);
 
+		/*
+		 * If this is an UPDATE and a BEFORE UPDATE trigger is present, the
+		 * trigger itself might modify the partition-key values. So arrange for
+		 * tuple routing.
+		 */
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+			operation == CMD_UPDATE)
+			update_tuple_routing_needed = true;
+
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
 		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
@@ -1945,15 +2226,32 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/*
+	 * If it's not a partitioned table after all, UPDATE tuple routing should
+	 * not be attempted.
+	 */
+	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		update_tuple_routing_needed = false;
+
+	/*
+	 * Build state for tuple routing if it's an INSERT, or if it's an UPDATE
+	 * of the partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || update_tuple_routing_needed))
 	{
 		proute = mtstate->mt_partition_tuple_routing =
 			ExecSetupPartitionTupleRouting(mtstate,
 										   rel, node->nominalRelation,
 										   estate);
 		num_partitions = proute->num_partitions;
+
+		/*
+		 * These are required as reference objects for mapping partition
+		 * attno's in expressions such as WithCheckOptions and RETURNING.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	}
 
 	/*
@@ -1964,6 +2262,17 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		ExecSetupTransitionCaptureState(mtstate, estate);
 
 	/*
+	 * Construct mapping from each of the per-subplan partition attnos to the
+	 * root attno.  This is required when, during update row movement, the
+	 * tuple descriptor of a source partition does not match the root
+	 * partitioned table's descriptor.  In such a case we need to convert
+	 * tuples to the root tuple descriptor, because the search for the
+	 * destination partition starts from the root.  Skip this setup if it's
+	 * not a partition key update.
+	 */
+	if (update_tuple_routing_needed)
+		ExecSetupChildParentMap(mtstate, false);
+
+	/*
 	 * Initialize any WITH CHECK OPTION constraints if needed.
 	 */
 	resultRelInfo = mtstate->resultRelInfo;
@@ -1993,26 +2302,29 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE row
+	 * movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *first_wcoList;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition.  For UPDATE, however, there are as many WCO lists as
+		 * there are plans.  In either case, use the WCO expression of the first
+		 * resultRelInfo as a reference to calculate attno's for the WCO
+		 * expression of each of the partitions. We make a copy of the WCO
+		 * qual for each partition. Note that, if there are SubPlans in there,
+		 * they all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
-			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
+		Assert(update_tuple_routing_needed ||
+			   (operation == CMD_INSERT &&
+				list_length(node->withCheckOptionLists) == 1 &&
+				mtstate->mt_nplans == 1));
+
+		first_wcoList = linitial(node->withCheckOptionLists);
 		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
@@ -2021,17 +2333,26 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			ListCell   *ll;
 
 			resultRelInfo = proute->partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the update
+			 * result rels, that result rel would already have WithCheckOptions
+			 * initialized.
+			 */
+			if (resultRelInfo->ri_WithCheckOptions)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel, NULL);
+			mapped_wcoList = map_partition_varattnos(first_wcoList,
+													firstVarno,
+													partrel, firstResultRel,
+													NULL);
 			foreach(ll, mapped_wcoList)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
@@ -2048,7 +2369,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -2084,22 +2405,35 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *rlist;
 
 			resultRelInfo = proute->partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the update
+			 * result rels, that result rel would already have a returningList
+			 * built.
+			 */
+			if (resultRelInfo->ri_projectReturning)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel, NULL);
+			/*
+			 * Use the returning expression of the first resultRelInfo as a
+			 * reference to calculate attno's for the returning expression of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList,
+											firstVarno,
+											partrel, firstResultRel, NULL);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
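
As a sketch of the transition-table semantics implemented above (not part of the patch; table, function, and trigger names are invented): even though row movement is executed internally as a DELETE plus an INSERT on different partitions, the moved row is meant to be captured once in the UPDATE trigger's OLD TABLE and once in its NEW TABLE, rather than by any INSERT or DELETE transition capture:

```sql
CREATE TABLE tab (code text) PARTITION BY LIST (code);
CREATE TABLE tab_a PARTITION OF tab FOR VALUES IN ('a');
CREATE TABLE tab_b PARTITION OF tab FOR VALUES IN ('b');

CREATE FUNCTION report_moves() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    RAISE NOTICE 'old rows: %, new rows: %',
        (SELECT count(*) FROM old_rows),
        (SELECT count(*) FROM new_rows);
    RETURN NULL;
END $$;

CREATE TRIGGER tab_log AFTER UPDATE ON tab
    REFERENCING OLD TABLE AS old_rows NEW TABLE AS new_rows
    FOR EACH STATEMENT EXECUTE PROCEDURE report_moves();

INSERT INTO tab VALUES ('a');
-- Moves the row from tab_a to tab_b; the trigger should report one row in
-- each transition table.
UPDATE tab SET code = 'b';
```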
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index ddbbc79..747e545 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -204,6 +204,7 @@ _copyModifyTable(const ModifyTable *from)
 	COPY_SCALAR_FIELD(canSetTag);
 	COPY_SCALAR_FIELD(nominalRelation);
 	COPY_NODE_FIELD(partitioned_rels);
+	COPY_SCALAR_FIELD(partColsUpdated);
 	COPY_NODE_FIELD(resultRelations);
 	COPY_SCALAR_FIELD(resultRelIndex);
 	COPY_SCALAR_FIELD(rootResultRelIndex);
@@ -2263,6 +2264,7 @@ _copyPartitionedChildRelInfo(const PartitionedChildRelInfo *from)
 
 	COPY_SCALAR_FIELD(parent_relid);
 	COPY_NODE_FIELD(child_rels);
+	COPY_SCALAR_FIELD(part_cols_updated);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 30ccc9c..99b554a 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -908,6 +908,7 @@ _equalPartitionedChildRelInfo(const PartitionedChildRelInfo *a, const Partitione
 {
 	COMPARE_SCALAR_FIELD(parent_relid);
 	COMPARE_NODE_FIELD(child_rels);
+	COMPARE_SCALAR_FIELD(part_cols_updated);
 
 	return true;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 5e72df1..b35bce3 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -372,6 +372,7 @@ _outModifyTable(StringInfo str, const ModifyTable *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(partColsUpdated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_INT_FIELD(resultRelIndex);
 	WRITE_INT_FIELD(rootResultRelIndex);
@@ -2105,6 +2106,7 @@ _outModifyTablePath(StringInfo str, const ModifyTablePath *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(partColsUpdated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_NODE_FIELD(subroots);
@@ -2527,6 +2529,7 @@ _outPartitionedChildRelInfo(StringInfo str, const PartitionedChildRelInfo *node)
 
 	WRITE_UINT_FIELD(parent_relid);
 	WRITE_NODE_FIELD(child_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 9925866..22d8b9d 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1568,6 +1568,7 @@ _readModifyTable(void)
 	READ_BOOL_FIELD(canSetTag);
 	READ_UINT_FIELD(nominalRelation);
 	READ_NODE_FIELD(partitioned_rels);
+	READ_BOOL_FIELD(partColsUpdated);
 	READ_NODE_FIELD(resultRelations);
 	READ_INT_FIELD(resultRelIndex);
 	READ_INT_FIELD(rootResultRelIndex);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index c5304b7..fd1a583 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1364,7 +1364,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 			case RTE_RELATION:
 				if (rte->relkind == RELKIND_PARTITIONED_TABLE)
 					partitioned_rels =
-						get_partitioned_child_rels(root, rel->relid);
+						get_partitioned_child_rels(root, rel->relid, NULL);
 				break;
 			case RTE_SUBQUERY:
 				build_partitioned_rels = true;
@@ -1403,7 +1403,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		{
 			List	   *cprels;
 
-			cprels = get_partitioned_child_rels(root, childrel->relid);
+			cprels = get_partitioned_child_rels(root, childrel->relid, NULL);
 			partitioned_rels = list_concat(partitioned_rels,
 										   list_copy(cprels));
 		}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e599283..86e7e74 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -279,6 +279,7 @@ static ProjectSet *make_project_set(List *tlist, Plan *subplan);
 static ModifyTable *make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool partColsUpdated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
@@ -2373,6 +2374,7 @@ create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 							best_path->canSetTag,
 							best_path->nominalRelation,
 							best_path->partitioned_rels,
+							best_path->partColsUpdated,
 							best_path->resultRelations,
 							subplans,
 							best_path->withCheckOptionLists,
@@ -6442,6 +6444,7 @@ static ModifyTable *
 make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool partColsUpdated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam)
@@ -6468,6 +6471,7 @@ make_modifytable(PlannerInfo *root,
 	node->canSetTag = canSetTag;
 	node->nominalRelation = nominalRelation;
 	node->partitioned_rels = partitioned_rels;
+	node->partColsUpdated = partColsUpdated;
 	node->resultRelations = resultRelations;
 	node->resultRelIndex = -1;	/* will be set correctly in setrefs.c */
 	node->rootResultRelIndex = -1;	/* will be set correctly in setrefs.c */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 7b52dad..5387043 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1101,6 +1101,7 @@ inheritance_planner(PlannerInfo *root)
 	Query	   *parent_parse;
 	Bitmapset  *parent_relids = bms_make_singleton(top_parentRTindex);
 	PlannerInfo **parent_roots = NULL;
+	bool		partColsUpdated = false;
 
 	Assert(parse->commandType != CMD_INSERT);
 
@@ -1172,7 +1173,8 @@ inheritance_planner(PlannerInfo *root)
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		nominalRelation = top_parentRTindex;
-		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex);
+		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex,
+													  &partColsUpdated);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
 	}
@@ -1512,6 +1514,7 @@ inheritance_planner(PlannerInfo *root)
 									 parse->canSetTag,
 									 nominalRelation,
 									 partitioned_rels,
+									 partColsUpdated,
 									 resultRelations,
 									 subpaths,
 									 subroots,
@@ -2123,6 +2126,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 										parse->canSetTag,
 										parse->resultRelation,
 										NIL,
+										false,
 										list_make1_int(parse->resultRelation),
 										list_make1(path),
 										list_make1(root),
@@ -6155,17 +6159,24 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 /*
  * get_partitioned_child_rels
  *		Returns a list of the RT indexes of the partitioned child relations
- *		with rti as the root parent RT index.
+ *		with rti as the root parent RT index.  Also sets
+ *		*part_cols_updated to true if any of the root rte's updated
+ *		columns is used in the partition key of either the relation whose
+ *		RTI is specified or any of its child relations.
  *
  * Note: This function might get called even for range table entries that
  * are not partitioned tables; in such a case, it will simply return NIL.
  */
 List *
-get_partitioned_child_rels(PlannerInfo *root, Index rti)
+get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   bool *part_cols_updated)
 {
 	List	   *result = NIL;
 	ListCell   *l;
 
+	if (part_cols_updated)
+		*part_cols_updated = false;
+
 	foreach(l, root->pcinfo_list)
 	{
 		PartitionedChildRelInfo *pc = lfirst_node(PartitionedChildRelInfo, l);
@@ -6173,6 +6184,8 @@ get_partitioned_child_rels(PlannerInfo *root, Index rti)
 		if (pc->parent_relid == rti)
 		{
 			result = pc->child_rels;
+			if (part_cols_updated)
+				*part_cols_updated = pc->part_cols_updated;
 			break;
 		}
 	}
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 7ef391f..e6b1534 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -105,7 +105,8 @@ static void expand_partitioned_rtentry(PlannerInfo *root,
 						   RangeTblEntry *parentrte,
 						   Index parentRTindex, Relation parentrel,
 						   PlanRowMark *top_parentrc, LOCKMODE lockmode,
-						   List **appinfos, List **partitioned_child_rels);
+						   List **appinfos, List **partitioned_child_rels,
+						   bool *part_cols_updated);
 static void expand_single_inheritance_child(PlannerInfo *root,
 								RangeTblEntry *parentrte,
 								Index parentRTindex, Relation parentrel,
@@ -1461,16 +1462,19 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	if (RelationGetPartitionDesc(oldrelation) != NULL)
 	{
 		List	   *partitioned_child_rels = NIL;
+		bool		part_cols_updated = false;
 
 		Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
 
 		/*
 		 * If this table has partitions, recursively expand them in the order
-		 * in which they appear in the PartitionDesc.
+		 * in which they appear in the PartitionDesc.  While at it, also
+		 * note whether any partition key columns of the partitioned tables
+		 * are being updated.
 		 */
 		expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
 								   lockmode, &root->append_rel_list,
-								   &partitioned_child_rels);
+								   &partitioned_child_rels,
+								   &part_cols_updated);
 
 		/*
 		 * We keep a list of objects in root, each of which maps a root
@@ -1487,6 +1491,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 			pcinfo = makeNode(PartitionedChildRelInfo);
 			pcinfo->parent_relid = rti;
 			pcinfo->child_rels = partitioned_child_rels;
+			pcinfo->part_cols_updated = part_cols_updated;
 			root->pcinfo_list = lappend(root->pcinfo_list, pcinfo);
 		}
 	}
@@ -1563,7 +1568,8 @@ static void
 expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 						   Index parentRTindex, Relation parentrel,
 						   PlanRowMark *top_parentrc, LOCKMODE lockmode,
-						   List **appinfos, List **partitioned_child_rels)
+						   List **appinfos, List **partitioned_child_rels,
+						   bool *part_cols_updated)
 {
 	int			i;
 	RangeTblEntry *childrte;
@@ -1578,6 +1584,17 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 
 	Assert(parentrte->inh);
 
+	/*
+	 * Note down whether any partition key columns are being updated. Though
+	 * it's the root partitioned table's updatedCols we are interested in, we
+	 * instead use parentrte to get the updatedCols. This is convenient because
+	 * parentrte already has the root partrel's updatedCols translated to match
+	 * the attribute ordering of parentrel.
+	 */
+	if (!*part_cols_updated)
+		*part_cols_updated =
+			has_partition_attrs(parentrel, parentrte->updatedCols, NULL);
+
 	/* First expand the partitioned table itself. */
 	expand_single_inheritance_child(root, parentrte, parentRTindex, parentrel,
 									top_parentrc, parentrel,
@@ -1617,7 +1634,8 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 		if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 			expand_partitioned_rtentry(root, childrte, childRTindex,
 									   childrel, top_parentrc, lockmode,
-									   appinfos, partitioned_child_rels);
+									   appinfos, partitioned_child_rels,
+									   part_cols_updated);
 
 		/* Close child relation, but keep locks */
 		heap_close(childrel, NoLock);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 48b4db7..96ab100 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3274,6 +3274,8 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partitioned_rels' is an integer list of RT indexes of non-leaf tables in
  *		the partition tree, if this is an UPDATE/DELETE to a partitioned table.
  *		Otherwise NIL.
+ * 'partColsUpdated' is true if any partitioning columns are being updated,
+ *		either from the target relation or a descendant partitioned table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
  * 'subpaths' is a list of Path(s) producing source data (one per rel)
  * 'subroots' is a list of PlannerInfo structs (one per rel)
@@ -3287,6 +3289,7 @@ ModifyTablePath *
 create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool partColsUpdated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
@@ -3354,6 +3357,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->canSetTag = canSetTag;
 	pathnode->nominalRelation = nominalRelation;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
+	pathnode->partColsUpdated = partColsUpdated;
 	pathnode->resultRelations = resultRelations;
 	pathnode->subpaths = subpaths;
 	pathnode->subroots = subroots;
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index b5df357..0afa41e 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -67,6 +67,9 @@ typedef struct PartitionDispatchData *PartitionDispatch;
  *								convert input tuple based on the root table's
  *								rowtype to a leaf partition's rowtype after
  *								tuple routing is done)
+ * subplan_partition_offsets	int array ordered by UPDATE subplans.  Each
+ *								element of this array contains the index of
+ *								the corresponding partition in the
+ *								'partitions' array.
  * partition_tuple_slot			TupleTableSlot to be used to manipulate any
  *								given leaf partition's rowtype after that
  *								partition is chosen for insertion by
@@ -80,7 +83,9 @@ typedef struct PartitionTupleRouting
 	ResultRelInfo **partitions;
 	int			num_partitions;
 	TupleConversionMap **partition_tupconv_maps;
+	int		   *subplan_partition_offsets;
 	TupleTableSlot *partition_tuple_slot;
+	TupleTableSlot *root_tuple_slot;
 } PartitionTupleRouting;
 
 extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
@@ -90,6 +95,10 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern HeapTuple ConvertPartitionTupleSlot(TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot,
+						  TupleTableSlot **p_my_slot);
 extern void ExecCleanupTupleRouting(PartitionTupleRouting *proute);
 
 #endif							/* EXECPARTITION_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 4bb5cb1..8b5391d 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -991,8 +991,9 @@ typedef struct ModifyTableState
 	/* controls transition table population for specified operation */
 	struct TransitionCaptureState *mt_oc_transition_capture;
 	/* controls transition table population for INSERT...ON CONFLICT UPDATE */
-	TupleConversionMap **mt_transition_tupconv_maps;
-	/* Per plan/partition tuple conversion */
+	TupleConversionMap **mt_childparent_tupconv_maps;
+	/* Per plan/partition map for tuple conversion from child to root */
+	bool		mt_is_tupconv_perpart;	/* Is the above map per-partition? */
 } ModifyTableState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 74e9fb5..baf3c07 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -219,6 +219,7 @@ typedef struct ModifyTable
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		partColsUpdated;	/* some part key in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	int			resultRelIndex; /* index of first resultRel in plan's list */
 	int			rootResultRelIndex; /* index of the partitioned table root */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 71689b8..6bf68f3 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1674,6 +1674,7 @@ typedef struct ModifyTablePath
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		partColsUpdated;	/* some part key in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	List	   *subpaths;		/* Path(s) producing source data */
 	List	   *subroots;		/* per-target-table PlannerInfos */
@@ -2124,6 +2125,8 @@ typedef struct PartitionedChildRelInfo
 
 	Index		parent_relid;
 	List	   *child_rels;
+	bool		part_cols_updated;	/* is the partition key of any of
+									 * the partitioned tables updated? */
 } PartitionedChildRelInfo;
 
 /*
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 725694f..ef7173f 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -242,6 +242,7 @@ extern ModifyTablePath *create_modifytable_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool partColsUpdated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index 997b91f..29173d3 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -57,7 +57,8 @@ extern Expr *preprocess_phv_expression(PlannerInfo *root, Expr *expr);
 
 extern bool plan_cluster_use_sort(Oid tableOid, Oid indexOid);
 
-extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti);
+extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   bool *part_cols_updated);
 extern List *get_partitioned_child_rels_for_join(PlannerInfo *root,
 									Relids join_relids);
 
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index b69ceaa..ee7a75a 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,36 +198,462 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+---------------------------
+-- UPDATE with row movement
+---------------------------
+-- When a partitioned table receives an UPDATE to the partition key and the
+-- new values no longer meet the partition's bound, the row must be moved to
+-- the correct partition for the new partition key (if one exists). We must
+-- also ensure that updatable views on partitioned tables properly enforce any
+-- WITH CHECK OPTION that is defined. The situation with triggers in this case
+-- also requires thorough testing as partition key updates causing row
+-- movement convert UPDATEs into DELETE+INSERT.
+CREATE TABLE range_parted (
 	a text,
-	b int
-) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
-create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
+) PARTITION BY RANGE (a, b);
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update row movement works even when the leaf partitions are not in
+-- bound order.
+CREATE TABLE part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+ALTER TABLE range_parted ATTACH PARTITION part_b_20_b_30 FOR VALUES FROM ('b', 20) TO ('b', 30);
+CREATE TABLE part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) PARTITION BY RANGE (c);
+CREATE TABLE part_b_1_b_10 PARTITION OF range_parted FOR VALUES FROM ('b', 1) TO ('b', 10);
+ALTER TABLE range_parted ATTACH PARTITION part_b_10_b_20 FOR VALUES FROM ('b', 10) TO ('b', 20);
+CREATE TABLE part_a_10_a_20 PARTITION OF range_parted FOR VALUES FROM ('a', 10) TO ('a', 20);
+CREATE TABLE part_a_1_a_10 PARTITION OF range_parted FOR VALUES FROM ('a', 1) TO ('a', 10);
+-- Check that partition-key UPDATE works sanely on a partitioned table that
+-- does not have any child partitions.
+UPDATE part_b_10_b_20 set b = b - 6;
+-- Create some more partitions following the above pattern of descending bound
+-- order, but let's make the situation a bit more complex by having the
+-- attribute numbers of the columns vary from their parent partition.
+CREATE TABLE part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) PARTITION BY range (abs(d));
+ALTER TABLE part_c_100_200 DROP COLUMN e, DROP COLUMN c, DROP COLUMN a;
+ALTER TABLE part_c_100_200 ADD COLUMN c numeric, ADD COLUMN e varchar, ADD COLUMN a text;
+ALTER TABLE part_c_100_200 DROP COLUMN b;
+ALTER TABLE part_c_100_200 ADD COLUMN b bigint;
+CREATE TABLE part_d_1_15 PARTITION OF part_c_100_200 FOR VALUES FROM (1) TO (15);
+CREATE TABLE part_d_15_20 PARTITION OF part_c_100_200 FOR VALUES FROM (15) TO (20);
+ALTER TABLE part_b_10_b_20 ATTACH PARTITION part_c_100_200 FOR VALUES FROM (100) TO (200);
+CREATE TABLE part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+ALTER TABLE part_b_10_b_20 ATTACH PARTITION part_c_1_100 FOR VALUES FROM (1) TO (100);
+\set init_range_parted 'truncate range_parted; insert into range_parted VALUES (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted ORDER BY 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+(6 rows)
+
+-- The order of subplans should be in bound order
+EXPLAIN (costs off) UPDATE range_parted set c = c - 50 WHERE c > 97;
+             QUERY PLAN              
+-------------------------------------
+ Update on range_parted
+   Update on part_a_1_a_10
+   Update on part_a_10_a_20
+   Update on part_b_1_b_10
+   Update on part_c_1_100
+   Update on part_d_1_15
+   Update on part_d_15_20
+   Update on part_b_20_b_30
+   ->  Seq Scan on part_a_1_a_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_a_10_a_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_1_b_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_1_100
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_1_15
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_15_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_20_b_30
+         Filter: (c > '97'::numeric)
+(22 rows)
+
+-- fail, row movement happens only within the partition subtree.
+UPDATE part_c_100_200 set c = c - 20, d = c WHERE c = 105;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (105, 85, null, b, 15).
+-- fail, no partition key update, so no attempt to move tuple,
+-- but "a = 'a'" violates the partition constraint enforced by the root partition
+UPDATE part_b_10_b_20 set a = 'a';
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (null, 1, 96, 12, a).
+-- ok, partition key update, no constraint violation
+UPDATE range_parted set d = d - 10 WHERE d > 10;
+-- ok, no partition key update, no constraint violation
+UPDATE range_parted set e = d;
+-- No row found
+UPDATE part_c_1_100 set c = c + 20 WHERE c = 98;
+-- ok, row movement
+UPDATE part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d | e 
+----------------+---+----+-----+---+---
+ part_a_10_a_20 | a | 10 | 200 | 1 | 1
+ part_a_1_a_10  | a |  1 |   1 | 1 | 1
+ part_d_1_15    | b | 12 | 116 | 1 | 1
+ part_d_1_15    | b | 13 | 117 | 2 | 2
+ part_d_1_15    | b | 15 | 125 | 6 | 6
+ part_d_1_15    | b | 17 | 125 | 9 | 9
+(6 rows)
+
+-- fail, row movement happens only within the partition subtree.
+UPDATE part_b_10_b_20 set b = b - 6 WHERE c > 116 returning *;
+ERROR:  new row for relation "part_d_1_15" violates partition constraint
+DETAIL:  Failing row contains (2, 117, 2, b, 7).
+-- ok, row movement, with subset of rows moved into different partition.
+UPDATE range_parted set b = b - 6 WHERE c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_a_1_a_10 | a |  4 | 200 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+(6 rows)
+
+-- Common table needed for multiple test scenarios.
+CREATE TABLE mintab(c1 int);
+INSERT into mintab VALUES (120);
+-- update partition key using updatable view.
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 FROM mintab) WITH CHECK OPTION;
 -- ok
-update range_parted set b = b + 1 where b = 10;
+UPDATE upview set c = 199 WHERE b = 4;
+-- fail, check option violation
+UPDATE upview set c = 120 WHERE b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120, 1, 1).
+-- fail, row movement with check option violation
+UPDATE upview set a = 'b', b = 15, c = 120 WHERE b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120, 1, 1).
+-- ok, row movement, check option passes
+UPDATE upview set a = 'b', b = 15 WHERE b = 4;
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+ part_d_1_15   | b | 15 | 199 | 1 | 1
+(6 rows)
+
+-- cleanup
+DROP VIEW upview;
+-- RETURNING with whole-row vars.
+----------------------------------
+:init_range_parted;
+UPDATE range_parted set c = 95 WHERE a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+ range_parted  | a | b  | c  | d  | e 
+---------------+---+----+----+----+---
+ (b,15,95,16,) | b | 15 | 95 | 16 | 
+ (b,17,95,19,) | b | 17 | 95 | 19 | 
+(2 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  95 | 16 | 
+ part_c_1_100   | b | 17 |  95 | 19 | 
+(6 rows)
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+CREATE FUNCTION trans_updatetrigfunc() RETURNS trigger LANGUAGE plpgsql AS
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' ORDER BY a) FROM old_table),
+                 (select string_agg(new_table::text, ', ' ORDER BY a) FROM new_table);
+    return null;
+  end;
+$$;
+CREATE TRIGGER trans_updatetrig
+  AFTER UPDATE ON range_parted REFERENCING OLD TABLE AS old_table NEW TABLE AS new_table
+  FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+UPDATE range_parted set c = (case when c = 96 then 110 else c + 1 end ) WHERE a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,110,1,), (b,13,98,2,), (b,15,106,16,), (b,17,106,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 13 |  98 |  2 | 
+ part_d_15_20   | b | 15 | 106 | 16 | 
+ part_d_15_20   | b | 17 | 106 | 19 | 
+ part_d_1_15    | b | 12 | 110 |  1 | 
+(6 rows)
+
+:init_range_parted;
+-- Enabling OLD TABLE capture for both DELETE and UPDATE statement triggers
+-- should not cause DELETEd rows to be captured twice.  Similarly for
+-- INSERT triggers and inserted rows.
+CREATE TRIGGER trans_deletetrig
+  AFTER DELETE ON range_parted REFERENCING OLD TABLE AS old_table
+  FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+CREATE TRIGGER trans_inserttrig
+  AFTER INSERT ON range_parted REFERENCING NEW TABLE AS new_table
+  FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+UPDATE range_parted set c = c + 50 WHERE a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,146,1,), (b,13,147,2,), (b,15,155,16,), (b,17,155,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_15_20   | b | 15 | 155 | 16 | 
+ part_d_15_20   | b | 17 | 155 | 19 | 
+ part_d_1_15    | b | 12 | 146 |  1 | 
+ part_d_1_15    | b | 13 | 147 |  2 | 
+(6 rows)
+
+DROP TRIGGER trans_deletetrig ON range_parted;
+DROP TRIGGER trans_inserttrig ON range_parted;
+-- Don't drop trans_updatetrig yet. It is required below.
+-- Test with transition tuple conversion happening for rows moved into the
+-- new partition. This requires a trigger that references a transition table
+-- (we already have trans_updatetrig). For inserted rows, the conversion is
+-- usually not needed, because the original tuple is already compatible with
+-- the desired transition tuple format. But conversion does happen when there
+-- is a BR trigger, because the trigger can change the inserted row. So we
+-- install BR triggers on those child partitions into which rows are moved
+-- as part of update row movement.
+CREATE FUNCTION func_parted_mod_b() RETURNS trigger AS $$
+BEGIN
+   NEW.b = NEW.b + 1;
+   return NEW;
+END $$ language plpgsql;
+CREATE TRIGGER trig_c1_100 BEFORE UPDATE OR INSERT ON part_c_1_100
+   FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+CREATE TRIGGER trig_d1_15 BEFORE UPDATE OR INSERT ON part_d_1_15
+   FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+CREATE TRIGGER trig_d15_20 BEFORE UPDATE OR INSERT ON part_d_15_20
+   FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+:init_range_parted;
+UPDATE range_parted set c = (case when c = 96 then 110 else c + 1 end) WHERE a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,13,96,1,), (b,14,97,2,), (b,16,105,16,), (b,18,105,19,), new table = (b,15,110,1,), (b,15,98,2,), (b,17,106,16,), (b,19,106,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 15 |  98 |  2 | 
+ part_d_15_20   | b | 17 | 106 | 16 | 
+ part_d_15_20   | b | 19 | 106 | 19 | 
+ part_d_1_15    | b | 15 | 110 |  1 | 
+(6 rows)
+
+:init_range_parted;
+UPDATE range_parted set c = c + 50 WHERE a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,13,96,1,), (b,14,97,2,), (b,16,105,16,), (b,18,105,19,), new table = (b,15,146,1,), (b,16,147,2,), (b,17,155,16,), (b,19,155,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_15_20   | b | 17 | 155 | 16 | 
+ part_d_15_20   | b | 19 | 155 | 19 | 
+ part_d_1_15    | b | 15 | 146 |  1 | 
+ part_d_1_15    | b | 16 | 147 |  2 | 
+(6 rows)
+
+DROP TRIGGER trans_updatetrig ON range_parted;
+DROP TRIGGER trig_c1_100 ON part_c_1_100;
+DROP TRIGGER trig_d1_15 ON part_d_1_15;
+DROP TRIGGER trig_d15_20 ON part_d_15_20;
+DROP FUNCTION func_parted_mod_b();
+-- RLS policies with update-row-movement
+-----------------------------------------
+ALTER TABLE range_parted ENABLE ROW LEVEL SECURITY;
+CREATE USER regress_range_parted_user;
+GRANT ALL ON range_parted, mintab TO regress_range_parted_user;
+CREATE POLICY seeall ON range_parted AS PERMISSIVE FOR SELECT USING (true);
+CREATE POLICY policy_range_parted ON range_parted for UPDATE USING (true) WITH CHECK (c % 2 = 0);
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- This should fail with RLS violation error while moving row from
+-- part_a_10_a_20 to part_d_1_15, because we are setting 'c' to an odd number.
+UPDATE range_parted set a = 'b', c = 151 WHERE a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy for table "range_parted"
+RESET SESSION AUTHORIZATION;
+-- Create a trigger on part_d_1_15
+CREATE FUNCTION func_d_1_15() RETURNS trigger AS $$
+BEGIN
+   NEW.c = NEW.c + 1; -- Make even numbers odd, or vice versa
+   return NEW;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_d_1_15 BEFORE INSERT ON part_d_1_15
+   FOR EACH ROW EXECUTE PROCEDURE func_d_1_15();
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- Here, RLS checks should succeed while moving row from part_a_10_a_20 to
+-- part_d_1_15. Even though the UPDATE is setting 'c' to an odd number, the
+-- trigger at the destination partition makes it an even number again.
+UPDATE range_parted set a = 'b', c = 151 WHERE a = 'a' and c = 200;
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- This should fail with RLS violation error. Even though the UPDATE is setting
+-- 'c' to an even number, the trigger at the destination partition makes it
+-- an odd number.
+UPDATE range_parted set a = 'b', c = 150 WHERE a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy for table "range_parted"
+-- Cleanup
+RESET SESSION AUTHORIZATION;
+DROP TRIGGER trig_d_1_15 ON part_d_1_15;
+DROP FUNCTION func_d_1_15();
+-- Policy expression contains SubPlan
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+CREATE POLICY policy_range_parted_subplan on range_parted AS RESTRICTIVE for UPDATE USING (true)
+    WITH CHECK ((SELECT range_parted.c <= c1 FROM mintab ));
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- fail, mintab has row with c1 = 120
+UPDATE range_parted set a = 'b', c = 122 WHERE a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy "policy_range_parted_subplan" for table "range_parted"
+-- ok
+UPDATE range_parted set a = 'b', c = 120 WHERE a = 'a' and c = 200;
+-- RLS policy expression contains whole row.
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+CREATE POLICY policy_range_parted_wholerow on range_parted AS RESTRICTIVE for UPDATE USING (true)
+   WITH CHECK (range_parted = row('b', 10, 112, 1, NULL)::range_parted);
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- ok, should pass the RLS check
+UPDATE range_parted set a = 'b', c = 112 WHERE a = 'a' and c = 200;
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- fail, the whole row RLS check should fail
+UPDATE range_parted set a = 'b', c = 116 WHERE a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy "policy_range_parted_wholerow" for table "range_parted"
+-- Cleanup
+RESET SESSION AUTHORIZATION;
+DROP POLICY policy_range_parted ON range_parted;
+DROP POLICY policy_range_parted_subplan ON range_parted;
+DROP POLICY policy_range_parted_wholerow ON range_parted;
+REVOKE ALL ON range_parted, mintab FROM regress_range_parted_user;
+DROP USER regress_range_parted_user;
+DROP TABLE mintab;
+-- statement triggers with update row movement
+---------------------------------------------------
+:init_range_parted;
+CREATE FUNCTION trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+CREATE TRIGGER parent_delete_trig
+  AFTER DELETE ON range_parted for each statement execute procedure trigfunc();
+CREATE TRIGGER parent_update_trig
+  AFTER UPDATE ON range_parted for each statement execute procedure trigfunc();
+CREATE TRIGGER parent_insert_trig
+  AFTER INSERT ON range_parted for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_1_100
+CREATE TRIGGER c1_delete_trig
+  AFTER DELETE ON part_c_1_100 for each statement execute procedure trigfunc();
+CREATE TRIGGER c1_update_trig
+  AFTER UPDATE ON part_c_1_100 for each statement execute procedure trigfunc();
+CREATE TRIGGER c1_insert_trig
+  AFTER INSERT ON part_c_1_100 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_1_15
+CREATE TRIGGER d1_delete_trig
+  AFTER DELETE ON part_d_1_15 for each statement execute procedure trigfunc();
+CREATE TRIGGER d1_update_trig
+  AFTER UPDATE ON part_d_1_15 for each statement execute procedure trigfunc();
+CREATE TRIGGER d1_insert_trig
+  AFTER INSERT ON part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+CREATE TRIGGER d15_delete_trig
+  AFTER DELETE ON part_d_15_20 for each statement execute procedure trigfunc();
+CREATE TRIGGER d15_update_trig
+  AFTER UPDATE ON part_d_15_20 for each statement execute procedure trigfunc();
+CREATE TRIGGER d15_insert_trig
+  AFTER INSERT ON part_d_15_20 for each statement execute procedure trigfunc();
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or
+-- insert statement triggers should be fired.
+UPDATE range_parted set c = c - 50 WHERE c > 97;
+NOTICE:  trigger = parent_update_trig fired on table range_parted during UPDATE
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 150 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  55 | 16 | 
+ part_c_1_100   | b | 17 |  55 | 19 | 
+(6 rows)
+
+DROP TRIGGER parent_delete_trig ON range_parted;
+DROP TRIGGER parent_update_trig ON range_parted;
+DROP TRIGGER parent_insert_trig ON range_parted;
+DROP TRIGGER c1_delete_trig ON part_c_1_100;
+DROP TRIGGER c1_update_trig ON part_c_1_100;
+DROP TRIGGER c1_insert_trig ON part_c_1_100;
+DROP TRIGGER d1_delete_trig ON part_d_1_15;
+DROP TRIGGER d1_update_trig ON part_d_1_15;
+DROP TRIGGER d1_insert_trig ON part_d_1_15;
+DROP TRIGGER d15_delete_trig ON part_d_15_20;
+DROP TRIGGER d15_update_trig ON part_d_15_20;
+DROP TRIGGER d15_insert_trig ON part_d_15_20;
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
-                                  Table "public.part_def"
- Column |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description 
---------+---------+-----------+----------+---------+----------+--------------+-------------
- a      | text    |           |          |         | extended |              | 
- b      | integer |           |          |         | plain    |              | 
+                                       Table "public.part_def"
+ Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description 
+--------+-------------------+-----------+----------+---------+----------+--------------+-------------
+ a      | text              |           |          |         | extended |              | 
+ b      | bigint            |           |          |         | plain    |              | 
+ c      | numeric           |           |          |         | main     |              | 
+ d      | integer           |           |          |         | plain    |              | 
+ e      | character varying |           |          |         | extended |              | 
 Partition of: range_parted DEFAULT
-Partition constraint: (NOT ((a IS NOT NULL) AND (b IS NOT NULL) AND (((a = 'a'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'a'::text) AND (b >= 10) AND (b < 20)) OR ((a = 'b'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'b'::text) AND (b >= 10) AND (b < 20)))))
+Partition constraint: (NOT ((a IS NOT NULL) AND (b IS NOT NULL) AND (((a = 'a'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'a'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'b'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '20'::bigint) AND (b < '30'::bigint)))))
 
 insert into range_parted values ('c', 9);
 -- ok
@@ -235,21 +661,192 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 ERROR:  new row for relation "part_def" violates partition constraint
-DETAIL:  Failing row contains (a, 9).
-create table list_parted (
+DETAIL:  Failing row contains (a, 9, null, null, null).
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from non-default to default partition.
+-- fail, default partition is not under part_a_10_a_20;
+UPDATE part_a_10_a_20 set a = 'ad' WHERE a = 'a';
+ERROR:  new row for relation "part_a_10_a_20" violates partition constraint
+DETAIL:  Failing row contains (ad, 10, 200, 1, null).
+-- ok
+UPDATE range_parted set a = 'ad' WHERE a = 'a';
+UPDATE range_parted set a = 'bd' WHERE a = 'b';
+:show_data;
+ partname | a  | b  |  c  | d  | e 
+----------+----+----+-----+----+---
+ part_def | ad |  1 |   1 |  1 | 
+ part_def | ad | 10 | 200 |  1 | 
+ part_def | bd | 12 |  96 |  1 | 
+ part_def | bd | 13 |  97 |  2 | 
+ part_def | bd | 15 | 105 | 16 | 
+ part_def | bd | 17 | 105 | 19 | 
+ part_def | d  |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from default to non-default partitions.
+-- ok
+UPDATE range_parted set a = 'a' WHERE a = 'ad';
+UPDATE range_parted set a = 'b' WHERE a = 'bd';
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
+-- Cleanup: range_parted no longer needed.
+DROP TABLE range_parted;
+CREATE TABLE list_parted (
 	a text,
 	b int
-) partition by list (a);
-create table list_part1  partition of list_parted for values in ('a', 'b');
-create table list_default partition of list_parted default;
-insert into list_part1 values ('a', 1);
-insert into list_default values ('d', 10);
+) PARTITION BY list (a);
+CREATE TABLE list_part1  PARTITION OF list_parted for VALUES in ('a', 'b');
+CREATE TABLE list_default PARTITION OF list_parted default;
+INSERT into list_part1 VALUES ('a', 1);
+INSERT into list_default VALUES ('d', 10);
 -- fail
-update list_default set a = 'a' where a = 'd';
+UPDATE list_default set a = 'a' WHERE a = 'd';
 ERROR:  new row for relation "list_default" violates partition constraint
 DETAIL:  Failing row contains (a, 10).
 -- ok
-update list_default set a = 'x' where a = 'd';
+UPDATE list_default set a = 'x' WHERE a = 'd';
+DROP TABLE list_parted;
+--------------
+-- Some more update-partition-key test scenarios below. This time use list
+-- partitions.
+--------------
+-- Setup for list partitions
+CREATE TABLE list_parted (a numeric, b int, c int8) PARTITION BY list (a);
+CREATE TABLE sub_parted PARTITION OF list_parted for VALUES in (1) PARTITION BY list (b);
+CREATE TABLE sub_part1(b int, c int8, a numeric);
+ALTER TABLE sub_parted ATTACH PARTITION sub_part1 for VALUES in (1);
+CREATE TABLE sub_part2(b int, c int8, a numeric);
+ALTER TABLE sub_parted ATTACH PARTITION sub_part2 for VALUES in (2);
+CREATE TABLE list_part1(a numeric, b int, c int8);
+ALTER TABLE list_parted ATTACH PARTITION list_part1 for VALUES in (2,3);
+INSERT into list_parted VALUES (2,5,50);
+INSERT into list_parted VALUES (3,6,60);
+INSERT into sub_parted VALUES (1,1,60);
+INSERT into sub_parted VALUES (1,2,10);
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+UPDATE sub_parted set a = 2 WHERE c = 10;
+ERROR:  new row for relation "sub_part2" violates partition constraint
+DETAIL:  Failing row contains (2, 10, 2).
+-- Test update-partition-key, where the unpruned partitions do not have their
+-- partition keys updated.
+SELECT tableoid::regclass::text , * FROM list_parted WHERE a = 2 ORDER BY 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+UPDATE list_parted set b = c + a WHERE a = 2;
+SELECT tableoid::regclass::text , * FROM list_parted WHERE a = 2 ORDER BY 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Tests for BR UPDATE triggers changing the partition key.
+-----------
+CREATE FUNCTION func_parted_mod_b() returns trigger as $$
+BEGIN
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1.
+UPDATE list_parted set c = 70 WHERE b  = 1;
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+DROP TRIGGER parted_mod_b ON sub_part1;
+-- If a BR DELETE trigger prevents the DELETE from happening, we should also
+-- skip the INSERT when that delete is part of UPDATE => DELETE+INSERT.
+CREATE OR REPLACE FUNCTION func_parted_mod_b() returns trigger as $$
+BEGIN
+   raise notice 'Trigger: Got OLD row %, but returning NULL', OLD;
+   return NULL;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_skip_delete before delete on sub_part2
+   for each row execute procedure func_parted_mod_b();
+UPDATE list_parted set b = 1 WHERE c = 70;
+NOTICE:  Trigger: Got OLD row (2,70,1), but returning NULL
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+-- Drop the trigger. Now the row should be moved.
+DROP TRIGGER trig_skip_delete ON sub_part2;
+UPDATE list_parted set b = 1 WHERE c = 70;
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+DROP FUNCTION func_parted_mod_b();
+-- UPDATE partition-key with FROM clause. If the join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+CREATE TABLE non_parted (id int);
+INSERT into non_parted VALUES (1), (1), (1), (2), (2), (2), (3), (3), (3);
+UPDATE list_parted t1 set a = 2 FROM non_parted t2 WHERE t1.a = t2.id and a = 1;
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 |  1 | 70
+ list_part1 | 2 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+DROP TABLE non_parted;
+-- Cleanup: list_parted no longer needed.
+DROP TABLE list_parted;
 -- create custom operator class and hash function, for the same reason
 -- explained in alter_table.sql
 create or replace function dummy_hashint4(a int4, seed int8) returns int8 as
@@ -271,14 +868,11 @@ insert into hpart4 values (3, 4);
 update hpart1 set a = 3, b=4 where a = 1;
 ERROR:  new row for relation "hpart1" violates partition constraint
 DETAIL:  Failing row contains (3, 4).
+-- ok, row movement
 update hash_parted set b = b - 1 where b = 1;
-ERROR:  new row for relation "hpart1" violates partition constraint
-DETAIL:  Failing row contains (1, 0).
 -- ok
 update hash_parted set b = b + 8 where b = 1;
 -- cleanup
-drop table range_parted;
-drop table list_parted;
 drop table hash_parted;
 drop operator class custom_opclass using hash;
 drop function dummy_hashint4(a int4, seed int8);
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 0c70d64..f316446 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,25 +107,330 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+
+---------------------------
+-- UPDATE with row movement
+---------------------------
+
+-- When a partitioned table receives an UPDATE to the partition key and the
+-- new values no longer meet the partition's bound, the row must be moved to
+-- the correct partition for the new partition key (if one exists). We must
+-- also ensure that updatable views on partitioned tables properly enforce any
+-- WITH CHECK OPTION that is defined. The situation with triggers in this case
+-- also requires thorough testing, as partition key updates causing row
+-- movement convert UPDATEs into DELETE+INSERT.
+
+CREATE TABLE range_parted (
 	a text,
-	b int
-) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
-create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
+) PARTITION BY RANGE (a, b);
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+CREATE TABLE part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+ALTER TABLE range_parted ATTACH PARTITION part_b_20_b_30 FOR VALUES FROM ('b', 20) TO ('b', 30);
+CREATE TABLE part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) PARTITION BY RANGE (c);
+CREATE TABLE part_b_1_b_10 PARTITION OF range_parted FOR VALUES FROM ('b', 1) TO ('b', 10);
+ALTER TABLE range_parted ATTACH PARTITION part_b_10_b_20 FOR VALUES FROM ('b', 10) TO ('b', 20);
+CREATE TABLE part_a_10_a_20 PARTITION OF range_parted FOR VALUES FROM ('a', 10) TO ('a', 20);
+CREATE TABLE part_a_1_a_10 PARTITION OF range_parted FOR VALUES FROM ('a', 1) TO ('a', 10);
+
+-- Check that partition-key UPDATE works sanely on a partitioned table that
+-- does not have any child partitions.
+UPDATE part_b_10_b_20 set b = b - 6;
+
+-- Create some more partitions following the above pattern of descending bound
+-- order, but make the situation a bit more complex by having the attribute
+-- numbers of the columns differ from those of the parent partition.
+CREATE TABLE part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) PARTITION BY range (abs(d));
+ALTER TABLE part_c_100_200 DROP COLUMN e, DROP COLUMN c, DROP COLUMN a;
+ALTER TABLE part_c_100_200 ADD COLUMN c numeric, ADD COLUMN e varchar, ADD COLUMN a text;
+ALTER TABLE part_c_100_200 DROP COLUMN b;
+ALTER TABLE part_c_100_200 ADD COLUMN b bigint;
+CREATE TABLE part_d_1_15 PARTITION OF part_c_100_200 FOR VALUES FROM (1) TO (15);
+CREATE TABLE part_d_15_20 PARTITION OF part_c_100_200 FOR VALUES FROM (15) TO (20);
+
+ALTER TABLE part_b_10_b_20 ATTACH PARTITION part_c_100_200 FOR VALUES FROM (100) TO (200);
+
+CREATE TABLE part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+ALTER TABLE part_b_10_b_20 ATTACH PARTITION part_c_1_100 FOR VALUES FROM (1) TO (100);
+
+\set init_range_parted 'truncate range_parted; insert into range_parted VALUES (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted ORDER BY 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+
+-- The subplans should appear in partition bound order
+EXPLAIN (costs off) UPDATE range_parted set c = c - 50 WHERE c > 97;
+
+-- fail, row movement happens only within the partition subtree.
+UPDATE part_c_100_200 set c = c - 20, d = c WHERE c = 105;
+-- fail, no partition key update, so no attempt to move tuple,
+-- but "a = 'a'" violates partition constraint enforced by root partition
+UPDATE part_b_10_b_20 set a = 'a';
+-- ok, partition key update, no constraint violation
+UPDATE range_parted set d = d - 10 WHERE d > 10;
+-- ok, no partition key update, no constraint violation
+UPDATE range_parted set e = d;
+-- No row found
+UPDATE part_c_1_100 set c = c + 20 WHERE c = 98;
+-- ok, row movement
+UPDATE part_b_10_b_20 set c = c + 20 returning c, b, a;
+:show_data;
+
+-- fail, row movement happens only within the partition subtree.
+UPDATE part_b_10_b_20 set b = b - 6 WHERE c > 116 returning *;
+-- ok, row movement, with subset of rows moved into different partition.
+UPDATE range_parted set b = b - 6 WHERE c > 116 returning a, b + c;
+
+:show_data;
+
+-- Common table needed for multiple test scenarios.
+CREATE TABLE mintab(c1 int);
+INSERT into mintab VALUES (120);
+
+-- update partition key using updatable view.
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 FROM mintab) WITH CHECK OPTION;
+-- ok
+UPDATE upview set c = 199 WHERE b = 4;
+-- fail, check option violation
+UPDATE upview set c = 120 WHERE b = 4;
+-- fail, row movement with check option violation
+UPDATE upview set a = 'b', b = 15, c = 120 WHERE b = 4;
+-- ok, row movement, check option passes
+UPDATE upview set a = 'b', b = 15 WHERE b = 4;
+
+:show_data;
+
+-- cleanup
+DROP VIEW upview;
+
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+UPDATE range_parted set c = 95 WHERE a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+:show_data;
+
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+
+CREATE FUNCTION trans_updatetrigfunc() RETURNS trigger LANGUAGE plpgsql AS
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' ORDER BY a) FROM old_table),
+                 (select string_agg(new_table::text, ', ' ORDER BY a) FROM new_table);
+    return null;
+  end;
+$$;
+
+CREATE TRIGGER trans_updatetrig
+  AFTER UPDATE ON range_parted REFERENCING OLD TABLE AS old_table NEW TABLE AS new_table
+  FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+
+UPDATE range_parted set c = (case when c = 96 then 110 else c + 1 end ) WHERE a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+
+-- Enabling OLD TABLE capture for both DELETE and UPDATE statement triggers
+-- should not cause deleted rows to be captured twice. The same applies to
+-- INSERT triggers and inserted rows.
+CREATE TRIGGER trans_deletetrig
+  AFTER DELETE ON range_parted REFERENCING OLD TABLE AS old_table
+  FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+CREATE TRIGGER trans_inserttrig
+  AFTER INSERT ON range_parted REFERENCING NEW TABLE AS new_table
+  FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+UPDATE range_parted set c = c + 50 WHERE a = 'b' and b > 10 and c >= 96;
+:show_data;
+DROP TRIGGER trans_deletetrig ON range_parted;
+DROP TRIGGER trans_inserttrig ON range_parted;
+-- Don't drop trans_updatetrig yet. It is required below.
+
+-- Test with transition tuple conversion happening for rows moved into the
+-- new partition. This requires a trigger that references the transition table
+-- (we already have trans_updatetrig). Usually, conversion is not needed for
+-- inserted rows, because the original tuple is already compatible with the
+-- desired transition tuple format. But conversion does happen when there is a
+-- BR trigger, because the trigger can change the inserted row. So install BR
+-- triggers on those child partitions into which rows are moved as part of
+-- update-row-movement.
+CREATE FUNCTION func_parted_mod_b() RETURNS trigger AS $$
+BEGIN
+   NEW.b = NEW.b + 1;
+   return NEW;
+END $$ language plpgsql;
+CREATE TRIGGER trig_c1_100 BEFORE UPDATE OR INSERT ON part_c_1_100
+   FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+CREATE TRIGGER trig_d1_15 BEFORE UPDATE OR INSERT ON part_d_1_15
+   FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+CREATE TRIGGER trig_d15_20 BEFORE UPDATE OR INSERT ON part_d_15_20
+   FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+:init_range_parted;
+UPDATE range_parted set c = (case when c = 96 then 110 else c + 1 end) WHERE a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+UPDATE range_parted set c = c + 50 WHERE a = 'b' and b > 10 and c >= 96;
+:show_data;
+DROP TRIGGER trans_updatetrig ON range_parted;
+DROP TRIGGER trig_c1_100 ON part_c_1_100;
+DROP TRIGGER trig_d1_15 ON part_d_1_15;
+DROP TRIGGER trig_d15_20 ON part_d_15_20;
+DROP FUNCTION func_parted_mod_b();
+
+-- RLS policies with update-row-movement
+-----------------------------------------
+
+ALTER TABLE range_parted ENABLE ROW LEVEL SECURITY;
+CREATE USER regress_range_parted_user;
+GRANT ALL ON range_parted, mintab TO regress_range_parted_user;
+CREATE POLICY seeall ON range_parted AS PERMISSIVE FOR SELECT USING (true);
+CREATE POLICY policy_range_parted ON range_parted for UPDATE USING (true) WITH CHECK (c % 2 = 0);
+
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- This should fail with RLS violation error while moving row from
+-- part_a_10_a_20 to part_d_1_15, because we are setting 'c' to an odd number.
+UPDATE range_parted set a = 'b', c = 151 WHERE a = 'a' and c = 200;
+
+RESET SESSION AUTHORIZATION;
+-- Create a trigger on part_d_1_15
+CREATE FUNCTION func_d_1_15() RETURNS trigger AS $$
+BEGIN
+   NEW.c = NEW.c + 1; -- Make even numbers odd, or vice versa
+   return NEW;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_d_1_15 BEFORE INSERT ON part_d_1_15
+   FOR EACH ROW EXECUTE PROCEDURE func_d_1_15();
+
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+
+-- Here, RLS checks should succeed while moving row from part_a_10_a_20 to
+-- part_d_1_15. Even though the UPDATE is setting 'c' to an odd number, the
+-- trigger at the destination partition makes it an even number again.
+UPDATE range_parted set a = 'b', c = 151 WHERE a = 'a' and c = 200;
+
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- This should fail with RLS violation error. Even though the UPDATE is setting
+-- 'c' to an even number, the trigger at the destination partition makes it
+-- an odd number.
+UPDATE range_parted set a = 'b', c = 150 WHERE a = 'a' and c = 200;
+
+-- Cleanup
+RESET SESSION AUTHORIZATION;
+DROP TRIGGER trig_d_1_15 ON part_d_1_15;
+DROP FUNCTION func_d_1_15();
+
+-- Policy expression contains SubPlan
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+CREATE POLICY policy_range_parted_subplan on range_parted AS RESTRICTIVE for UPDATE USING (true)
+    WITH CHECK ((SELECT range_parted.c <= c1 FROM mintab ));
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- fail, mintab has row with c1 = 120
+UPDATE range_parted set a = 'b', c = 122 WHERE a = 'a' and c = 200;
 -- ok
-update range_parted set b = b + 1 where b = 10;
+UPDATE range_parted set a = 'b', c = 120 WHERE a = 'a' and c = 200;
+
+-- RLS policy expression contains whole row.
+
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+CREATE POLICY policy_range_parted_wholerow on range_parted AS RESTRICTIVE for UPDATE USING (true)
+   WITH CHECK (range_parted = row('b', 10, 112, 1, NULL)::range_parted);
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- ok, should pass the RLS check
+UPDATE range_parted set a = 'b', c = 112 WHERE a = 'a' and c = 200;
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- fail, the whole row RLS check should fail
+UPDATE range_parted set a = 'b', c = 116 WHERE a = 'a' and c = 200;
+
+-- Cleanup
+RESET SESSION AUTHORIZATION;
+DROP POLICY policy_range_parted ON range_parted;
+DROP POLICY policy_range_parted_subplan ON range_parted;
+DROP POLICY policy_range_parted_wholerow ON range_parted;
+REVOKE ALL ON range_parted, mintab FROM regress_range_parted_user;
+DROP USER regress_range_parted_user;
+DROP TABLE mintab;
+
+
+-- statement triggers with update row movement
+---------------------------------------------------
+
+:init_range_parted;
+
+CREATE FUNCTION trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+CREATE TRIGGER parent_delete_trig
+  AFTER DELETE ON range_parted for each statement execute procedure trigfunc();
+CREATE TRIGGER parent_update_trig
+  AFTER UPDATE ON range_parted for each statement execute procedure trigfunc();
+CREATE TRIGGER parent_insert_trig
+  AFTER INSERT ON range_parted for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_1_100
+CREATE TRIGGER c1_delete_trig
+  AFTER DELETE ON part_c_1_100 for each statement execute procedure trigfunc();
+CREATE TRIGGER c1_update_trig
+  AFTER UPDATE ON part_c_1_100 for each statement execute procedure trigfunc();
+CREATE TRIGGER c1_insert_trig
+  AFTER INSERT ON part_c_1_100 for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_d_1_15
+CREATE TRIGGER d1_delete_trig
+  AFTER DELETE ON part_d_1_15 for each statement execute procedure trigfunc();
+CREATE TRIGGER d1_update_trig
+  AFTER UPDATE ON part_d_1_15 for each statement execute procedure trigfunc();
+CREATE TRIGGER d1_insert_trig
+  AFTER INSERT ON part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+CREATE TRIGGER d15_delete_trig
+  AFTER DELETE ON part_d_15_20 for each statement execute procedure trigfunc();
+CREATE TRIGGER d15_update_trig
+  AFTER UPDATE ON part_d_15_20 for each statement execute procedure trigfunc();
+CREATE TRIGGER d15_insert_trig
+  AFTER INSERT ON part_d_15_20 for each statement execute procedure trigfunc();
+
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or
+-- insert statement triggers should be fired.
+UPDATE range_parted set c = c - 50 WHERE c > 97;
+:show_data;
+
+DROP TRIGGER parent_delete_trig ON range_parted;
+DROP TRIGGER parent_update_trig ON range_parted;
+DROP TRIGGER parent_insert_trig ON range_parted;
+DROP TRIGGER c1_delete_trig ON part_c_1_100;
+DROP TRIGGER c1_update_trig ON part_c_1_100;
+DROP TRIGGER c1_insert_trig ON part_c_1_100;
+DROP TRIGGER d1_delete_trig ON part_d_1_15;
+DROP TRIGGER d1_update_trig ON part_d_1_15;
+DROP TRIGGER d1_insert_trig ON part_d_1_15;
+DROP TRIGGER d15_delete_trig ON part_d_15_20;
+DROP TRIGGER d15_update_trig ON part_d_15_20;
+DROP TRIGGER d15_insert_trig ON part_d_15_20;
+
 
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
 insert into range_parted values ('c', 9);
@@ -134,19 +439,121 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 
-create table list_parted (
+:show_data;
+
+-- Update row movement from non-default to default partition.
+-- fail, default partition is not under part_a_10_a_20
+UPDATE part_a_10_a_20 set a = 'ad' WHERE a = 'a';
+-- ok
+UPDATE range_parted set a = 'ad' WHERE a = 'a';
+UPDATE range_parted set a = 'bd' WHERE a = 'b';
+:show_data;
+-- Update row movement from default to non-default partitions.
+-- ok
+UPDATE range_parted set a = 'a' WHERE a = 'ad';
+UPDATE range_parted set a = 'b' WHERE a = 'bd';
+:show_data;
+
+-- Cleanup: range_parted no longer needed.
+DROP TABLE range_parted;
+
+CREATE TABLE list_parted (
 	a text,
 	b int
-) partition by list (a);
-create table list_part1  partition of list_parted for values in ('a', 'b');
-create table list_default partition of list_parted default;
-insert into list_part1 values ('a', 1);
-insert into list_default values ('d', 10);
+) PARTITION BY list (a);
+CREATE TABLE list_part1  PARTITION OF list_parted for VALUES in ('a', 'b');
+CREATE TABLE list_default PARTITION OF list_parted default;
+INSERT into list_part1 VALUES ('a', 1);
+INSERT into list_default VALUES ('d', 10);
 
 -- fail
-update list_default set a = 'a' where a = 'd';
+UPDATE list_default set a = 'a' WHERE a = 'd';
 -- ok
-update list_default set a = 'x' where a = 'd';
+UPDATE list_default set a = 'x' WHERE a = 'd';
+
+DROP TABLE list_parted;
+
+--------------
+-- Some more update-partition-key test scenarios below. This time use list
+-- partitions.
+--------------
+
+-- Setup for list partitions
+CREATE TABLE list_parted (a numeric, b int, c int8) PARTITION BY list (a);
+CREATE TABLE sub_parted PARTITION OF list_parted for VALUES in (1) PARTITION BY list (b);
+
+CREATE TABLE sub_part1(b int, c int8, a numeric);
+ALTER TABLE sub_parted ATTACH PARTITION sub_part1 for VALUES in (1);
+CREATE TABLE sub_part2(b int, c int8, a numeric);
+ALTER TABLE sub_parted ATTACH PARTITION sub_part2 for VALUES in (2);
+
+CREATE TABLE list_part1(a numeric, b int, c int8);
+ALTER TABLE list_parted ATTACH PARTITION list_part1 for VALUES in (2,3);
+
+INSERT into list_parted VALUES (2,5,50);
+INSERT into list_parted VALUES (3,6,60);
+INSERT into sub_parted VALUES (1,1,60);
+INSERT into sub_parted VALUES (1,2,10);
+
+-- Test a partition constraint violation when an intermediate ancestor is used
+-- and the constraint is inherited from the upper root.
+UPDATE sub_parted set a = 2 WHERE c = 10;
+
+-- Test update-partition-key, where the unpruned partitions do not have their
+-- partition keys updated.
+SELECT tableoid::regclass::text, * FROM list_parted WHERE a = 2 ORDER BY 1;
+UPDATE list_parted set b = c + a WHERE a = 2;
+SELECT tableoid::regclass::text, * FROM list_parted WHERE a = 2 ORDER BY 1;
+
+
+-----------
+-- Tests for BR UPDATE triggers changing the partition key.
+-----------
+CREATE FUNCTION func_parted_mod_b() returns trigger as $$
+BEGIN
+   NEW.b = 2; -- This changes the partition key column.
+   return NEW;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+SELECT tableoid::regclass::text, * FROM list_parted ORDER BY 1, 2, 3, 4;
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1.
+UPDATE list_parted set c = 70 WHERE b = 1;
+SELECT tableoid::regclass::text, * FROM list_parted ORDER BY 1, 2, 3, 4;
+
+DROP TRIGGER parted_mod_b ON sub_part1;
+
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+CREATE OR REPLACE FUNCTION func_parted_mod_b() returns trigger as $$
+BEGIN
+   raise notice 'Trigger: Got OLD row %, but returning NULL', OLD;
+   return NULL;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_skip_delete before delete on sub_part2
+   for each row execute procedure func_parted_mod_b();
+UPDATE list_parted set b = 1 WHERE c = 70;
+SELECT tableoid::regclass::text, * FROM list_parted ORDER BY 1, 2, 3, 4;
+-- Drop the trigger. Now the row should be moved.
+DROP TRIGGER trig_skip_delete ON sub_part2;
+UPDATE list_parted set b = 1 WHERE c = 70;
+SELECT tableoid::regclass::text, * FROM list_parted ORDER BY 1, 2, 3, 4;
+DROP FUNCTION func_parted_mod_b();
+
+-- UPDATE the partition key with a FROM clause. If the join produces multiple
+-- output rows for the same row to be modified, we should tuple-route the row
+-- only once; no extra rows should be inserted.
+CREATE TABLE non_parted (id int);
+INSERT into non_parted VALUES (1), (1), (1), (2), (2), (2), (3), (3), (3);
+UPDATE list_parted t1 set a = 2 FROM non_parted t2 WHERE t1.a = t2.id and a = 1;
+SELECT tableoid::regclass::text, * FROM list_parted ORDER BY 1, 2, 3, 4;
+DROP TABLE non_parted;
+
+-- Cleanup: list_parted no longer needed.
+DROP TABLE list_parted;
 
 -- create custom operator class and hash function, for the same reason
 -- explained in alter_table.sql
@@ -169,13 +576,12 @@ insert into hpart4 values (3, 4);
 
 -- fail
 update hpart1 set a = 3, b=4 where a = 1;
+-- ok, row movement
 update hash_parted set b = b - 1 where b = 1;
 -- ok
 update hash_parted set b = b + 8 where b = 1;
 
 -- cleanup
-drop table range_parted;
-drop table list_parted;
 drop table hash_parted;
 drop operator class custom_opclass using hash;
 drop function dummy_hashint4(a int4, seed int8);
#241Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#239)
1 attachment(s)
Re: [HACKERS] UPDATE of partition key

On 14 January 2018 at 17:27, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 13 January 2018 at 02:56, Robert Haas <robertmhaas@gmail.com> wrote:

I guess I'm inclined to keep mt_per_sub_plan_maps for the case where
there are no partitions, but not use it when partitions are present.
What do you think about that?

Even where partitions are present, in the usual case where there are no transition tables we won't require a per-leaf map at all [1]. So I think we should keep mt_per_sub_plan_maps only for the case where the per-leaf map is not allocated. And we will not allocate mt_per_sub_plan_maps when mt_per_leaf_maps is needed. In other words, exactly one of the two maps will be allocated.

This is turning out to be close to what's already there in the last patch versions: use a single map array, and an offsets array. The difference is: in the patch I am using the *same* variable for the two maps. Whereas now we are talking about two different array variables for maps, but only allocating one of them.

Are you ok with this? I think the thing you were against was to have a common *variable* for two purposes. But above, I am saying we have two variables but assign a map array to only *one* of them and leave the other unused.

---------

Regarding the on-demand map allocation ....
Where mt_per_sub_plan_maps is allocated, we won't have the on-demand allocation: all the maps will be allocated initially. The reason is because the map_is_required array is only per-leaf. Or else, again, we need to keep another map_is_required array for per-subplan. Maybe we can support the on-demand stuff for subplan maps also, but only as a separate change after we are done with update-partition-key.

---------

Regarding mt_per_leaf_tupconv_required, I am thinking we can make it a bool array, and name it mt_per_leaf_map_not_required. When it is true for a given index, it means we have already called convert_tuples_by_name() and it returned NULL; i.e. we are sure that the map is not required. A false value means we need to call convert_tuples_by_name() if the map is NULL, and then set mt_per_leaf_map_not_required to (map == NULL).

Instead of a bool array, we can make it a Bitmapset. But I think access would become slower compared to an array, particularly because it is going to be a heavily used function.

I went ahead and did the above changes. I haven't yet merged these
changes in the main patch. Instead, I have attached it as an
incremental patch to be applied on the main v36 patch. The incremental
patch is not yet quite polished, and quite a bit of cosmetic changes
might be required, plus testing. But am posting it in case I have some
early feedback. Details :

The per-subplan map array variable is kept in ModifyTableState :
-       TupleConversionMap **mt_childparent_tupconv_maps;
-       /* Per plan/partition map for tuple conversion from child to root */
-       bool            mt_is_tupconv_perpart;  /* Is the above map
per-partition ? */
+       TupleConversionMap **mt_per_subplan_tupconv_maps;
+       /* Per plan map for tuple conversion from child to root */
 } ModifyTableState;
The per-leaf array variable and the not_required array is kept in
PartitionTupleRouting :
-       TupleConversionMap **partition_tupconv_maps;
+       TupleConversionMap **parent_child_tupconv_maps;
+       TupleConversionMap **child_parent_tupconv_maps;
+       bool       *child_parent_tupconv_map_not_reqd;
As you can see above, all the arrays are per-partition, so I removed the
per-leaf tag from these names. I renamed the existing
partition_tupconv_maps to parent_child_tupconv_maps, and named the new
per-leaf array child_parent_tupconv_maps.

Have two separate functions ExecSetupChildParentMapForLeaf() and
ExecSetupChildParentMapForSubplan() since most of their code is
different. And now because of this, we can re-use
ExecSetupChildParentMapForLeaf() in both copy.c and nodeModifyTable.c.

Even inserts/copy will benefit from the on-demand map allocation. This
is because now there is a function TupConvMapForLeaf() that is called
in both copy.c and ExecInsert(). This is the function that does
on-demand allocation.

Attached the incremental patch conversion_map_changes.patch that has
the above changes. It is to be applied over the latest main patch
(update-partition-key_v36.patch).

Attachments:

conversion_map_changes.patch (application/octet-stream)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index d869ac5..04a24c6 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -170,7 +170,6 @@ typedef struct CopyStateData
 	PartitionTupleRouting *partition_tuple_routing;
 
 	TransitionCaptureState *transition_capture;
-	TupleConversionMap **transition_tupconv_maps;
 
 	/*
 	 * These variables are used to reduce overhead in textual COPY FROM.
@@ -2481,19 +2480,7 @@ CopyFrom(CopyState cstate)
 		 * tuple).
 		 */
 		if (cstate->transition_capture != NULL)
-		{
-			int			i;
-
-			cstate->transition_tupconv_maps = (TupleConversionMap **)
-				palloc0(sizeof(TupleConversionMap *) * proute->num_partitions);
-			for (i = 0; i < proute->num_partitions; ++i)
-			{
-				cstate->transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(proute->partitions[i]->ri_RelationDesc),
-										   RelationGetDescr(cstate->rel),
-										   gettext_noop("could not convert row type"));
-			}
-		}
+			ExecSetupChildParentMapForLeaf(proute);
 	}
 
 	/*
@@ -2650,7 +2637,8 @@ CopyFrom(CopyState cstate)
 					 */
 					cstate->transition_capture->tcs_original_insert_tuple = NULL;
 					cstate->transition_capture->tcs_map =
-						cstate->transition_tupconv_maps[leaf_part_index];
+						TupConvMapForLeaf(proute, saved_resultRelInfo,
+										  leaf_part_index);
 				}
 				else
 				{
@@ -2667,7 +2655,7 @@ CopyFrom(CopyState cstate)
 			 * We might need to convert from the parent rowtype to the
 			 * partition rowtype.
 			 */
-			tuple = ConvertPartitionTupleSlot(proute->partition_tupconv_maps[leaf_part_index],
+			tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[leaf_part_index],
 											  tuple,
 											  proute->partition_tuple_slot,
 											  &slot);
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 5100d82..9d3677f 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -73,7 +73,7 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 	proute->num_partitions = list_length(leaf_parts);
 	proute->partitions = (ResultRelInfo **) palloc(proute->num_partitions *
 												   sizeof(ResultRelInfo *));
-	proute->partition_tupconv_maps =
+	proute->parent_child_tupconv_maps =
 		(TupleConversionMap **) palloc0(proute->num_partitions *
 										sizeof(TupleConversionMap *));
 
@@ -198,7 +198,7 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 		 * Save a tuple conversion map to convert a tuple routed to this
 		 * partition from the parent's type to the partition's.
 		 */
-		proute->partition_tupconv_maps[i] =
+		proute->parent_child_tupconv_maps[i] =
 			convert_tuples_by_name(tupDesc, part_tupdesc,
 								   gettext_noop("could not convert row type"));
 
@@ -358,6 +358,69 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 }
 
 /*
+ * Initialize the per-leaf-partition child-to-root tuple conversion map array.
+ *
+ * This map is required for capturing transition tuples when the target table
+ * is a partitioned table. For a tuple that is routed by INSERT or UPDATE, we
+ * need to convert it from the leaf partition's rowtype to the target table's
+ * rowtype.
+ */
+void
+ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute)
+{
+	Assert(proute != NULL);
+
+	/*
+	 * These array elements gets filled up with maps on an on-demand basis.
+	 * Initially just set all of them to NULL.
+	 */
+	proute->child_parent_tupconv_maps =
+		palloc0(sizeof(TupleConversionMap *) * proute->num_partitions);
+
+	/* Same is the case for this array. */
+	proute->child_parent_tupconv_map_not_reqd =
+		(bool *) palloc0(sizeof(bool) * proute->num_partitions);
+}
+
+/*
+ * TupConvMapForLeaf -- For a given leaf partition index, get the tuple
+ * conversion map.
+ */
+TupleConversionMap *
+TupConvMapForLeaf(PartitionTupleRouting *proute,
+					 ResultRelInfo *rootRelInfo, int leaf_index)
+{
+	Assert(leaf_index >= 0 && leaf_index < proute->num_partitions);
+	Assert(proute->child_parent_tupconv_maps != NULL);
+
+	/* If it is already determined that the map is not required, return NULL. */
+	if (proute->child_parent_tupconv_map_not_reqd[leaf_index])
+		return NULL;
+	else
+	{
+		ResultRelInfo **resultRelInfos = proute->partitions;
+		TupleConversionMap **map = proute->child_parent_tupconv_maps + leaf_index;
+
+		/*
+		 * Either the map is already allocated, or it is yet to be determined
+		 * if the map is required.
+		 */
+		if (!*map)
+		{
+			*map =
+				convert_tuples_by_name(RelationGetDescr(resultRelInfos[leaf_index]->ri_RelationDesc),
+									   RelationGetDescr(rootRelInfo->ri_RelationDesc),
+									   gettext_noop("could not convert row type"));
+
+			/* Update the array element with the new info */
+			proute->child_parent_tupconv_map_not_reqd[leaf_index] =
+				(*map == NULL);
+		}
+		return *map;
+	}
+}
+
+/*
  * ConvertPartitionTupleSlot -- convenience function for tuple conversion using
  * 'map'. The tuple, if converted, is stored in 'new_slot', and 'p_my_slot' is
  * updated with the 'new_slot'. 'new_slot' typically should be one of the
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 5f1c51f..0335339 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -64,7 +64,9 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 EState *estate,
 					 bool canSetTag,
 					 TupleTableSlot **returning);
-static void ExecSetupChildParentMap(ModifyTableState *mtstate, bool perleaf);
+static ResultRelInfo *getASTriggerResultRelInfo(ModifyTableState *node);
+static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
+static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
 static inline TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
 													int whichplan);
 /*
@@ -336,9 +338,10 @@ ExecInsert(ModifyTableState *mtstate,
 				 */
 				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
 
-				Assert(mtstate->mt_is_tupconv_perpart);
 				mtstate->mt_transition_capture->tcs_map =
-					mtstate->mt_childparent_tupconv_maps[leaf_part_index];
+					TupConvMapForLeaf(proute,
+									  getASTriggerResultRelInfo(mtstate),
+									  leaf_part_index);
 			}
 			else
 			{
@@ -352,16 +355,16 @@ ExecInsert(ModifyTableState *mtstate,
 		}
 		if (mtstate->mt_oc_transition_capture != NULL)
 		{
-			Assert(mtstate->mt_is_tupconv_perpart);
 			mtstate->mt_oc_transition_capture->tcs_map =
-				mtstate->mt_childparent_tupconv_maps[leaf_part_index];
+				TupConvMapForLeaf(proute, getASTriggerResultRelInfo(mtstate),
+								  leaf_part_index);
 		}
 
 		/*
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		tuple = ConvertPartitionTupleSlot(proute->partition_tupconv_maps[leaf_part_index],
+		tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[leaf_part_index],
 										  tuple,
 										  proute->partition_tuple_slot,
 										  &slot);
@@ -1682,8 +1685,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	if (mtstate->mt_transition_capture != NULL ||
 		mtstate->mt_oc_transition_capture != NULL)
 	{
-		ExecSetupChildParentMap(mtstate,
-								(mtstate->mt_partition_tuple_routing != NULL));
+		ExecSetupChildParentMapForTcs(mtstate);
 
 		/*
 		 * Install the conversion map for the first plan for UPDATE and DELETE
@@ -1698,55 +1700,32 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 }
 
 /*
- * Initialize the child-to-root tuple conversion map array.
+ * Initialize the child-to-root tuple conversion map array for UPDATE subplans.
  *
  * This map array is required for two purposes:
  * 1. For update-tuple-routing. We need to convert the tuple from the subplan
  *    result rel to the root partitioned table descriptor.
- * 2. For capturing transition tuples when the target table is a partitioned
- *    table. For updates, we need to convert the tuple from the subplan result
- *    rel to the target table descriptor, and for inserts, we need to convert
- *    the inserted tuple from the leaf partition to the target table
- *    descriptor.
- *
- * The caller can request either a per-subplan map or per-leaf-partition map.
+ * 2. For capturing transition tuples. For updates, we need to convert the tuple
+ *    from the subplan result rel to the target table descriptor.
  */
-static void
-ExecSetupChildParentMap(ModifyTableState *mtstate, bool perleaf)
+void
+ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 {
 	ResultRelInfo *rootRelInfo = getASTriggerResultRelInfo(mtstate);
-	PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+	ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
 	TupleDesc	outdesc;
-	int			numResultRelInfos;
+	int			numResultRelInfos = mtstate->mt_nplans;
 	int			i;
 
-	/* First check if there is already one */
-	if (mtstate->mt_childparent_tupconv_maps)
-	{
-		/*
-		 * If per-leaf map is required and the map is already created, that map
-		 * has to be per-leaf. If that map is per-subplan, we won't be able to
-		 * access the maps leaf-partition-wise. But if the map is per-leaf, we
-		 * will be able to access the maps subplan-wise using the
-		 * subplan_partition_offsets map using function
-		 * tupconv_map_for_subplan().  So if the callers might need to access
-		 * the map both leaf-partition-wise and subplan-wise, they should make
-		 * sure that the first time this function is called, it should be
-		 * called with perleaf=true so that the map created is per-leaf, not
-		 * per-subplan.
-		 */
-		Assert(!(perleaf && !mtstate->mt_is_tupconv_perpart));
+	/*
+	 * First check if there is already one. Even if there is already a per-leaf
+	 * map, we won't require a per-subplan one, since we will use the subplan
+	 * offset array to convert a subplan index to a per-leaf index.
+	 */
+	if (mtstate->mt_per_subplan_tupconv_maps ||
+		(mtstate->mt_partition_tuple_routing &&
+		mtstate->mt_partition_tuple_routing->child_parent_tupconv_maps))
 		return;
-	}
-
-	/* If perleaf is true, partition tuple routing info has to be present */
-	Assert(!perleaf || proute != NULL);
-
-	numResultRelInfos = (perleaf ? proute->num_partitions :
-								   mtstate->mt_nplans);
-
-	/* Get tuple descriptor of the root partitioned table. */
-	outdesc = RelationGetDescr(rootRelInfo->ri_RelationDesc);
 
 	/*
 	 * Build array of conversion maps from each child's TupleDesc to the
@@ -1754,48 +1733,55 @@ ExecSetupChildParentMap(ModifyTableState *mtstate, bool perleaf)
 	 * conversion is necessary, which is hopefully a common case for
 	 * partitions.
 	 */
-	mtstate->mt_childparent_tupconv_maps = (TupleConversionMap **)
+
+	/* Get tuple descriptor of the root partitioned table. */
+	outdesc = RelationGetDescr(rootRelInfo->ri_RelationDesc);
+
+	mtstate->mt_per_subplan_tupconv_maps =
 		palloc(sizeof(TupleConversionMap *) * numResultRelInfos);
 
-	/* Choose the right set of partitions */
-	if (perleaf)
+	for (i = 0; i < numResultRelInfos; ++i)
 	{
-		/*
-		 * For tuple routing among partitions, we need TupleDescs based on the
-		 * partition routing table.
-		 */
-		ResultRelInfo **resultRelInfos;
-
-		resultRelInfos = proute->partitions;
+		mtstate->mt_per_subplan_tupconv_maps[i] =
+			convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
+								   outdesc,
+								   gettext_noop("could not convert row type"));
+	}
+}
 
-		for (i = 0; i < numResultRelInfos; ++i)
-		{
-			mtstate->mt_childparent_tupconv_maps[i] =
-				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
-									   outdesc,
-									   gettext_noop("could not convert row type"));
-		}
+/*
+ * Initialize the child-to-root tuple conversion map array required for
+ * capturing transition tuples.
+ *
+ * For updates, a per-subplan map is required; additionally, a
+ * per-leaf-partition map is required when tuples are routed, whether the
+ * routing happens for updates or for inserts.
+ */
+static void
+ExecSetupChildParentMapForTcs(ModifyTableState *mtstate)
+{
+	PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
 
+	/*
+	 * For transition tables, we need a subplan-indexed access, and where
+	 * tuple-routing is present, we also require a leaf-indexed access.
+	 */
+	if (proute)
+	{
 		/*
-		 * Save the info that the tuple conversion map is per-leaf, not
-		 * per-subplan
+		 * If per-leaf map is to be created, the subplan map has to be NULL.
+		 * If the subplan map is already created, we won't be able to access
+		 * the map leaf-partition-wise.  But if the map is per-leaf, we will be
+		 * able to access the maps subplan-wise using the
+		 * subplan_partition_offsets map using function
+		 * tupconv_map_for_subplan().
 		 */
-		mtstate->mt_is_tupconv_perpart = true;
-	}
-	else
-	{
-		/* Otherwise we need the ResultRelInfo for each subplan. */
-		ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+		Assert(mtstate->mt_per_subplan_tupconv_maps == NULL);
 
-		for (i = 0; i < numResultRelInfos; ++i)
-		{
-			mtstate->mt_childparent_tupconv_maps[i] =
-				convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
-									   outdesc,
-									   gettext_noop("could not convert row type"));
-		}
+		ExecSetupChildParentMapForLeaf(proute);
 	}
-
+	else
+		ExecSetupChildParentMapForSubplan(mtstate);
 }
 
 /*
@@ -1804,13 +1790,11 @@ ExecSetupChildParentMap(ModifyTableState *mtstate, bool perleaf)
 static inline TupleConversionMap *
 tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
 {
-	Assert(mtstate->mt_childparent_tupconv_maps != NULL);
-
 	/*
 	 * If the tuple conversion map array is per-partition, we need to first get
 	 * the index into the partition array.
 	 */
-	if (mtstate->mt_is_tupconv_perpart)
+	if (mtstate->mt_per_subplan_tupconv_maps == NULL)
 	{
 		int leaf_index;
 		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
@@ -1818,13 +1802,13 @@ tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
 		Assert(proute && proute->subplan_partition_offsets != NULL);
 		leaf_index = proute->subplan_partition_offsets[whichplan];
 
-		Assert(leaf_index >= 0 && leaf_index < proute->num_partitions);
-		return mtstate->mt_childparent_tupconv_maps[leaf_index];
+		return TupConvMapForLeaf(proute, getASTriggerResultRelInfo(mtstate),
+								 leaf_index);
 	}
 	else
 	{
 		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
-		return mtstate->mt_childparent_tupconv_maps[whichplan];
+		return mtstate->mt_per_subplan_tupconv_maps[whichplan];
 	}
 }
 
@@ -2270,7 +2254,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * from the root.  Skip this setup if it's not a partition key update.
 	 */
 	if (update_tuple_routing_needed)
-		ExecSetupChildParentMap(mtstate, false);
+		ExecSetupChildParentMapForSubplan(mtstate);
 
 	/*
 	 * Initialize any WITH CHECK OPTION constraints if needed.
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 0afa41e..06e6edd 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -62,11 +62,22 @@ typedef struct PartitionDispatchData *PartitionDispatch;
  *								for every leaf partition in the partition tree.
  * num_partitions				Number of leaf partitions in the partition tree
  *								(= 'partitions' array length)
- * partition_tupconv_maps		Array of TupleConversionMap objects with one
+ * parent_child_tupconv_maps	Array of TupleConversionMap objects with one
  *								entry for every leaf partition (required to
  *								convert input tuple based on the root table's
  *								rowtype to a leaf partition's rowtype after
  *								tuple routing is done)
+ * child_parent_tupconv_maps	Array of TupleConversionMap objects with one
+ *								entry for every leaf partition (required to
+ *								convert input tuple based on the leaf
+ *								partition's rowtype to the root table's rowtype
+ *								after tuple routing is done)
+ * child_parent_tupconv_map_not_reqd
+ *								Array of bool. True value means that a map is
+ *								determined to be not required for the given
+ *								partition. False means either we haven't yet
+ *								checked if a map is required, or it was
+ *								determined to be required.
  * subplan_partition_offsets	int array ordered by UPDATE subplans. Each
  *								element of this array has the index into the
  *								corresponding partition in 'partitions' array.
@@ -82,7 +93,9 @@ typedef struct PartitionTupleRouting
 	int			num_dispatch;
 	ResultRelInfo **partitions;
 	int			num_partitions;
-	TupleConversionMap **partition_tupconv_maps;
+	TupleConversionMap **parent_child_tupconv_maps;
+	TupleConversionMap **child_parent_tupconv_maps;
+	bool	   *child_parent_tupconv_map_not_reqd;
 	int		   *subplan_partition_offsets;
 	TupleTableSlot *partition_tuple_slot;
 	TupleTableSlot *root_tuple_slot;
@@ -95,6 +108,9 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern void ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute);
+extern TupleConversionMap *TupConvMapForLeaf(PartitionTupleRouting *proute,
+					 ResultRelInfo *rootRelInfo, int leaf_index);
 extern HeapTuple ConvertPartitionTupleSlot(TupleConversionMap *map,
 						  HeapTuple tuple,
 						  TupleTableSlot *new_slot,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 8b5391d..defd5cd 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -991,9 +991,8 @@ typedef struct ModifyTableState
 	/* controls transition table population for specified operation */
 	struct TransitionCaptureState *mt_oc_transition_capture;
 	/* controls transition table population for INSERT...ON CONFLICT UPDATE */
-	TupleConversionMap **mt_childparent_tupconv_maps;
-	/* Per plan/partition map for tuple conversion from child to root */
-	bool		mt_is_tupconv_perpart;	/* Is the above map per-partition ? */
+	TupleConversionMap **mt_per_subplan_tupconv_maps;
+	/* Per plan map for tuple conversion from child to root */
 } ModifyTableState;
 
 /* ----------------
#242Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#239)
Re: [HACKERS] UPDATE of partition key

On Sun, Jan 14, 2018 at 6:57 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Even where partitions are present, in the usual case where there are
no transition tables we won't require per-leaf map at all [1]. So I
think we should keep mt_per_sub_plan_maps only for the case where
per-leaf map is not allocated. And we will not allocate
mt_per_sub_plan_maps when mt_per_leaf_maps is needed. In other words,
exactly one of the two maps will be allocated.

This is turning out to be close to what's already there in the last
patch versions: use a single map array, and an offsets array. The
difference is : in the patch I am using the *same* variable for the
two maps. Where as, now we are talking about two different array
variables for maps, but only allocating one of them.

Are you ok with this ? I think the thing you were against was to have
a common *variable* for two purposes. But above, I am saying we have
two variables but assign a map array to only *one* of them and leave
the other unused.

Yes, I'm OK with that.

Regarding the on-demand map allocation ....
Where mt_per_sub_plan_maps is allocated, we won't have the on-demand
allocation: all the maps will be allocated initially. The reason is
because the map_is_required array is only per-leaf. Or else, again, we
need to keep another map_is_required array for per-subplan. Maybe we
can support the on-demand stuff for subplan maps also, but only as a
separate change after we are done with update-partition-key.

Sure.

Regarding mt_per_leaf_tupconv_required, I am thinking we can make it a
bool array, and name it : mt_per_leaf_map_not_required. When it is
true for a given index, it means, we have already called
convert_tuples_by_name() and it returned NULL; i.e. it means we are
sure that map is not required. A false value means we need to call
convert_tuples_by_name() if it is NULL, and then set
mt_per_leaf_map_not_required to (map == NULL).

OK.

Instead of a bool array, we can even make it a Bitmapset. But I think
access would become slower as compared to array, particularly because
it is going to be a heavily used function.

It probably makes little difference -- the Bitmapset will be more
compact (which saves time) but involve function calls (which cost
time).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#243David Rowley
david.rowley@2ndquadrant.com
In reply to: Robert Haas (#242)
Re: [HACKERS] UPDATE of partition key

On 16 January 2018 at 01:09, Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Jan 14, 2018 at 6:57 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Instead of a bool array, we can even make it a Bitmapset. But I think
access would become slower compared to an array, particularly because
it is going to be a heavily used function.

It probably makes little difference -- the Bitmapset will be more
compact (which saves time) but involve function calls (which cost
time).

I'm not arguing in either direction, but you'd also want to factor in
how Bitmapsets only allocate words for the maximum stored member,
which might mean multiple realloc() calls resulting in palloc/memcpy
calls. The array would just be allocated in a single chunk; although
it would use more memory and also require a memset, that's likely
much cheaper than the palloc() anyway.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#244Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#241)
1 attachment(s)
Re: [HACKERS] UPDATE of partition key

On 15 January 2018 at 16:11, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

I went ahead and did the above changes. I haven't yet merged these
changes in the main patch. Instead, I have attached it as an
incremental patch to be applied on the main v36 patch. The incremental
patch is not yet quite polished, and quite a bit of cosmetic changes
might be required, plus testing. But am posting it in case I have some
early feedback.

I have now embedded the above incremental patch changes into the main
patch (v37), which is attached.

Because it is used heavily in the case of transition tables with
partitions, I have made TupConvMapForLeaf() a macro. The actual
creation of the map is in a separate function,
CreateTupConvMapForLeaf(), so as to reduce the macro size.

Retained child_parent_map_not_required as a bool array rather than a Bitmapset.

To cover one scenario related to on-demand map allocation that was
not being exercised by the update.sql test, I added one more scenario
to that file:
+-- Case where per-partition tuple conversion map array is allocated, but the
+-- map is not required for the particular tuple that is routed, thanks to
+-- matching table attributes of the partition and the target table.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

update-partition-key_v37.patch (application/octet-stream)
diff --git a/contrib/file_fdw/input/file_fdw.source b/contrib/file_fdw/input/file_fdw.source
index e6821d6..88cb5f2 100644
--- a/contrib/file_fdw/input/file_fdw.source
+++ b/contrib/file_fdw/input/file_fdw.source
@@ -178,6 +178,7 @@ SELECT tableoid::regclass, * FROM p1;
 SELECT tableoid::regclass, * FROM p2;
 INSERT INTO pt VALUES (1, 'xyzzy'); -- ERROR
 INSERT INTO pt VALUES (2, 'xyzzy');
+UPDATE pt set a = 1 where a = 2; -- ERROR
 SELECT tableoid::regclass, * FROM pt;
 SELECT tableoid::regclass, * FROM p1;
 SELECT tableoid::regclass, * FROM p2;
diff --git a/contrib/file_fdw/output/file_fdw.source b/contrib/file_fdw/output/file_fdw.source
index 709c43e..e07bb24 100644
--- a/contrib/file_fdw/output/file_fdw.source
+++ b/contrib/file_fdw/output/file_fdw.source
@@ -344,6 +344,8 @@ SELECT tableoid::regclass, * FROM p2;
 INSERT INTO pt VALUES (1, 'xyzzy'); -- ERROR
 ERROR:  cannot route inserted tuples to a foreign table
 INSERT INTO pt VALUES (2, 'xyzzy');
+UPDATE pt set a = 1 where a = 2; -- ERROR
+ERROR:  cannot route inserted tuples to a foreign table
 SELECT tableoid::regclass, * FROM pt;
  tableoid | a |   b   
 ----------+---+-------
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index b1167a4..6d97f26 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3005,6 +3005,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved into a
+    different partition where this row satisfies its partition constraint.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3302,9 +3307,22 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</command> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</command> causes a row to move from one
+       partition to another, there is a chance that another concurrent
+       <command>UPDATE</command> or <command>DELETE</command> misses this row.
+       Suppose session 1 is performing an <command>UPDATE</command> on a
+       partition key, and meanwhile a concurrent session 2 for which this row
+       is visible, performs an <command>UPDATE</command> or
+       <command>DELETE</command> operation on this row. Session 2 can silently
+       miss the row if the row is deleted from the partition due to session
+       1's activity.  In such a case, session 2's
+       <command>UPDATE</command>/<command>DELETE</command>, being unaware of
+       the row movement, interprets that the row has just been deleted so there
+       is nothing to be done for this row. Whereas, in the usual case where the
+       table is not partitioned, or where there is no row movement, session 2
+       would have identified the newly updated row and carried out the
+       <command>UPDATE</command>/<command>DELETE</command> on this new row
+       version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index c0d0f71..296e301 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,16 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</command> and
+   <command>INSERT</command> operation. However, there is a possibility that a
+   concurrent <command>UPDATE</command> or <command>DELETE</command> on the
+   same row may miss this row. For details see the section
+   <xref linkend="ddl-partitioning-declarative-limitations"/>.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index bf5d3f9..8f83e6a 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -154,6 +154,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to move
+    to another partition, it will be performed as a <command>DELETE</command>
+    from the original partition followed by an <command>INSERT</command> into
+    the new partition. In this case, all row-level <literal>BEFORE</literal>
+    <command>UPDATE</command> triggers and all row-level
+    <literal>BEFORE</literal> <command>DELETE</command> triggers are fired on
+    the original partition. Then all row-level <literal>BEFORE</literal>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</literal>
+    triggers are concerned, <literal>AFTER</literal> <command>DELETE</command>
+    and <literal>AFTER</literal> <command>INSERT</command> triggers are
+    applied; but <literal>AFTER</literal> <command>UPDATE</command> triggers
+    are not applied because the <command>UPDATE</command> has been converted to
+    a <command>DELETE</command> and an <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 6bfca2a..04a24c6 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -170,7 +170,6 @@ typedef struct CopyStateData
 	PartitionTupleRouting *partition_tuple_routing;
 
 	TransitionCaptureState *transition_capture;
-	TupleConversionMap **transition_tupconv_maps;
 
 	/*
 	 * These variables are used to reduce overhead in textual COPY FROM.
@@ -2481,19 +2480,7 @@ CopyFrom(CopyState cstate)
 		 * tuple).
 		 */
 		if (cstate->transition_capture != NULL)
-		{
-			int			i;
-
-			cstate->transition_tupconv_maps = (TupleConversionMap **)
-				palloc0(sizeof(TupleConversionMap *) * proute->num_partitions);
-			for (i = 0; i < proute->num_partitions; ++i)
-			{
-				cstate->transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(proute->partitions[i]->ri_RelationDesc),
-										   RelationGetDescr(cstate->rel),
-										   gettext_noop("could not convert row type"));
-			}
-		}
+			ExecSetupChildParentMapForLeaf(proute);
 	}
 
 	/*
@@ -2587,7 +2574,6 @@ CopyFrom(CopyState cstate)
 		if (cstate->partition_tuple_routing)
 		{
 			int			leaf_part_index;
-			TupleConversionMap *map;
 			PartitionTupleRouting *proute = cstate->partition_tuple_routing;
 
 			/*
@@ -2651,7 +2637,8 @@ CopyFrom(CopyState cstate)
 					 */
 					cstate->transition_capture->tcs_original_insert_tuple = NULL;
 					cstate->transition_capture->tcs_map =
-						cstate->transition_tupconv_maps[leaf_part_index];
+						TupConvMapForLeaf(proute, saved_resultRelInfo,
+										  leaf_part_index);
 				}
 				else
 				{
@@ -2668,23 +2655,10 @@ CopyFrom(CopyState cstate)
 			 * We might need to convert from the parent rowtype to the
 			 * partition rowtype.
 			 */
-			map = proute->partition_tupconv_maps[leaf_part_index];
-			if (map)
-			{
-				Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-				tuple = do_convert_tuple(tuple, map);
-
-				/*
-				 * We must use the partition's tuple descriptor from this
-				 * point on.  Use a dedicated slot from this point on until
-				 * we're finished dealing with the partition.
-				 */
-				slot = proute->partition_tuple_slot;
-				Assert(slot != NULL);
-				ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-				ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-			}
+			tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[leaf_part_index],
+											  tuple,
+											  proute->partition_tuple_slot,
+											  &slot);
 
 			tuple->t_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
 		}
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 1c488c3..e8af18e 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2854,8 +2854,13 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 	{
 		HeapTuple	trigtuple;
 
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
+		/*
+		 * Note: if the UPDATE is converted into a DELETE+INSERT as part of
+		 * update-partition-key operation, then this function is also called
+		 * separately for DELETE and INSERT to capture transition table rows.
+		 * In such case, either old tuple or new tuple can be NULL.
+		 */
+		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
 			trigtuple = GetTupleForTrigger(estate,
 										   NULL,
 										   relinfo,
@@ -5414,7 +5419,12 @@ AfterTriggerPendingOnRel(Oid relid)
  *	triggers actually need to be queued.  It is also called after each row,
  *	even if there are no triggers for that event, if there are any AFTER
  *	STATEMENT triggers for the statement which use transition tables, so that
- *	the transition tuplestores can be built.
+ *	the transition tuplestores can be built.  Furthermore, if the transition
+ *	capture is happening for UPDATEd rows being moved to another partition due
+ *	to the partition-key being changed, then this function is called once when
+ *	the row is deleted (to capture OLD row), and once when the row is inserted
+ *	into another partition (to capture NEW row).  This is done separately because
+ *	DELETE and INSERT happen on different tables.
  *
  *	Transition tuplestores are built now, rather than when events are pulled
  *	off of the queue because AFTER ROW triggers are allowed to select from the
@@ -5463,12 +5473,25 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		bool		update_new_table = transition_capture->tcs_update_new_table;
 		bool		insert_new_table = transition_capture->tcs_insert_new_table;;
 
-		if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_old_table))
+		/*
+		 * For INSERT events newtup should be non-NULL, for DELETE events
+		 * oldtup should be non-NULL, whereas for UPDATE events normally both
+		 * oldtup and newtup are non-NULL.  But for UPDATE events fired for
+		 * capturing transition tuples during UPDATE partition-key row
+		 * movement, oldtup is NULL when the event is for a row being inserted,
+		 * whereas newtup is NULL when the event is for a row being deleted.
+		 */
+		Assert(!(event == TRIGGER_EVENT_DELETE && delete_old_table &&
+				 oldtup == NULL));
+		Assert(!(event == TRIGGER_EVENT_INSERT && insert_new_table &&
+				 newtup == NULL));
+
+		if (oldtup != NULL &&
+			((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_old_table)))
 		{
 			Tuplestorestate *old_tuplestore;
 
-			Assert(oldtup != NULL);
 			old_tuplestore = transition_capture->tcs_private->old_tuplestore;
 
 			if (map != NULL)
@@ -5481,12 +5504,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			else
 				tuplestore_puttuple(old_tuplestore, oldtup);
 		}
-		if ((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_new_table))
+		if (newtup != NULL &&
+			((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
+			(event == TRIGGER_EVENT_UPDATE && update_new_table)))
 		{
 			Tuplestorestate *new_tuplestore;
 
-			Assert(newtup != NULL);
 			new_tuplestore = transition_capture->tcs_private->new_tuplestore;
 
 			if (original_insert_tuple != NULL)
@@ -5502,11 +5525,18 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				tuplestore_puttuple(new_tuplestore, newtup);
 		}
 
-		/* If transition tables are the only reason we're here, return. */
+		/*
+		 * If transition tables are the only reason we're here, return. As
+		 * mentioned above, we can also be here during update tuple routing in
+		 * presence of transition tables, in which case this function is called
+		 * separately for oldtup and newtup, so we expect exactly one of them
+		 * to be NULL.
+		 */
 		if (trigdesc == NULL ||
 			(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
 			(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+			(event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
 			return;
 	}
 
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 8c0d2df..a0a611c 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -54,7 +54,11 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
+	ResultRelInfo *leaf_part_arr = NULL,
+				  *update_rri = NULL;
+	int			num_update_rri = 0,
+				update_rri_index = 0;
+	bool		is_update = false;
 	PartitionTupleRouting *proute;
 
 	/*
@@ -69,10 +73,56 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 	proute->num_partitions = list_length(leaf_parts);
 	proute->partitions = (ResultRelInfo **) palloc(proute->num_partitions *
 												   sizeof(ResultRelInfo *));
-	proute->partition_tupconv_maps =
+	proute->parent_child_tupconv_maps =
 		(TupleConversionMap **) palloc0(proute->num_partitions *
 										sizeof(TupleConversionMap *));
 
+	/* Initialization specific to update */
+	if (mtstate && mtstate->operation == CMD_UPDATE)
+	{
+		ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+
+		is_update = true;
+		update_rri = mtstate->resultRelInfo;
+		num_update_rri = list_length(node->plans);
+
+		/*
+		 * For updates, if the leaf partition is already present in the
+		 * per-subplan result rels, we re-use that rather than initialize a new
+		 * result rel. The per-subplan resultrels and the resultrels of the
+		 * leaf partitions are both in the same canonical order. So while going
+		 * through the leaf partition oids, we need to keep track of the next
+		 * per-subplan result rel to be looked for in the leaf partition
+		 * resultrels. So, update_rri_index should be set to the first
+		 * per-subplan result rel (i.e. 0), and then should be shifted as we
+		 * find them one by one while scanning the leaf partition oids. (It is
+		 * already set to 0 during initialization, above).
+		 */
+
+		/*
+		 * Prepare for generating the mapping from subplan result rels to leaf
+		 * partition position.
+		 */
+		proute->subplan_partition_offsets =
+			palloc(num_update_rri * sizeof(int));
+
+		/*
+		 * For UPDATEs, we require an additional tuple slot for storing
+		 * transient tuples that are converted to the root table descriptor.
+		 */
+		proute->root_tuple_slot = MakeTupleTableSlot();
+	}
+	else
+	{
+		/*
+		 * For inserts, we need to create all new result rels, so avoid
+		 * repeated pallocs by allocating memory for all the result rels in
+		 * bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(proute->num_partitions *
+												  sizeof(ResultRelInfo));
+	}
+
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
 	 * given partition's rowtype.  It is attached to the caller-specified node
@@ -81,38 +131,81 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 	 */
 	proute->partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = (ResultRelInfo *) palloc0(proute->num_partitions *
-											  sizeof(ResultRelInfo));
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (is_update)
+		{
+			/* Is this leaf partition present in the update resultrel? */
+			if (update_rri_index < num_update_rri &&
+				RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
+			{
+				leaf_part_rri = &update_rri[update_rri_index];
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required when we convert the partition's tuple to be
+				 * compatible with the root partitioned table's tuple
+				 * descriptor.  When generating the per-subplan UPDATE result
+				 * rels, this was not set.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+
+				/*
+				 * Save the position of this update rel in the leaf partitions
+				 * array
+				 */
+				proute->subplan_partition_offsets[update_rri_index] = i;
+
+				update_rri_index++;
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = &leaf_part_arr[i];
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in proute->partitions are
-		 * eventually closed by the caller.
+		 * If we didn't open the partition rel, it means we haven't initialized
+		 * the result rel either.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above including the leaf
+			 * partitions. Note that each of the newly opened relations in
+			 * proute->partitions are eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri,
+							  partrel,
+							  resultRTindex,
+							  rel,
+							  estate->es_instrument);
+		}
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
 		 * Save a tuple conversion map to convert a tuple routed to this
 		 * partition from the parent's type to the partition's.
 		 */
-		proute->partition_tupconv_maps[i] =
+		proute->parent_child_tupconv_maps[i] =
 			convert_tuples_by_name(tupDesc, part_tupdesc,
 								   gettext_noop("could not convert row type"));
 
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  resultRTindex,
-						  rel,
-						  estate->es_instrument);
-
 		/*
-		 * Verify result relation is a valid target for INSERT.
+		 * Verify result relation is a valid target for an INSERT.  An UPDATE
+		 * of a partition-key becomes a DELETE+INSERT operation, so this check
+		 * is still required when the operation is CMD_UPDATE.
 		 */
 		CheckValidResultRel(leaf_part_rri, CMD_INSERT);
 
@@ -132,10 +225,16 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		proute->partitions[i] = leaf_part_rri++;
+		proute->partitions[i] = leaf_part_rri;
 		i++;
 	}
 
+	/*
+	 * For UPDATE, we should have found all the per-subplan resultrels in the
+	 * leaf partitions.
+	 */
+	Assert(!is_update || update_rri_index == num_update_rri);
+
 	return proute;
 }
 
@@ -259,6 +358,98 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 }
 
 /*
+ * ExecSetupChildParentMapForLeaf -- Initialize the per-leaf-partition
+ * child-to-root tuple conversion map array.
+ *
+ * This map is required for capturing transition tuples when the target table
+ * is a partitioned table. For a tuple that is routed by an INSERT or UPDATE,
+ * we need to convert it from the leaf partition to the target table
+ * descriptor.
+ */
+void
+ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute)
+{
+	Assert(proute != NULL);
+
+	/*
+	 * These array elements get filled up with maps on an on-demand basis.
+	 * Initially just set all of them to NULL.
+	 */
+	proute->child_parent_tupconv_maps =
+		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) *
+										proute->num_partitions);
+
+	/* Same is the case for this array. All the values are set to false */
+	proute->child_parent_map_not_required =
+		(bool *) palloc0(sizeof(bool) * proute->num_partitions);
+}
+
+/*
+ * CreateTupConvMapForLeaf -- For a given leaf partition index, create a tuple
+ * conversion map, if not already allocated.
+ *
+ * This function should be called only after it is found that
+ * child_parent_map_not_required is false for the given partition.
+ */
+TupleConversionMap *
+CreateTupConvMapForLeaf(PartitionTupleRouting *proute,
+						ResultRelInfo *rootRelInfo, int leaf_index)
+{
+	ResultRelInfo **resultRelInfos = proute->partitions;
+	TupleConversionMap **map;
+
+	Assert(proute->child_parent_tupconv_maps != NULL);
+	map = proute->child_parent_tupconv_maps + leaf_index;
+
+	/*
+	 * Either the map is already allocated, or it is yet to be determined if it
+	 * is required.
+	 */
+	if (!*map)
+	{
+		*map =
+			convert_tuples_by_name(RelationGetDescr(resultRelInfos[leaf_index]->ri_RelationDesc),
+								   RelationGetDescr(rootRelInfo->ri_RelationDesc),
+								   gettext_noop("could not convert row type"));
+
+		/* Update the array element with the new info */
+		proute->child_parent_map_not_required[leaf_index] = (*map == NULL);
+	}
+	return *map;
+}
+
+/*
+ * ConvertPartitionTupleSlot -- convenience function for tuple conversion using
+ * 'map'. The tuple, if converted, is stored in 'new_slot', and 'p_my_slot' is
+ * updated with the 'new_slot'. 'new_slot' typically should be one of the
+ * dedicated partition tuple slots. If map is NULL, keeps p_my_slot unchanged.
+ *
+ * Returns the converted tuple, unless map is NULL, in which case original
+ * tuple is returned unmodified.
+ */
+HeapTuple
+ConvertPartitionTupleSlot(TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot,
+						  TupleTableSlot **p_my_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor, as per converted tuple.
+	 */
+	*p_my_slot = new_slot;
+	Assert(new_slot != NULL);
+	ExecSetSlotDescriptor(new_slot, map->outdesc);
+	ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
+/*
  * ExecCleanupTupleRouting -- Clean up objects allocated for partition tuple
  * routing.
  *
@@ -268,6 +459,7 @@ void
 ExecCleanupTupleRouting(PartitionTupleRouting * proute)
 {
 	int			i;
+	int			subplan_index;
 
 	/*
 	 * Remember, proute->partition_dispatch_info[0] corresponds to the root
@@ -284,15 +476,34 @@ ExecCleanupTupleRouting(PartitionTupleRouting * proute)
 		ExecDropSingleTupleTableSlot(pd->tupslot);
 	}
 
-	for (i = 0; i < proute->num_partitions; i++)
+	for (subplan_index = i = 0; i < proute->num_partitions; i++)
 	{
 		ResultRelInfo *resultRelInfo = proute->partitions[i];
 
+		/*
+		 * If this result rel is one of the UPDATE subplan result rels, let
+		 * ExecEndPlan() close it. For INSERT or COPY,
+		 * proute->subplan_partition_offsets will always be NULL. Note that the
+		 * subplan_partition_offsets array and the partitions array have the
+		 * partitions in the same order. So, while we iterate over partitions
+		 * array, we also iterate over the subplan_partition_offsets array in
+		 * order to get to know which of the result rels are present in the
+		 * UPDATE subplans.
+		 */
+		if (proute->subplan_partition_offsets &&
+			proute->subplan_partition_offsets[subplan_index] == i)
+		{
+			subplan_index++;
+			continue;
+		}
+
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
 
-	/* Release the standalone partition tuple descriptor, if any */
+	/* Release the standalone partition tuple descriptors, if any */
+	if (proute->root_tuple_slot)
+		ExecDropSingleTupleTableSlot(proute->root_tuple_slot);
 	if (proute->partition_tuple_slot)
 		ExecDropSingleTupleTableSlot(proute->partition_tuple_slot);
 }
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 55dff5b..5ffb231 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -46,6 +46,7 @@
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -63,7 +64,11 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 EState *estate,
 					 bool canSetTag,
 					 TupleTableSlot **returning);
-
+static ResultRelInfo *getASTriggerResultRelInfo(ModifyTableState *node);
+static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
+static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
+static inline TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
+													int whichplan);
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
  * target relation's rowtype
@@ -266,6 +271,7 @@ ExecInsert(ModifyTableState *mtstate,
 	Oid			newId;
 	List	   *recheckIndexes = NIL;
 	TupleTableSlot *result = NULL;
+	TransitionCaptureState *ar_insert_trig_tcs;
 
 	/*
 	 * get the heap tuple out of the tuple table slot, making sure we have a
@@ -283,7 +289,6 @@ ExecInsert(ModifyTableState *mtstate,
 	{
 		int			leaf_part_index;
 		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-		TupleConversionMap *map;
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
@@ -332,8 +337,10 @@ ExecInsert(ModifyTableState *mtstate,
 				 * back to tuplestore format.
 				 */
 				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+
 				mtstate->mt_transition_capture->tcs_map =
-					mtstate->mt_transition_tupconv_maps[leaf_part_index];
+					TupConvMapForLeaf(proute, saved_resultRelInfo,
+									  leaf_part_index);
 			}
 			else
 			{
@@ -346,30 +353,20 @@ ExecInsert(ModifyTableState *mtstate,
 			}
 		}
 		if (mtstate->mt_oc_transition_capture != NULL)
+		{
 			mtstate->mt_oc_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[leaf_part_index];
+				TupConvMapForLeaf(proute, saved_resultRelInfo,
+								  leaf_part_index);
+		}
 
 		/*
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = proute->partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = proute->partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[leaf_part_index],
+										  tuple,
+										  proute->partition_tuple_slot,
+										  &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -450,6 +447,7 @@ ExecInsert(ModifyTableState *mtstate,
 	}
 	else
 	{
+		WCOKind		wco_kind;
 		/*
 		 * We always check the partition constraint, including when the tuple
 		 * got here via tuple-routing.  However we don't need to in the latter
@@ -467,14 +465,21 @@ ExecInsert(ModifyTableState *mtstate,
 		tuple->t_tableOid = RelationGetRelid(resultRelationDesc);
 
 		/*
-		 * Check any RLS INSERT WITH CHECK policies
+		 * Check any RLS WITH CHECK policies.
 		 *
+		 * Normally we should check INSERT policies. But if the insert is part
+		 * of update-row-movement, we should instead check UPDATE policies,
+		 * because we are executing policies defined on the target table, and
+		 * not those defined on the child partitions.
+		 */
+		wco_kind = (mtstate->operation == CMD_UPDATE) ?
+					WCO_RLS_UPDATE_CHECK : WCO_RLS_INSERT_CHECK;
+		/*
 		 * ExecWithCheckOptions() will skip any WCOs which are not of the kind
 		 * we are looking for at this point.
 		 */
 		if (resultRelInfo->ri_WithCheckOptions != NIL)
-			ExecWithCheckOptions(WCO_RLS_INSERT_CHECK,
-								 resultRelInfo, slot, estate);
+			ExecWithCheckOptions(wco_kind, resultRelInfo, slot, estate);
 
 		/*
 		 * No need though if the tuple has been routed, and a BR trigger
@@ -623,9 +628,33 @@ ExecInsert(ModifyTableState *mtstate,
 		setLastTid(&(tuple->t_self));
 	}
 
+	/*
+	 * If this INSERT is part of a partition-key-UPDATE and we are capturing
+	 * transition tuples, put this row into the transition NEW TABLE.
+	 * (Similarly, the deleted row is added to the OLD TABLE.)  We need to do
+	 * this separately for DELETE and INSERT because they happen on different
+	 * tables.
+	 */
+	ar_insert_trig_tcs = mtstate->mt_transition_capture;
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_new_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo, NULL,
+							 NULL,
+							 tuple,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * We've already captured the NEW TABLE row, so make sure any AR INSERT
+		 * trigger fired below doesn't capture it again.
+		 */
+		ar_insert_trig_tcs = NULL;
+	}
+
 	/* AFTER ROW INSERT Triggers */
 	ExecARInsertTriggers(estate, resultRelInfo, tuple, recheckIndexes,
-						 mtstate->mt_transition_capture);
+						 ar_insert_trig_tcs);
 
 	list_free(recheckIndexes);
 
@@ -679,6 +708,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool *tupleDeleted,
+		   bool processReturning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -686,6 +717,10 @@ ExecDelete(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
+	TransitionCaptureState *ar_delete_trig_tcs;
+
+	if (tupleDeleted)
+		*tupleDeleted = false;
 
 	/*
 	 * get information on the (current) result relation
@@ -850,12 +885,40 @@ ldelete:;
 	if (canSetTag)
 		(estate->es_processed)++;
 
+	/* Tell the caller that the delete actually happened. */
+	if (tupleDeleted)
+		*tupleDeleted = true;
+
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition OLD TABLE, but only if we are capturing transition tuples.
+	 * We need to do this separately for DELETE and INSERT because they happen
+	 * on different tables.
+	 */
+	ar_delete_trig_tcs = mtstate->mt_transition_capture;
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_old_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo,
+							 tupleid,
+							 oldtuple,
+							 NULL,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * We've already captured the OLD TABLE row, so make sure any AR DELETE
+		 * trigger fired below doesn't capture it again.
+		 */
+		ar_delete_trig_tcs = NULL;
+	}
+
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
-						 mtstate->mt_transition_capture);
+						 ar_delete_trig_tcs);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (processReturning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -948,6 +1011,7 @@ ExecUpdate(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	TupleConversionMap *saved_tcs_map = NULL;
 
 	/*
 	 * abort the operation if not running transactions
@@ -1019,6 +1083,7 @@ ExecUpdate(ModifyTableState *mtstate,
 	else
 	{
 		LockTupleMode lockmode;
+		bool		partition_constraint_failed;
 
 		/*
 		 * Constraints might reference the tableoid column, so initialize
@@ -1034,22 +1099,142 @@ ExecUpdate(ModifyTableState *mtstate,
 		 * (We don't need to redo triggers, however.  If there are any BEFORE
 		 * triggers then trigger.c will have done heap_lock_tuple to lock the
 		 * correct tuple, so there's no need to do them again.)
-		 *
-		 * ExecWithCheckOptions() will skip any WCOs which are not of the kind
-		 * we are looking for at this point.
 		 */
 lreplace:;
-		if (resultRelInfo->ri_WithCheckOptions != NIL)
+
+		/*
+		 * If the partition constraint fails, this row might get moved to
+		 * another partition, in which case we should check the RLS CHECK
+		 * policy just before inserting into the new partition, rather than
+		 * doing it here.  A trigger on that partition might again change the
+		 * row, so skip the WCO checks if the partition constraint fails.
+		 */
+		partition_constraint_failed =
+			resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate);
+
+		if (!partition_constraint_failed &&
+			resultRelInfo->ri_WithCheckOptions != NIL)
+		{
+			/*
+			 * ExecWithCheckOptions() will skip any WCOs which are not of the
+			 * kind we are looking for at this point.
+			 */
 			ExecWithCheckOptions(WCO_RLS_UPDATE_CHECK,
 								 resultRelInfo, slot, estate);
+		}
+
+		/*
+		 * If a partition check failed, try to move the row into the right
+		 * partition.
+		 */
+		if (partition_constraint_failed)
+		{
+			bool		tuple_deleted;
+			TupleTableSlot *ret_slot;
+			PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+			int			map_index;
+			TupleConversionMap *tupconv_map;
+
+			/*
+			 * When an UPDATE is run directly on a leaf partition, we will not
+			 * have partition tuple routing set up.  In that case, fail with a
+			 * partition constraint violation error.
+			 */
+			if (proute == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for the DELETE; we want to return
+			 * rows from the subsequent INSERT instead.
+			 */
+			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
+					   &tuple_deleted, false, false);
+
+			/*
+			 * If the DELETE didn't happen for some reason (e.g. a trigger
+			 * prevented it, or it was already deleted by self, or it was
+			 * concurrently deleted by another transaction), then we should
+			 * skip the INSERT as well; otherwise, an UPDATE could cause an
+			 * increase in the total number of rows across all partitions,
+			 * which is clearly wrong.
+			 *
+			 * For a normal UPDATE, the case where the tuple has been the
+			 * subject of a concurrent UPDATE or DELETE would be handled by the
+			 * EvalPlanQual machinery, but for an UPDATE that we've translated
+			 * into a DELETE from this partition and an INSERT into some other
+			 * partition, that's not available, because CTID chains can't span
+			 * relation boundaries.  We mimic the semantics to a limited extent
+			 * by skipping the INSERT if the DELETE fails to find a tuple. This
+			 * ensures that two concurrent attempts to UPDATE the same tuple at
+			 * the same time can't turn one tuple into two, and that an UPDATE
+			 * of a just-deleted tuple can't resurrect it.
+			 */
+			if (!tuple_deleted)
+				return NULL;
+
+			/*
+			 * Updates set the transition capture map only when a new subplan
+			 * is chosen.  But for inserts, it is set for each row.  So after
+			 * the INSERT, we need to revert to the map created for UPDATE;
+			 * otherwise the next UPDATE will incorrectly use the one created
+			 * for INSERT.  So first save the one created for UPDATE.
+			 */
+			if (mtstate->mt_transition_capture)
+				saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
+
+			/*
+			 * resultRelInfo is one of the per-subplan resultRelInfos.  So we
+			 * should convert the tuple into the root's tuple descriptor,
+			 * since ExecInsert() starts the search from the root.  The tuple
+			 * conversion map list is in the order of mtstate->resultRelInfo[],
+			 * so to retrieve the one for this resultRel, we need to know the
+			 * position of the resultRel in mtstate->resultRelInfo[].
+			 */
+			map_index = resultRelInfo - mtstate->resultRelInfo;
+			Assert(map_index >= 0 && map_index < mtstate->mt_nplans);
+			tupconv_map = tupconv_map_for_subplan(mtstate, map_index);
+			tuple = ConvertPartitionTupleSlot(tupconv_map,
+											  tuple,
+											  proute->root_tuple_slot,
+											  &slot);
+
+			/*
+			 * For ExecInsert(), make it look like we are inserting into the
+			 * root.
+			 */
+			Assert(mtstate->rootResultRelInfo != NULL);
+			estate->es_result_relation_info = mtstate->rootResultRelInfo;
+
+			ret_slot = ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+
+			/*
+			 * Restore the active result relation and the active transition
+			 * capture map that we changed above.
+			 */
+			estate->es_result_relation_info = resultRelInfo;
+			if (mtstate->mt_transition_capture)
+			{
+				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+				mtstate->mt_transition_capture->tcs_map = saved_tcs_map;
+			}
+			return ret_slot;
+		}
 
 		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We've already checked the partition constraint above; however, we
+		 * must still ensure the tuple passes all other constraints, so we will
+		 * call ExecConstraints() and have it validate all remaining checks.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate, true);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1477,7 +1662,6 @@ static void
 ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 {
 	ResultRelInfo *targetRelInfo = getASTriggerResultRelInfo(mtstate);
-	int			i;
 
 	/* Check for transition tables on the directly targeted relation. */
 	mtstate->mt_transition_capture =
@@ -1500,62 +1684,140 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	if (mtstate->mt_transition_capture != NULL ||
 		mtstate->mt_oc_transition_capture != NULL)
 	{
-		int			numResultRelInfos;
-		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-		numResultRelInfos = (proute != NULL ?
-							 proute->num_partitions :
-							 mtstate->mt_nplans);
+		ExecSetupChildParentMapForTcs(mtstate);
 
 		/*
-		 * Build array of conversion maps from each child's TupleDesc to the
-		 * one used in the tuplestore.  The map pointers may be NULL when no
-		 * conversion is necessary, which is hopefully a common case for
-		 * partitions.
+		 * Install the conversion map for the first plan for UPDATE and DELETE
+		 * operations.  It will be advanced each time we switch to the next
+		 * plan.  (INSERT operations set it every time, so we need not update
+		 * mtstate->mt_oc_transition_capture here.)
 		 */
-		mtstate->mt_transition_tupconv_maps = (TupleConversionMap **)
-			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+		if (mtstate->mt_transition_capture && mtstate->operation != CMD_INSERT)
+			mtstate->mt_transition_capture->tcs_map =
+				tupconv_map_for_subplan(mtstate, 0);
+	}
+}
 
-		/* Choose the right set of partitions */
-		if (proute != NULL)
-		{
-			/*
-			 * For tuple routing among partitions, we need TupleDescs based on
-			 * the partition routing table.
-			 */
-			ResultRelInfo **resultRelInfos = proute->partitions;
+/*
+ * Initialize the child-to-root tuple conversion map array for UPDATE subplans.
+ *
+ * This map array is required to convert tuples from a subplan result rel's
+ * format to the target table's. This requirement arises for two independent
+ * scenarios:
+ * 1. For update-tuple-routing.
+ * 2. For capturing tuples in transition tables.
+ */
+void
+ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
+{
+	ResultRelInfo *targetRelInfo = getASTriggerResultRelInfo(mtstate);
+	ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+	TupleDesc	outdesc;
+	int			numResultRelInfos = mtstate->mt_nplans;
+	int			i;
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
-		}
-		else
-		{
-			/* Otherwise we need the ResultRelInfo for each subplan. */
-			ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+	/*
+	 * First check if there is already a per-subplan array allocated. Even if
+	 * there is already a per-leaf map array, we won't require a per-subplan
+	 * one, since we will use the subplan offset array to convert the subplan
+	 * index to a per-leaf index.
+	 */
+	if (mtstate->mt_per_subplan_tupconv_maps ||
+		(mtstate->mt_partition_tuple_routing &&
+		mtstate->mt_partition_tuple_routing->child_parent_tupconv_maps))
+		return;
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
-		}
+	/*
+	 * Build array of conversion maps from each child's TupleDesc to the one
+	 * used in the target relation.  The map pointers may be NULL when
+	 * no conversion is necessary, which is hopefully a common case.
+	 */
 
+	/* Get tuple descriptor of the target rel. */
+	outdesc = RelationGetDescr(targetRelInfo->ri_RelationDesc);
+
+	mtstate->mt_per_subplan_tupconv_maps = (TupleConversionMap **)
+		palloc(sizeof(TupleConversionMap *) * numResultRelInfos);
+
+	for (i = 0; i < numResultRelInfos; ++i)
+	{
+		mtstate->mt_per_subplan_tupconv_maps[i] =
+			convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
+								   outdesc,
+								   gettext_noop("could not convert row type"));
+	}
+}
+
+/*
+ * Initialize the child-to-root tuple conversion map array required for
+ * capturing transition tuples.
+ *
+ * The map array can be indexed either by subplan index or by leaf-partition
+ * index.  For transition tables, we need subplan-indexed access to the map,
+ * and where tuple routing is present, we also require leaf-indexed access.
+ */
+static void
+ExecSetupChildParentMapForTcs(ModifyTableState *mtstate)
+{
+	PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+
+	/*
+	 * If partition tuple routing is set up, we will require partition-indexed
+	 * access. In that case, create the map array indexed by partition; we will
+	 * still be able to access the maps using a subplan index by converting the
+	 * subplan index to a partition index using 'subplan_partition_offsets'. If
+	 * tuple routing is not set up, we don't require partition-indexed
+	 * access. In that case, create just a subplan-indexed map.
+	 */
+	if (proute)
+	{
 		/*
-		 * Install the conversion map for the first plan for UPDATE and DELETE
-		 * operations.  It will be advanced each time we switch to the next
-		 * plan.  (INSERT operations set it every time, so we need not update
-		 * mtstate->mt_oc_transition_capture here.)
+		 * If a partition-indexed map array is to be created, the subplan map
+		 * array has to be NULL.  If the subplan map array is already created,
+		 * we won't be able to access the map using a partition index.
 		 */
-		if (mtstate->mt_transition_capture)
-			mtstate->mt_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[0];
+		Assert(mtstate->mt_per_subplan_tupconv_maps == NULL);
+
+		ExecSetupChildParentMapForLeaf(proute);
+	}
+	else
+		ExecSetupChildParentMapForSubplan(mtstate);
+}
+
+/*
+ * For a given subplan index, get the tuple conversion map.
+ */
+static inline TupleConversionMap *
+tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
+{
+	/*
+	 * If a partition-indexed tuple conversion map array is allocated, we need
+	 * to first get the index into the partition array. Exactly *one* of the
+	 * two arrays is allocated: if a partition-indexed array is required, we
+	 * don't need a subplan-indexed array, since we can translate the subplan
+	 * index into a partition index; and we create a subplan-indexed array
+	 * *only* if a partition-indexed array is not required.
+	 */
+	if (mtstate->mt_per_subplan_tupconv_maps == NULL)
+	{
+		int		leaf_index;
+		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+
+		/*
+		 * If the subplan-indexed array is NULL, things should have been
+		 * arranged to convert the subplan index to a partition index.
+		 */
+		Assert(proute && proute->subplan_partition_offsets != NULL);
+
+		leaf_index = proute->subplan_partition_offsets[whichplan];
+
+		return TupConvMapForLeaf(proute, getASTriggerResultRelInfo(mtstate),
+								 leaf_index);
+	}
+	else
+	{
+		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+		return mtstate->mt_per_subplan_tupconv_maps[whichplan];
 	}
 }
 
@@ -1662,15 +1924,13 @@ ExecModifyTable(PlanState *pstate)
 				/* Prepare to convert transition tuples from this child. */
 				if (node->mt_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
 					node->mt_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						tupconv_map_for_subplan(node, node->mt_whichplan);
 				}
 				if (node->mt_oc_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
 					node->mt_oc_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						tupconv_map_for_subplan(node, node->mt_whichplan);
 				}
 				continue;
 			}
@@ -1787,7 +2047,8 @@ ExecModifyTable(PlanState *pstate)
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1831,9 +2092,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ResultRelInfo *saved_resultRelInfo;
 	ResultRelInfo *resultRelInfo;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	bool		update_tuple_routing_needed = node->partColsUpdated;
 	PartitionTupleRouting *proute = NULL;
 	int			num_partitions = 0;
 
@@ -1908,6 +2172,16 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(resultRelInfo, mtstate->mt_onconflict != ONCONFLICT_NONE);
 
+		/*
+		 * If this is an UPDATE and a BEFORE UPDATE trigger is present, the
+		 * trigger itself might modify the partition-key values. So arrange for
+		 * tuple routing.
+		 */
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+			operation == CMD_UPDATE)
+			update_tuple_routing_needed = true;
+
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
 		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
@@ -1945,15 +2219,32 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	else
 		rel = mtstate->resultRelInfo->ri_RelationDesc;
 
-	/* Build state for INSERT tuple routing */
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/*
+	 * If it's not a partitioned table after all, UPDATE tuple routing should
+	 * not be attempted.
+	 */
+	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		update_tuple_routing_needed = false;
+
+	/*
+	 * Build state for tuple routing if it's an INSERT or if it's an UPDATE
+	 * of the partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || update_tuple_routing_needed))
 	{
 		proute = mtstate->mt_partition_tuple_routing =
 			ExecSetupPartitionTupleRouting(mtstate,
 										   rel, node->nominalRelation,
 										   estate);
 		num_partitions = proute->num_partitions;
+
+		/*
+		 * These are required as reference objects for mapping partition
+		 * attno's in expressions such as WithCheckOptions and RETURNING.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	}
 
 	/*
@@ -1964,6 +2255,17 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		ExecSetupTransitionCaptureState(mtstate, estate);
 
 	/*
+	 * Construct mapping from each of the per-subplan partition attnos to the
+	 * root attno.  This is required when, during update row movement, the
+	 * tuple descriptor of a source partition does not match the root
+	 * partitioned table descriptor.  In such a case we need to convert tuples
+	 * to the root tuple descriptor, because the search for the destination
+	 * partition starts from the root.  Skip this setup if update tuple
+	 * routing is not needed.
+	 */
+	if (update_tuple_routing_needed)
+		ExecSetupChildParentMapForSubplan(mtstate);
+
+	/*
 	 * Initialize any WITH CHECK OPTION constraints if needed.
 	 */
 	resultRelInfo = mtstate->resultRelInfo;
@@ -1993,26 +2295,29 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE row
+	 * movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *first_wcoList;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition. Whereas for UPDATE, there are as many WCOs as there are
+		 * plans. So in either case, use the WCO expression of the first
+		 * resultRelInfo as a reference to calculate attno's for the WCO
+		 * expression of each of the partitions. We make a copy of the WCO
+		 * qual for each partition. Note that, if there are SubPlans in there,
+		 * they all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
-			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
+		Assert(update_tuple_routing_needed ||
+			   (operation == CMD_INSERT &&
+				list_length(node->withCheckOptionLists) == 1 &&
+				mtstate->mt_nplans == 1));
+
+		first_wcoList = linitial(node->withCheckOptionLists);
 		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
@@ -2021,17 +2326,26 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			ListCell   *ll;
 
 			resultRelInfo = proute->partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the update
+			 * result rels, that result rel would already have WithCheckOptions
+			 * initialized.
+			 */
+			if (resultRelInfo->ri_WithCheckOptions)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel, NULL);
+			mapped_wcoList = map_partition_varattnos(first_wcoList,
+													firstVarno,
+													partrel, firstResultRel,
+													NULL);
 			foreach(ll, mapped_wcoList)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
@@ -2048,7 +2362,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -2084,22 +2398,35 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *rlist;
 
 			resultRelInfo = proute->partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the update
+			 * result rels, that result rel would already have a returningList
+			 * built.
+			 */
+			if (resultRelInfo->ri_projectReturning)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel, NULL);
+			/*
+			 * Use the returning expression of the first resultRelInfo as a
+			 * reference to calculate attno's for the returning expression of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList,
+											firstVarno,
+											partrel, firstResultRel, NULL);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index ddbbc79..747e545 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -204,6 +204,7 @@ _copyModifyTable(const ModifyTable *from)
 	COPY_SCALAR_FIELD(canSetTag);
 	COPY_SCALAR_FIELD(nominalRelation);
 	COPY_NODE_FIELD(partitioned_rels);
+	COPY_SCALAR_FIELD(partColsUpdated);
 	COPY_NODE_FIELD(resultRelations);
 	COPY_SCALAR_FIELD(resultRelIndex);
 	COPY_SCALAR_FIELD(rootResultRelIndex);
@@ -2263,6 +2264,7 @@ _copyPartitionedChildRelInfo(const PartitionedChildRelInfo *from)
 
 	COPY_SCALAR_FIELD(parent_relid);
 	COPY_NODE_FIELD(child_rels);
+	COPY_SCALAR_FIELD(part_cols_updated);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 30ccc9c..99b554a 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -908,6 +908,7 @@ _equalPartitionedChildRelInfo(const PartitionedChildRelInfo *a, const Partitione
 {
 	COMPARE_SCALAR_FIELD(parent_relid);
 	COMPARE_NODE_FIELD(child_rels);
+	COMPARE_SCALAR_FIELD(part_cols_updated);
 
 	return true;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 5e72df1..b35bce3 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -372,6 +372,7 @@ _outModifyTable(StringInfo str, const ModifyTable *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(partColsUpdated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_INT_FIELD(resultRelIndex);
 	WRITE_INT_FIELD(rootResultRelIndex);
@@ -2105,6 +2106,7 @@ _outModifyTablePath(StringInfo str, const ModifyTablePath *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(partColsUpdated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_NODE_FIELD(subroots);
@@ -2527,6 +2529,7 @@ _outPartitionedChildRelInfo(StringInfo str, const PartitionedChildRelInfo *node)
 
 	WRITE_UINT_FIELD(parent_relid);
 	WRITE_NODE_FIELD(child_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 9925866..22d8b9d 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1568,6 +1568,7 @@ _readModifyTable(void)
 	READ_BOOL_FIELD(canSetTag);
 	READ_UINT_FIELD(nominalRelation);
 	READ_NODE_FIELD(partitioned_rels);
+	READ_BOOL_FIELD(partColsUpdated);
 	READ_NODE_FIELD(resultRelations);
 	READ_INT_FIELD(resultRelIndex);
 	READ_INT_FIELD(rootResultRelIndex);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index c5304b7..fd1a583 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1364,7 +1364,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 			case RTE_RELATION:
 				if (rte->relkind == RELKIND_PARTITIONED_TABLE)
 					partitioned_rels =
-						get_partitioned_child_rels(root, rel->relid);
+						get_partitioned_child_rels(root, rel->relid, NULL);
 				break;
 			case RTE_SUBQUERY:
 				build_partitioned_rels = true;
@@ -1403,7 +1403,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		{
 			List	   *cprels;
 
-			cprels = get_partitioned_child_rels(root, childrel->relid);
+			cprels = get_partitioned_child_rels(root, childrel->relid, NULL);
 			partitioned_rels = list_concat(partitioned_rels,
 										   list_copy(cprels));
 		}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e599283..86e7e74 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -279,6 +279,7 @@ static ProjectSet *make_project_set(List *tlist, Plan *subplan);
 static ModifyTable *make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool partColsUpdated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
@@ -2373,6 +2374,7 @@ create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 							best_path->canSetTag,
 							best_path->nominalRelation,
 							best_path->partitioned_rels,
+							best_path->partColsUpdated,
 							best_path->resultRelations,
 							subplans,
 							best_path->withCheckOptionLists,
@@ -6442,6 +6444,7 @@ static ModifyTable *
 make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool partColsUpdated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam)
@@ -6468,6 +6471,7 @@ make_modifytable(PlannerInfo *root,
 	node->canSetTag = canSetTag;
 	node->nominalRelation = nominalRelation;
 	node->partitioned_rels = partitioned_rels;
+	node->partColsUpdated = partColsUpdated;
 	node->resultRelations = resultRelations;
 	node->resultRelIndex = -1;	/* will be set correctly in setrefs.c */
 	node->rootResultRelIndex = -1;	/* will be set correctly in setrefs.c */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 7b52dad..5387043 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1101,6 +1101,7 @@ inheritance_planner(PlannerInfo *root)
 	Query	   *parent_parse;
 	Bitmapset  *parent_relids = bms_make_singleton(top_parentRTindex);
 	PlannerInfo **parent_roots = NULL;
+	bool		partColsUpdated = false;
 
 	Assert(parse->commandType != CMD_INSERT);
 
@@ -1172,7 +1173,8 @@ inheritance_planner(PlannerInfo *root)
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		nominalRelation = top_parentRTindex;
-		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex);
+		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex,
+													  &partColsUpdated);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
 	}
@@ -1512,6 +1514,7 @@ inheritance_planner(PlannerInfo *root)
 									 parse->canSetTag,
 									 nominalRelation,
 									 partitioned_rels,
+									 partColsUpdated,
 									 resultRelations,
 									 subpaths,
 									 subroots,
@@ -2123,6 +2126,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 										parse->canSetTag,
 										parse->resultRelation,
 										NIL,
+										false,
 										list_make1_int(parse->resultRelation),
 										list_make1(path),
 										list_make1(root),
@@ -6155,17 +6159,24 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 /*
  * get_partitioned_child_rels
  *		Returns a list of the RT indexes of the partitioned child relations
- *		with rti as the root parent RT index.
+ *		with rti as the root parent RT index. Also sets
+ *		*part_cols_updated to true if any of the root rte's updated
+ *		columns are used in the partition key, either of the relation whose
+ *		RTI is specified or of any of its child relations.
  *
  * Note: This function might get called even for range table entries that
  * are not partitioned tables; in such a case, it will simply return NIL.
  */
 List *
-get_partitioned_child_rels(PlannerInfo *root, Index rti)
+get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   bool *part_cols_updated)
 {
 	List	   *result = NIL;
 	ListCell   *l;
 
+	if (part_cols_updated)
+		*part_cols_updated = false;
+
 	foreach(l, root->pcinfo_list)
 	{
 		PartitionedChildRelInfo *pc = lfirst_node(PartitionedChildRelInfo, l);
@@ -6173,6 +6184,8 @@ get_partitioned_child_rels(PlannerInfo *root, Index rti)
 		if (pc->parent_relid == rti)
 		{
 			result = pc->child_rels;
+			if (part_cols_updated)
+				*part_cols_updated = pc->part_cols_updated;
 			break;
 		}
 	}
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 7ef391f..e6b1534 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -105,7 +105,8 @@ static void expand_partitioned_rtentry(PlannerInfo *root,
 						   RangeTblEntry *parentrte,
 						   Index parentRTindex, Relation parentrel,
 						   PlanRowMark *top_parentrc, LOCKMODE lockmode,
-						   List **appinfos, List **partitioned_child_rels);
+						   List **appinfos, List **partitioned_child_rels,
+						   bool *part_cols_updated);
 static void expand_single_inheritance_child(PlannerInfo *root,
 								RangeTblEntry *parentrte,
 								Index parentRTindex, Relation parentrel,
@@ -1461,16 +1462,19 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	if (RelationGetPartitionDesc(oldrelation) != NULL)
 	{
 		List	   *partitioned_child_rels = NIL;
+		bool		part_cols_updated = false;
 
 		Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
 
 		/*
 		 * If this table has partitions, recursively expand them in the order
-		 * in which they appear in the PartitionDesc.
+		 * in which they appear in the PartitionDesc.  While at it, also
+		 * note whether any partition key columns are being updated.
 		 */
 		expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
 								   lockmode, &root->append_rel_list,
-								   &partitioned_child_rels);
+								   &partitioned_child_rels,
+								   &part_cols_updated);
 
 		/*
 		 * We keep a list of objects in root, each of which maps a root
@@ -1487,6 +1491,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 			pcinfo = makeNode(PartitionedChildRelInfo);
 			pcinfo->parent_relid = rti;
 			pcinfo->child_rels = partitioned_child_rels;
+			pcinfo->part_cols_updated = part_cols_updated;
 			root->pcinfo_list = lappend(root->pcinfo_list, pcinfo);
 		}
 	}
@@ -1563,7 +1568,8 @@ static void
 expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 						   Index parentRTindex, Relation parentrel,
 						   PlanRowMark *top_parentrc, LOCKMODE lockmode,
-						   List **appinfos, List **partitioned_child_rels)
+						   List **appinfos, List **partitioned_child_rels,
+						   bool *part_cols_updated)
 {
 	int			i;
 	RangeTblEntry *childrte;
@@ -1578,6 +1584,17 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 
 	Assert(parentrte->inh);
 
+	/*
+	 * Note down whether any partition key cols are being updated. Though it's
+	 * the root partitioned table's updatedCols we are interested in, we
+	 * instead use parentrte to get the updatedCols. This is convenient because
+	 * parentrte already has the root partrel's updatedCols translated to match
+	 * the attribute ordering of parentrel.
+	 */
+	if (!*part_cols_updated)
+		*part_cols_updated =
+			has_partition_attrs(parentrel, parentrte->updatedCols, NULL);
+
 	/* First expand the partitioned table itself. */
 	expand_single_inheritance_child(root, parentrte, parentRTindex, parentrel,
 									top_parentrc, parentrel,
@@ -1617,7 +1634,8 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 		if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 			expand_partitioned_rtentry(root, childrte, childRTindex,
 									   childrel, top_parentrc, lockmode,
-									   appinfos, partitioned_child_rels);
+									   appinfos, partitioned_child_rels,
+									   part_cols_updated);
 
 		/* Close child relation, but keep locks */
 		heap_close(childrel, NoLock);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 48b4db7..96ab100 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3274,6 +3274,8 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partitioned_rels' is an integer list of RT indexes of non-leaf tables in
  *		the partition tree, if this is an UPDATE/DELETE to a partitioned table.
  *		Otherwise NIL.
+ * 'partColsUpdated' is true if any partitioning columns are being updated,
+ *		either from the target relation or a descendant partitioned table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
  * 'subpaths' is a list of Path(s) producing source data (one per rel)
  * 'subroots' is a list of PlannerInfo structs (one per rel)
@@ -3287,6 +3289,7 @@ ModifyTablePath *
 create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool partColsUpdated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
@@ -3354,6 +3357,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->canSetTag = canSetTag;
 	pathnode->nominalRelation = nominalRelation;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
+	pathnode->partColsUpdated = partColsUpdated;
 	pathnode->resultRelations = resultRelations;
 	pathnode->subpaths = subpaths;
 	pathnode->subroots = subroots;
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index b5df357..5aede76 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -62,11 +62,24 @@ typedef struct PartitionDispatchData *PartitionDispatch;
  *								for every leaf partition in the partition tree.
  * num_partitions				Number of leaf partitions in the partition tree
  *								(= 'partitions' array length)
- * partition_tupconv_maps		Array of TupleConversionMap objects with one
+ * parent_child_tupconv_maps	Array of TupleConversionMap objects with one
  *								entry for every leaf partition (required to
  *								convert input tuple based on the root table's
  *								rowtype to a leaf partition's rowtype after
  *								tuple routing is done)
+ * child_parent_tupconv_maps	Array of TupleConversionMap objects with one
+ *								entry for every leaf partition (required to
+ *								convert input tuple based on the leaf
+ *								partition's rowtype to the root table's rowtype
+ *								after tuple routing is done)
+ * child_parent_map_not_required  Array of bool. True value means that a map is
+ *								determined to be not required for the given
+ *								partition. False means either we haven't yet
+ *								checked if a map is required, or it was
+ *								determined to be required.
+ * subplan_partition_offsets	int array ordered by UPDATE subplans. Each
+ *								element contains the index of the corresponding
+ *								partition in the 'partitions' array.
  * partition_tuple_slot			TupleTableSlot to be used to manipulate any
  *								given leaf partition's rowtype after that
  *								partition is chosen for insertion by
@@ -79,10 +92,25 @@ typedef struct PartitionTupleRouting
 	int			num_dispatch;
 	ResultRelInfo **partitions;
 	int			num_partitions;
-	TupleConversionMap **partition_tupconv_maps;
+	TupleConversionMap **parent_child_tupconv_maps;
+	TupleConversionMap **child_parent_tupconv_maps;
+	bool	   *child_parent_map_not_required;
+	int		   *subplan_partition_offsets;
 	TupleTableSlot *partition_tuple_slot;
+	TupleTableSlot *root_tuple_slot;
 } PartitionTupleRouting;
 
+/*
+ * TupConvMapForLeaf -- For a given leaf partition's index, get the tuple
+ * conversion map.
+ *
+ * If it is already determined that the map is not required, return NULL;
+ * else create one if not already created.
+ */
+#define TupConvMapForLeaf(proute, rootRelInfo, leaf_index)					\
+	((proute)->child_parent_map_not_required[(leaf_index)] ?				\
+	NULL : CreateTupConvMapForLeaf((proute), (rootRelInfo), (leaf_index)))
+
 extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel, Index resultRTindex,
 							   EState *estate);
@@ -90,6 +118,13 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern void ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute);
+extern TupleConversionMap *CreateTupConvMapForLeaf(PartitionTupleRouting *proute,
+						ResultRelInfo *rootRelInfo, int leaf_index);
+extern HeapTuple ConvertPartitionTupleSlot(TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot,
+						  TupleTableSlot **p_my_slot);
 extern void ExecCleanupTupleRouting(PartitionTupleRouting *proute);
 
 #endif							/* EXECPARTITION_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 4bb5cb1..defd5cd 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -991,8 +991,8 @@ typedef struct ModifyTableState
 	/* controls transition table population for specified operation */
 	struct TransitionCaptureState *mt_oc_transition_capture;
 	/* controls transition table population for INSERT...ON CONFLICT UPDATE */
-	TupleConversionMap **mt_transition_tupconv_maps;
-	/* Per plan/partition tuple conversion */
+	TupleConversionMap **mt_per_subplan_tupconv_maps;
+	/* Per plan map for tuple conversion from child to root */
 } ModifyTableState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 74e9fb5..baf3c07 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -219,6 +219,7 @@ typedef struct ModifyTable
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		partColsUpdated;	/* some part key in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	int			resultRelIndex; /* index of first resultRel in plan's list */
 	int			rootResultRelIndex; /* index of the partitioned table root */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 71689b8..6bf68f3 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1674,6 +1674,7 @@ typedef struct ModifyTablePath
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		partColsUpdated;	/* some part key in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	List	   *subpaths;		/* Path(s) producing source data */
 	List	   *subroots;		/* per-target-table PlannerInfos */
@@ -2124,6 +2125,8 @@ typedef struct PartitionedChildRelInfo
 
 	Index		parent_relid;
 	List	   *child_rels;
+	bool		part_cols_updated;	/* is the partition key of any of
+									 * the partitioned tables updated? */
 } PartitionedChildRelInfo;
 
 /*
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 725694f..ef7173f 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -242,6 +242,7 @@ extern ModifyTablePath *create_modifytable_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool partColsUpdated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index 997b91f..29173d3 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -57,7 +57,8 @@ extern Expr *preprocess_phv_expression(PlannerInfo *root, Expr *expr);
 
 extern bool plan_cluster_use_sort(Oid tableOid, Oid indexOid);
 
-extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti);
+extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   bool *part_cols_updated);
 extern List *get_partitioned_child_rels_for_join(PlannerInfo *root,
 									Relids join_relids);
 
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index b69ceaa..95aa0e8 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,36 +198,479 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+---------------------------
+-- UPDATE with row movement
+---------------------------
+-- When a partitioned table receives an UPDATE to the partition key and the
+-- new values no longer meet the partition's bound, the row must be moved to
+-- the correct partition for the new partition key (if one exists). We must
+-- also ensure that updatable views on partitioned tables properly enforce any
+-- WITH CHECK OPTION that is defined. The situation with triggers in this case
+-- also requires thorough testing as partition key updates causing row
+-- movement convert UPDATEs into DELETE+INSERT.
+CREATE TABLE range_parted (
 	a text,
-	b int
-) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
-create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
+) PARTITION BY RANGE (a, b);
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+CREATE TABLE part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+ALTER TABLE range_parted ATTACH PARTITION part_b_20_b_30 FOR VALUES FROM ('b', 20) TO ('b', 30);
+CREATE TABLE part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) PARTITION BY RANGE (c);
+CREATE TABLE part_b_1_b_10 PARTITION OF range_parted FOR VALUES FROM ('b', 1) TO ('b', 10);
+ALTER TABLE range_parted ATTACH PARTITION part_b_10_b_20 FOR VALUES FROM ('b', 10) TO ('b', 20);
+CREATE TABLE part_a_10_a_20 PARTITION OF range_parted FOR VALUES FROM ('a', 10) TO ('a', 20);
+CREATE TABLE part_a_1_a_10 PARTITION OF range_parted FOR VALUES FROM ('a', 1) TO ('a', 10);
+-- Check that partition-key UPDATE works sanely on a partitioned table that
+-- does not have any child partitions.
+UPDATE part_b_10_b_20 set b = b - 6;
+-- Create some more partitions following the above pattern of descending bound
+-- order, but let's make the situation a bit more complex by having the
+-- attribute numbers of the columns vary from their parent partition.
+CREATE TABLE part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) PARTITION BY range (abs(d));
+ALTER TABLE part_c_100_200 DROP COLUMN e, DROP COLUMN c, DROP COLUMN a;
+ALTER TABLE part_c_100_200 ADD COLUMN c numeric, ADD COLUMN e varchar, ADD COLUMN a text;
+ALTER TABLE part_c_100_200 DROP COLUMN b;
+ALTER TABLE part_c_100_200 ADD COLUMN b bigint;
+CREATE TABLE part_d_1_15 PARTITION OF part_c_100_200 FOR VALUES FROM (1) TO (15);
+CREATE TABLE part_d_15_20 PARTITION OF part_c_100_200 FOR VALUES FROM (15) TO (20);
+ALTER TABLE part_b_10_b_20 ATTACH PARTITION part_c_100_200 FOR VALUES FROM (100) TO (200);
+CREATE TABLE part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+ALTER TABLE part_b_10_b_20 ATTACH PARTITION part_c_1_100 FOR VALUES FROM (1) TO (100);
+\set init_range_parted 'truncate range_parted; insert into range_parted VALUES (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted ORDER BY 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+(6 rows)
+
+-- The subplans should appear in bound order
+EXPLAIN (costs off) UPDATE range_parted set c = c - 50 WHERE c > 97;
+             QUERY PLAN              
+-------------------------------------
+ Update on range_parted
+   Update on part_a_1_a_10
+   Update on part_a_10_a_20
+   Update on part_b_1_b_10
+   Update on part_c_1_100
+   Update on part_d_1_15
+   Update on part_d_15_20
+   Update on part_b_20_b_30
+   ->  Seq Scan on part_a_1_a_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_a_10_a_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_1_b_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_1_100
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_1_15
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_15_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_20_b_30
+         Filter: (c > '97'::numeric)
+(22 rows)
+
+-- fail, row movement happens only within the partition subtree.
+UPDATE part_c_100_200 set c = c - 20, d = c WHERE c = 105;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (105, 85, null, b, 15).
+-- fail, no partition key update, so no attempt to move the tuple,
+-- but "a = 'a'" violates the partition constraint enforced by the root partition
+UPDATE part_b_10_b_20 set a = 'a';
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (null, 1, 96, 12, a).
+-- ok, partition key update, no constraint violation
+UPDATE range_parted set d = d - 10 WHERE d > 10;
+-- ok, no partition key update, no constraint violation
+UPDATE range_parted set e = d;
+-- No row found
+UPDATE part_c_1_100 set c = c + 20 WHERE c = 98;
+-- ok, row movement
+UPDATE part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d | e 
+----------------+---+----+-----+---+---
+ part_a_10_a_20 | a | 10 | 200 | 1 | 1
+ part_a_1_a_10  | a |  1 |   1 | 1 | 1
+ part_d_1_15    | b | 12 | 116 | 1 | 1
+ part_d_1_15    | b | 13 | 117 | 2 | 2
+ part_d_1_15    | b | 15 | 125 | 6 | 6
+ part_d_1_15    | b | 17 | 125 | 9 | 9
+(6 rows)
+
+-- fail, row movement happens only within the partition subtree.
+UPDATE part_b_10_b_20 set b = b - 6 WHERE c > 116 returning *;
+ERROR:  new row for relation "part_d_1_15" violates partition constraint
+DETAIL:  Failing row contains (2, 117, 2, b, 7).
+-- ok, row movement, with a subset of rows moved into a different partition.
+UPDATE range_parted set b = b - 6 WHERE c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_a_1_a_10 | a |  4 | 200 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+(6 rows)
+
+-- Common table needed for multiple test scenarios.
+CREATE TABLE mintab(c1 int);
+INSERT into mintab VALUES (120);
+-- update partition key using updatable view.
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 FROM mintab) WITH CHECK OPTION;
+-- ok
+UPDATE upview set c = 199 WHERE b = 4;
+-- fail, check option violation
+UPDATE upview set c = 120 WHERE b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120, 1, 1).
+-- fail, row movement with check option violation
+UPDATE upview set a = 'b', b = 15, c = 120 WHERE b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120, 1, 1).
+-- ok, row movement, check option passes
+UPDATE upview set a = 'b', b = 15 WHERE b = 4;
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+ part_d_1_15   | b | 15 | 199 | 1 | 1
+(6 rows)
+
+-- cleanup
+DROP VIEW upview;
+-- RETURNING with whole-row vars.
+----------------------------------
+:init_range_parted;
+UPDATE range_parted set c = 95 WHERE a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+ range_parted  | a | b  | c  | d  | e 
+---------------+---+----+----+----+---
+ (b,15,95,16,) | b | 15 | 95 | 16 | 
+ (b,17,95,19,) | b | 17 | 95 | 19 | 
+(2 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  95 | 16 | 
+ part_c_1_100   | b | 17 |  95 | 19 | 
+(6 rows)
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+CREATE FUNCTION trans_updatetrigfunc() RETURNS trigger LANGUAGE plpgsql AS
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' ORDER BY a) FROM old_table),
+                 (select string_agg(new_table::text, ', ' ORDER BY a) FROM new_table);
+    return null;
+  end;
+$$;
+CREATE TRIGGER trans_updatetrig
+  AFTER UPDATE ON range_parted REFERENCING OLD TABLE AS old_table NEW TABLE AS new_table
+  FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+UPDATE range_parted set c = (case when c = 96 then 110 else c + 1 end ) WHERE a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,110,1,), (b,13,98,2,), (b,15,106,16,), (b,17,106,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 13 |  98 |  2 | 
+ part_d_15_20   | b | 15 | 106 | 16 | 
+ part_d_15_20   | b | 17 | 106 | 19 | 
+ part_d_1_15    | b | 12 | 110 |  1 | 
+(6 rows)
+
+:init_range_parted;
+-- Enabling OLD TABLE capture for both DELETE and UPDATE stmt triggers
+-- should not cause DELETEd rows to be captured twice. The same applies to
+-- INSERT triggers and inserted rows.
+CREATE TRIGGER trans_deletetrig
+  AFTER DELETE ON range_parted REFERENCING OLD TABLE AS old_table
+  FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+CREATE TRIGGER trans_inserttrig
+  AFTER INSERT ON range_parted REFERENCING NEW TABLE AS new_table
+  FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+UPDATE range_parted set c = c + 50 WHERE a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,146,1,), (b,13,147,2,), (b,15,155,16,), (b,17,155,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_15_20   | b | 15 | 155 | 16 | 
+ part_d_15_20   | b | 17 | 155 | 19 | 
+ part_d_1_15    | b | 12 | 146 |  1 | 
+ part_d_1_15    | b | 13 | 147 |  2 | 
+(6 rows)
+
+DROP TRIGGER trans_deletetrig ON range_parted;
+DROP TRIGGER trans_inserttrig ON range_parted;
+-- Don't drop trans_updatetrig yet. It is required below.
+-- Test with transition tuple conversion happening for rows moved into the
+-- new partition. This requires a trigger that references a transition table
+-- (we already have trans_updatetrig). Usually, no conversion is needed for
+-- inserted rows, because the original tuple is already compatible with the
+-- desired transition tuple format. But conversion does happen when there is a
+-- BR trigger, because the trigger can change the inserted row. So we install
+-- BR triggers on those child partitions where rows are moved as part of
+-- update-row-movement.
+CREATE FUNCTION func_parted_mod_b() RETURNS trigger AS $$
+BEGIN
+   NEW.b = NEW.b + 1;
+   return NEW;
+END $$ language plpgsql;
+CREATE TRIGGER trig_c1_100 BEFORE UPDATE OR INSERT ON part_c_1_100
+   FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+CREATE TRIGGER trig_d1_15 BEFORE UPDATE OR INSERT ON part_d_1_15
+   FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+CREATE TRIGGER trig_d15_20 BEFORE UPDATE OR INSERT ON part_d_15_20
+   FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+:init_range_parted;
+UPDATE range_parted set c = (case when c = 96 then 110 else c + 1 end) WHERE a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,13,96,1,), (b,14,97,2,), (b,16,105,16,), (b,18,105,19,), new table = (b,15,110,1,), (b,15,98,2,), (b,17,106,16,), (b,19,106,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 15 |  98 |  2 | 
+ part_d_15_20   | b | 17 | 106 | 16 | 
+ part_d_15_20   | b | 19 | 106 | 19 | 
+ part_d_1_15    | b | 15 | 110 |  1 | 
+(6 rows)
+
+:init_range_parted;
+UPDATE range_parted set c = c + 50 WHERE a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,13,96,1,), (b,14,97,2,), (b,16,105,16,), (b,18,105,19,), new table = (b,15,146,1,), (b,16,147,2,), (b,17,155,16,), (b,19,155,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_15_20   | b | 17 | 155 | 16 | 
+ part_d_15_20   | b | 19 | 155 | 19 | 
+ part_d_1_15    | b | 15 | 146 |  1 | 
+ part_d_1_15    | b | 16 | 147 |  2 | 
+(6 rows)
+
+-- Case where per-partition tuple conversion map array is allocated, but the
+-- map is not required for the particular tuple that is routed, thanks to
+-- matching table attributes of the partition and the target table.
+:init_range_parted;
+UPDATE range_parted set b = 15 WHERE b = 1;
+NOTICE:  trigger = trans_updatetrig, old table = (a,1,1,1,), new table = (a,15,1,1,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_10_a_20 | a | 15 |   1 |  1 | 
+ part_c_1_100   | b | 13 |  96 |  1 | 
+ part_c_1_100   | b | 14 |  97 |  2 | 
+ part_d_15_20   | b | 16 | 105 | 16 | 
+ part_d_15_20   | b | 18 | 105 | 19 | 
+(6 rows)
+
+DROP TRIGGER trans_updatetrig ON range_parted;
+DROP TRIGGER trig_c1_100 ON part_c_1_100;
+DROP TRIGGER trig_d1_15 ON part_d_1_15;
+DROP TRIGGER trig_d15_20 ON part_d_15_20;
+DROP FUNCTION func_parted_mod_b();
+-- RLS policies with update-row-movement
+-----------------------------------------
+ALTER TABLE range_parted ENABLE ROW LEVEL SECURITY;
+CREATE USER regress_range_parted_user;
+GRANT ALL ON range_parted, mintab TO regress_range_parted_user;
+CREATE POLICY seeall ON range_parted AS PERMISSIVE FOR SELECT USING (true);
+CREATE POLICY policy_range_parted ON range_parted for UPDATE USING (true) WITH CHECK (c % 2 = 0);
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- This should fail with RLS violation error while moving row from
+-- part_a_10_a_20 to part_d_1_15, because we are setting 'c' to an odd number.
+UPDATE range_parted set a = 'b', c = 151 WHERE a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy for table "range_parted"
+RESET SESSION AUTHORIZATION;
+-- Create a trigger on part_d_1_15
+CREATE FUNCTION func_d_1_15() RETURNS trigger AS $$
+BEGIN
+   NEW.c = NEW.c + 1; -- Make even numbers odd, or vice versa
+   return NEW;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_d_1_15 BEFORE INSERT ON part_d_1_15
+   FOR EACH ROW EXECUTE PROCEDURE func_d_1_15();
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- Here, RLS checks should succeed while moving row from part_a_10_a_20 to
+-- part_d_1_15. Even though the UPDATE is setting 'c' to an odd number, the
+-- trigger at the destination partition again makes it an even number.
+UPDATE range_parted set a = 'b', c = 151 WHERE a = 'a' and c = 200;
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- This should fail with RLS violation error. Even though the UPDATE is setting
+-- 'c' to an even number, the trigger at the destination partition again makes
+-- it an odd number.
+UPDATE range_parted set a = 'b', c = 150 WHERE a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy for table "range_parted"
+-- Cleanup
+RESET SESSION AUTHORIZATION;
+DROP TRIGGER trig_d_1_15 ON part_d_1_15;
+DROP FUNCTION func_d_1_15();
+-- Policy expression contains SubPlan
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+CREATE POLICY policy_range_parted_subplan on range_parted AS RESTRICTIVE for UPDATE USING (true)
+    WITH CHECK ((SELECT range_parted.c <= c1 FROM mintab ));
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- fail, mintab has row with c1 = 120
+UPDATE range_parted set a = 'b', c = 122 WHERE a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy "policy_range_parted_subplan" for table "range_parted"
 -- ok
-update range_parted set b = b + 1 where b = 10;
+UPDATE range_parted set a = 'b', c = 120 WHERE a = 'a' and c = 200;
+-- RLS policy expression contains whole row.
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+CREATE POLICY policy_range_parted_wholerow on range_parted AS RESTRICTIVE for UPDATE USING (true)
+   WITH CHECK (range_parted = row('b', 10, 112, 1, NULL)::range_parted);
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- ok, should pass the RLS check
+UPDATE range_parted set a = 'b', c = 112 WHERE a = 'a' and c = 200;
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- fail, the whole row RLS check should fail
+UPDATE range_parted set a = 'b', c = 116 WHERE a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy "policy_range_parted_wholerow" for table "range_parted"
+-- Cleanup
+RESET SESSION AUTHORIZATION;
+DROP POLICY policy_range_parted ON range_parted;
+DROP POLICY policy_range_parted_subplan ON range_parted;
+DROP POLICY policy_range_parted_wholerow ON range_parted;
+REVOKE ALL ON range_parted, mintab FROM regress_range_parted_user;
+DROP USER regress_range_parted_user;
+DROP TABLE mintab;
+-- statement triggers with update row movement
+---------------------------------------------------
+:init_range_parted;
+CREATE FUNCTION trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+CREATE TRIGGER parent_delete_trig
+  AFTER DELETE ON range_parted for each statement execute procedure trigfunc();
+CREATE TRIGGER parent_update_trig
+  AFTER UPDATE ON range_parted for each statement execute procedure trigfunc();
+CREATE TRIGGER parent_insert_trig
+  AFTER INSERT ON range_parted for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_1_100
+CREATE TRIGGER c1_delete_trig
+  AFTER DELETE ON part_c_1_100 for each statement execute procedure trigfunc();
+CREATE TRIGGER c1_update_trig
+  AFTER UPDATE ON part_c_1_100 for each statement execute procedure trigfunc();
+CREATE TRIGGER c1_insert_trig
+  AFTER INSERT ON part_c_1_100 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_1_15
+CREATE TRIGGER d1_delete_trig
+  AFTER DELETE ON part_d_1_15 for each statement execute procedure trigfunc();
+CREATE TRIGGER d1_update_trig
+  AFTER UPDATE ON part_d_1_15 for each statement execute procedure trigfunc();
+CREATE TRIGGER d1_insert_trig
+  AFTER INSERT ON part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+CREATE TRIGGER d15_delete_trig
+  AFTER DELETE ON part_d_15_20 for each statement execute procedure trigfunc();
+CREATE TRIGGER d15_update_trig
+  AFTER UPDATE ON part_d_15_20 for each statement execute procedure trigfunc();
+CREATE TRIGGER d15_insert_trig
+  AFTER INSERT ON part_d_15_20 for each statement execute procedure trigfunc();
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or
+-- insert statement triggers should be fired.
+UPDATE range_parted set c = c - 50 WHERE c > 97;
+NOTICE:  trigger = parent_update_trig fired on table range_parted during UPDATE
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 150 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  55 | 16 | 
+ part_c_1_100   | b | 17 |  55 | 19 | 
+(6 rows)
+
+DROP TRIGGER parent_delete_trig ON range_parted;
+DROP TRIGGER parent_update_trig ON range_parted;
+DROP TRIGGER parent_insert_trig ON range_parted;
+DROP TRIGGER c1_delete_trig ON part_c_1_100;
+DROP TRIGGER c1_update_trig ON part_c_1_100;
+DROP TRIGGER c1_insert_trig ON part_c_1_100;
+DROP TRIGGER d1_delete_trig ON part_d_1_15;
+DROP TRIGGER d1_update_trig ON part_d_1_15;
+DROP TRIGGER d1_insert_trig ON part_d_1_15;
+DROP TRIGGER d15_delete_trig ON part_d_15_20;
+DROP TRIGGER d15_update_trig ON part_d_15_20;
+DROP TRIGGER d15_insert_trig ON part_d_15_20;
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
-                                  Table "public.part_def"
- Column |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description 
---------+---------+-----------+----------+---------+----------+--------------+-------------
- a      | text    |           |          |         | extended |              | 
- b      | integer |           |          |         | plain    |              | 
+                                       Table "public.part_def"
+ Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description 
+--------+-------------------+-----------+----------+---------+----------+--------------+-------------
+ a      | text              |           |          |         | extended |              | 
+ b      | bigint            |           |          |         | plain    |              | 
+ c      | numeric           |           |          |         | main     |              | 
+ d      | integer           |           |          |         | plain    |              | 
+ e      | character varying |           |          |         | extended |              | 
 Partition of: range_parted DEFAULT
-Partition constraint: (NOT ((a IS NOT NULL) AND (b IS NOT NULL) AND (((a = 'a'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'a'::text) AND (b >= 10) AND (b < 20)) OR ((a = 'b'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'b'::text) AND (b >= 10) AND (b < 20)))))
+Partition constraint: (NOT ((a IS NOT NULL) AND (b IS NOT NULL) AND (((a = 'a'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'a'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'b'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '20'::bigint) AND (b < '30'::bigint)))))
 
 insert into range_parted values ('c', 9);
 -- ok
@@ -235,21 +678,192 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 ERROR:  new row for relation "part_def" violates partition constraint
-DETAIL:  Failing row contains (a, 9).
-create table list_parted (
+DETAIL:  Failing row contains (a, 9, null, null, null).
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from non-default to default partition.
+-- fail, default partition is not under part_a_10_a_20.
+UPDATE part_a_10_a_20 set a = 'ad' WHERE a = 'a';
+ERROR:  new row for relation "part_a_10_a_20" violates partition constraint
+DETAIL:  Failing row contains (ad, 10, 200, 1, null).
+-- ok
+UPDATE range_parted set a = 'ad' WHERE a = 'a';
+UPDATE range_parted set a = 'bd' WHERE a = 'b';
+:show_data;
+ partname | a  | b  |  c  | d  | e 
+----------+----+----+-----+----+---
+ part_def | ad |  1 |   1 |  1 | 
+ part_def | ad | 10 | 200 |  1 | 
+ part_def | bd | 12 |  96 |  1 | 
+ part_def | bd | 13 |  97 |  2 | 
+ part_def | bd | 15 | 105 | 16 | 
+ part_def | bd | 17 | 105 | 19 | 
+ part_def | d  |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from default to non-default partitions.
+-- ok
+UPDATE range_parted set a = 'a' WHERE a = 'ad';
+UPDATE range_parted set a = 'b' WHERE a = 'bd';
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
+-- Cleanup: range_parted no longer needed.
+DROP TABLE range_parted;
+CREATE TABLE list_parted (
 	a text,
 	b int
-) partition by list (a);
-create table list_part1  partition of list_parted for values in ('a', 'b');
-create table list_default partition of list_parted default;
-insert into list_part1 values ('a', 1);
-insert into list_default values ('d', 10);
+) PARTITION BY list (a);
+CREATE TABLE list_part1  PARTITION OF list_parted for VALUES in ('a', 'b');
+CREATE TABLE list_default PARTITION OF list_parted default;
+INSERT into list_part1 VALUES ('a', 1);
+INSERT into list_default VALUES ('d', 10);
 -- fail
-update list_default set a = 'a' where a = 'd';
+UPDATE list_default set a = 'a' WHERE a = 'd';
 ERROR:  new row for relation "list_default" violates partition constraint
 DETAIL:  Failing row contains (a, 10).
 -- ok
-update list_default set a = 'x' where a = 'd';
+UPDATE list_default set a = 'x' WHERE a = 'd';
+DROP TABLE list_parted;
+--------------
+-- Some more update-partition-key test scenarios below. This time use list
+-- partitions.
+--------------
+-- Setup for list partitions
+CREATE TABLE list_parted (a numeric, b int, c int8) PARTITION BY list (a);
+CREATE TABLE sub_parted PARTITION OF list_parted for VALUES in (1) PARTITION BY list (b);
+CREATE TABLE sub_part1(b int, c int8, a numeric);
+ALTER TABLE sub_parted ATTACH PARTITION sub_part1 for VALUES in (1);
+CREATE TABLE sub_part2(b int, c int8, a numeric);
+ALTER TABLE sub_parted ATTACH PARTITION sub_part2 for VALUES in (2);
+CREATE TABLE list_part1(a numeric, b int, c int8);
+ALTER TABLE list_parted ATTACH PARTITION list_part1 for VALUES in (2,3);
+INSERT into list_parted VALUES (2,5,50);
+INSERT into list_parted VALUES (3,6,60);
+INSERT into sub_parted VALUES (1,1,60);
+INSERT into sub_parted VALUES (1,2,10);
+-- Test partition constraint violation when an intermediate ancestor is used
+-- and the constraint is inherited from the upper root.
+UPDATE sub_parted set a = 2 WHERE c = 10;
+ERROR:  new row for relation "sub_part2" violates partition constraint
+DETAIL:  Failing row contains (2, 10, 2).
+-- Test update-partition-key, where the unpruned partitions do not have their
+-- partition keys updated.
+SELECT tableoid::regclass::text , * FROM list_parted WHERE a = 2 ORDER BY 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+UPDATE list_parted set b = c + a WHERE a = 2;
+SELECT tableoid::regclass::text , * FROM list_parted WHERE a = 2 ORDER BY 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Tests for BR UPDATE triggers changing the partition key.
+-----------
+CREATE FUNCTION func_parted_mod_b() returns trigger as $$
+BEGIN
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1.
+UPDATE list_parted set c = 70 WHERE b  = 1;
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+DROP TRIGGER parted_mod_b ON sub_part1;
+-- If a BR DELETE trigger prevents the DELETE from happening, we should also
+-- skip the INSERT when that DELETE is part of UPDATE=>DELETE+INSERT.
+CREATE OR REPLACE FUNCTION func_parted_mod_b() returns trigger as $$
+BEGIN
+   raise notice 'Trigger: Got OLD row %, but returning NULL', OLD;
+   return NULL;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_skip_delete before delete on sub_part2
+   for each row execute procedure func_parted_mod_b();
+UPDATE list_parted set b = 1 WHERE c = 70;
+NOTICE:  Trigger: Got OLD row (2,70,1), but returning NULL
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+-- Drop the trigger. Now the row should be moved.
+DROP TRIGGER trig_skip_delete ON sub_part2;
+UPDATE list_parted set b = 1 WHERE c = 70;
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+DROP FUNCTION func_parted_mod_b();
+-- UPDATE partition-key with FROM clause. If the join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only
+-- once. There should not be any rows inserted.
+CREATE TABLE non_parted (id int);
+INSERT into non_parted VALUES (1), (1), (1), (2), (2), (2), (3), (3), (3);
+UPDATE list_parted t1 set a = 2 FROM non_parted t2 WHERE t1.a = t2.id and a = 1;
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 |  1 | 70
+ list_part1 | 2 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+DROP TABLE non_parted;
+-- Cleanup: list_parted no longer needed.
+DROP TABLE list_parted;
 -- create custom operator class and hash function, for the same reason
 -- explained in alter_table.sql
 create or replace function dummy_hashint4(a int4, seed int8) returns int8 as
@@ -271,14 +885,11 @@ insert into hpart4 values (3, 4);
 update hpart1 set a = 3, b=4 where a = 1;
 ERROR:  new row for relation "hpart1" violates partition constraint
 DETAIL:  Failing row contains (3, 4).
+-- ok, row movement
 update hash_parted set b = b - 1 where b = 1;
-ERROR:  new row for relation "hpart1" violates partition constraint
-DETAIL:  Failing row contains (1, 0).
 -- ok
 update hash_parted set b = b + 8 where b = 1;
 -- cleanup
-drop table range_parted;
-drop table list_parted;
 drop table hash_parted;
 drop operator class custom_opclass using hash;
 drop function dummy_hashint4(a int4, seed int8);
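Stepping back from the diff, the core behavior these tests exercise — an UPDATE that changes the partition key deletes the row from the source partition and re-inserts it into the destination partition — can be sketched with a minimal, self-contained example. All object names below are illustrative, not taken from the patch:

```sql
-- Two-partition range table (hypothetical names).
CREATE TABLE demo (a int, b text) PARTITION BY RANGE (a);
CREATE TABLE demo_low  PARTITION OF demo FOR VALUES FROM (1) TO (10);
CREATE TABLE demo_high PARTITION OF demo FOR VALUES FROM (10) TO (20);
INSERT INTO demo VALUES (5, 'x');   -- routed to demo_low

-- Without the patch this fails with a partition constraint violation;
-- with it, the row is deleted from demo_low and inserted into demo_high.
UPDATE demo SET a = 15 WHERE a = 5;

SELECT tableoid::regclass AS partname, * FROM demo;
```

Note that UPDATEs issued against a child table (e.g. `UPDATE demo_low ...`) still fail rather than move rows, since movement is confined to the partition subtree of the table named in the query.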
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 0c70d64..7f49656 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,25 +107,338 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+
+---------------------------
+-- UPDATE with row movement
+---------------------------
+
+-- When a partitioned table receives an UPDATE to the partition key and the
+-- new values no longer meet the partition's bound, the row must be moved to
+-- the correct partition for the new partition key (if one exists). We must
+-- also ensure that updatable views on partitioned tables properly enforce any
+-- WITH CHECK OPTION that is defined. The situation with triggers in this case
+-- also requires thorough testing, as partition key updates causing row
+-- movement convert UPDATEs into DELETE+INSERT.
+
+CREATE TABLE range_parted (
 	a text,
-	b int
-) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
-create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
+) PARTITION BY RANGE (a, b);
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+CREATE TABLE part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+ALTER TABLE range_parted ATTACH PARTITION part_b_20_b_30 FOR VALUES FROM ('b', 20) TO ('b', 30);
+CREATE TABLE part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) PARTITION BY RANGE (c);
+CREATE TABLE part_b_1_b_10 PARTITION OF range_parted FOR VALUES FROM ('b', 1) TO ('b', 10);
+ALTER TABLE range_parted ATTACH PARTITION part_b_10_b_20 FOR VALUES FROM ('b', 10) TO ('b', 20);
+CREATE TABLE part_a_10_a_20 PARTITION OF range_parted FOR VALUES FROM ('a', 10) TO ('a', 20);
+CREATE TABLE part_a_1_a_10 PARTITION OF range_parted FOR VALUES FROM ('a', 1) TO ('a', 10);
+
+-- Check that partition-key UPDATE works sanely on a partitioned table that
+-- does not have any child partitions.
+UPDATE part_b_10_b_20 set b = b - 6;
+
+-- Create some more partitions following the above pattern of descending bound
+-- order, but let's make the situation a bit more complex by having the
+-- attribute numbers of the columns vary from their parent partition.
+CREATE TABLE part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) PARTITION BY range (abs(d));
+ALTER TABLE part_c_100_200 DROP COLUMN e, DROP COLUMN c, DROP COLUMN a;
+ALTER TABLE part_c_100_200 ADD COLUMN c numeric, ADD COLUMN e varchar, ADD COLUMN a text;
+ALTER TABLE part_c_100_200 DROP COLUMN b;
+ALTER TABLE part_c_100_200 ADD COLUMN b bigint;
+CREATE TABLE part_d_1_15 PARTITION OF part_c_100_200 FOR VALUES FROM (1) TO (15);
+CREATE TABLE part_d_15_20 PARTITION OF part_c_100_200 FOR VALUES FROM (15) TO (20);
+
+ALTER TABLE part_b_10_b_20 ATTACH PARTITION part_c_100_200 FOR VALUES FROM (100) TO (200);
+
+CREATE TABLE part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+ALTER TABLE part_b_10_b_20 ATTACH PARTITION part_c_1_100 FOR VALUES FROM (1) TO (100);
+
+\set init_range_parted 'truncate range_parted; insert into range_parted VALUES (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted ORDER BY 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+
+-- The order of subplans should match the partition bound order
+EXPLAIN (costs off) UPDATE range_parted set c = c - 50 WHERE c > 97;
+
+-- fail, row movement happens only within the partition subtree.
+UPDATE part_c_100_200 set c = c - 20, d = c WHERE c = 105;
+-- fail, no partition key update, so no attempt to move tuple,
+-- but "a = 'a'" violates partition constraint enforced by root partition
+UPDATE part_b_10_b_20 set a = 'a';
+-- ok, partition key update, no constraint violation
+UPDATE range_parted set d = d - 10 WHERE d > 10;
+-- ok, no partition key update, no constraint violation
+UPDATE range_parted set e = d;
+-- No row found
+UPDATE part_c_1_100 set c = c + 20 WHERE c = 98;
+-- ok, row movement
+UPDATE part_b_10_b_20 set c = c + 20 returning c, b, a;
+:show_data;
+
+-- fail, row movement happens only within the partition subtree.
+UPDATE part_b_10_b_20 set b = b - 6 WHERE c > 116 returning *;
+-- ok, row movement, with subset of rows moved into different partition.
+UPDATE range_parted set b = b - 6 WHERE c > 116 returning a, b + c;
+
+:show_data;
+
+-- Common table needed for multiple test scenarios.
+CREATE TABLE mintab(c1 int);
+INSERT into mintab VALUES (120);
+
+-- update partition key using updatable view.
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 FROM mintab) WITH CHECK OPTION;
+-- ok
+UPDATE upview set c = 199 WHERE b = 4;
+-- fail, check option violation
+UPDATE upview set c = 120 WHERE b = 4;
+-- fail, row movement with check option violation
+UPDATE upview set a = 'b', b = 15, c = 120 WHERE b = 4;
+-- ok, row movement, check option passes
+UPDATE upview set a = 'b', b = 15 WHERE b = 4;
+
+:show_data;
+
+-- cleanup
+DROP VIEW upview;
+
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+UPDATE range_parted set c = 95 WHERE a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+:show_data;
+
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+
+CREATE FUNCTION trans_updatetrigfunc() RETURNS trigger LANGUAGE plpgsql AS
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' ORDER BY a) FROM old_table),
+                 (select string_agg(new_table::text, ', ' ORDER BY a) FROM new_table);
+    return null;
+  end;
+$$;
+
+CREATE TRIGGER trans_updatetrig
+  AFTER UPDATE ON range_parted REFERENCING OLD TABLE AS old_table NEW TABLE AS new_table
+  FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+
+UPDATE range_parted set c = (case when c = 96 then 110 else c + 1 end ) WHERE a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+
+-- Enabling OLD TABLE capture for both DELETE and UPDATE statement triggers
+-- should not cause DELETEd rows to be captured twice. Similarly for INSERT
+-- triggers and inserted rows.
+CREATE TRIGGER trans_deletetrig
+  AFTER DELETE ON range_parted REFERENCING OLD TABLE AS old_table
+  FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+CREATE TRIGGER trans_inserttrig
+  AFTER INSERT ON range_parted REFERENCING NEW TABLE AS new_table
+  FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+UPDATE range_parted set c = c + 50 WHERE a = 'b' and b > 10 and c >= 96;
+:show_data;
+DROP TRIGGER trans_deletetrig ON range_parted;
+DROP TRIGGER trans_inserttrig ON range_parted;
+-- Don't drop trans_updatetrig yet. It is required below.
+
+-- Test with transition tuple conversion happening for rows moved into the
+-- new partition. This requires a trigger that references a transition table
+-- (we already have trans_updatetrig). Usually no conversion is needed for
+-- inserted rows, because the original tuple is already compatible with the
+-- desired transition tuple format. But conversion does happen when there is
+-- a BR trigger, because the trigger can change the inserted row. So we
+-- install BR triggers on those child partitions where rows are moved as
+-- part of update-row-movement.
+CREATE FUNCTION func_parted_mod_b() RETURNS trigger AS $$
+BEGIN
+   NEW.b = NEW.b + 1;
+   return NEW;
+END $$ language plpgsql;
+CREATE TRIGGER trig_c1_100 BEFORE UPDATE OR INSERT ON part_c_1_100
+   FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+CREATE TRIGGER trig_d1_15 BEFORE UPDATE OR INSERT ON part_d_1_15
+   FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+CREATE TRIGGER trig_d15_20 BEFORE UPDATE OR INSERT ON part_d_15_20
+   FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+:init_range_parted;
+UPDATE range_parted set c = (case when c = 96 then 110 else c + 1 end) WHERE a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+UPDATE range_parted set c = c + 50 WHERE a = 'b' and b > 10 and c >= 96;
+:show_data;
+
+-- Case where per-partition tuple conversion map array is allocated, but the
+-- map is not required for the particular tuple that is routed, thanks to
+-- matching table attributes of the partition and the target table.
+:init_range_parted;
+UPDATE range_parted set b = 15 WHERE b = 1;
+:show_data;
+
+DROP TRIGGER trans_updatetrig ON range_parted;
+DROP TRIGGER trig_c1_100 ON part_c_1_100;
+DROP TRIGGER trig_d1_15 ON part_d_1_15;
+DROP TRIGGER trig_d15_20 ON part_d_15_20;
+DROP FUNCTION func_parted_mod_b();
+
+-- RLS policies with update-row-movement
+-----------------------------------------
+
+ALTER TABLE range_parted ENABLE ROW LEVEL SECURITY;
+CREATE USER regress_range_parted_user;
+GRANT ALL ON range_parted, mintab TO regress_range_parted_user;
+CREATE POLICY seeall ON range_parted AS PERMISSIVE FOR SELECT USING (true);
+CREATE POLICY policy_range_parted ON range_parted for UPDATE USING (true) WITH CHECK (c % 2 = 0);
+
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- This should fail with RLS violation error while moving row from
+-- part_a_10_a_20 to part_d_1_15, because we are setting 'c' to an odd number.
+UPDATE range_parted set a = 'b', c = 151 WHERE a = 'a' and c = 200;
+
+RESET SESSION AUTHORIZATION;
+-- Create a trigger on part_d_1_15
+CREATE FUNCTION func_d_1_15() RETURNS trigger AS $$
+BEGIN
+   NEW.c = NEW.c + 1; -- Make even numbers odd, or vice versa
+   return NEW;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_d_1_15 BEFORE INSERT ON part_d_1_15
+   FOR EACH ROW EXECUTE PROCEDURE func_d_1_15();
+
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+
+-- Here, RLS checks should succeed while moving row from part_a_10_a_20 to
+-- part_d_1_15. Even though the UPDATE is setting 'c' to an odd number, the
+-- trigger at the destination partition again makes it an even number.
+UPDATE range_parted set a = 'b', c = 151 WHERE a = 'a' and c = 200;
+
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- This should fail with RLS violation error. Even though the UPDATE is setting
+-- 'c' to an even number, the trigger at the destination partition again makes
+-- it an odd number.
+UPDATE range_parted set a = 'b', c = 150 WHERE a = 'a' and c = 200;
+
+-- Cleanup
+RESET SESSION AUTHORIZATION;
+DROP TRIGGER trig_d_1_15 ON part_d_1_15;
+DROP FUNCTION func_d_1_15();
+
+-- Policy expression contains SubPlan
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+CREATE POLICY policy_range_parted_subplan on range_parted AS RESTRICTIVE for UPDATE USING (true)
+    WITH CHECK ((SELECT range_parted.c <= c1 FROM mintab ));
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- fail, mintab has row with c1 = 120
+UPDATE range_parted set a = 'b', c = 122 WHERE a = 'a' and c = 200;
 -- ok
-update range_parted set b = b + 1 where b = 10;
+UPDATE range_parted set a = 'b', c = 120 WHERE a = 'a' and c = 200;
+
+-- RLS policy expression contains whole row.
+
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+CREATE POLICY policy_range_parted_wholerow on range_parted AS RESTRICTIVE for UPDATE USING (true)
+   WITH CHECK (range_parted = row('b', 10, 112, 1, NULL)::range_parted);
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- ok, should pass the RLS check
+UPDATE range_parted set a = 'b', c = 112 WHERE a = 'a' and c = 200;
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- fail, the whole row RLS check should fail
+UPDATE range_parted set a = 'b', c = 116 WHERE a = 'a' and c = 200;
+
+-- Cleanup
+RESET SESSION AUTHORIZATION;
+DROP POLICY policy_range_parted ON range_parted;
+DROP POLICY policy_range_parted_subplan ON range_parted;
+DROP POLICY policy_range_parted_wholerow ON range_parted;
+REVOKE ALL ON range_parted, mintab FROM regress_range_parted_user;
+DROP USER regress_range_parted_user;
+DROP TABLE mintab;
+
+
+-- statement triggers with update row movement
+---------------------------------------------------
+
+:init_range_parted;
+
+CREATE FUNCTION trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+CREATE TRIGGER parent_delete_trig
+  AFTER DELETE ON range_parted for each statement execute procedure trigfunc();
+CREATE TRIGGER parent_update_trig
+  AFTER UPDATE ON range_parted for each statement execute procedure trigfunc();
+CREATE TRIGGER parent_insert_trig
+  AFTER INSERT ON range_parted for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_1_100
+CREATE TRIGGER c1_delete_trig
+  AFTER DELETE ON part_c_1_100 for each statement execute procedure trigfunc();
+CREATE TRIGGER c1_update_trig
+  AFTER UPDATE ON part_c_1_100 for each statement execute procedure trigfunc();
+CREATE TRIGGER c1_insert_trig
+  AFTER INSERT ON part_c_1_100 for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_d_1_15
+CREATE TRIGGER d1_delete_trig
+  AFTER DELETE ON part_d_1_15 for each statement execute procedure trigfunc();
+CREATE TRIGGER d1_update_trig
+  AFTER UPDATE ON part_d_1_15 for each statement execute procedure trigfunc();
+CREATE TRIGGER d1_insert_trig
+  AFTER INSERT ON part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+CREATE TRIGGER d15_delete_trig
+  AFTER DELETE ON part_d_15_20 for each statement execute procedure trigfunc();
+CREATE TRIGGER d15_update_trig
+  AFTER UPDATE ON part_d_15_20 for each statement execute procedure trigfunc();
+CREATE TRIGGER d15_insert_trig
+  AFTER INSERT ON part_d_15_20 for each statement execute procedure trigfunc();
+
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or
+-- insert statement triggers should be fired.
+UPDATE range_parted set c = c - 50 WHERE c > 97;
+:show_data;
+
+DROP TRIGGER parent_delete_trig ON range_parted;
+DROP TRIGGER parent_update_trig ON range_parted;
+DROP TRIGGER parent_insert_trig ON range_parted;
+DROP TRIGGER c1_delete_trig ON part_c_1_100;
+DROP TRIGGER c1_update_trig ON part_c_1_100;
+DROP TRIGGER c1_insert_trig ON part_c_1_100;
+DROP TRIGGER d1_delete_trig ON part_d_1_15;
+DROP TRIGGER d1_update_trig ON part_d_1_15;
+DROP TRIGGER d1_insert_trig ON part_d_1_15;
+DROP TRIGGER d15_delete_trig ON part_d_15_20;
+DROP TRIGGER d15_update_trig ON part_d_15_20;
+DROP TRIGGER d15_insert_trig ON part_d_15_20;
+
 
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
 insert into range_parted values ('c', 9);
@@ -134,19 +447,121 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 
-create table list_parted (
+:show_data;
+
+-- Update row movement from non-default to default partition.
+-- fail, default partition is not under part_a_10_a_20.
+UPDATE part_a_10_a_20 set a = 'ad' WHERE a = 'a';
+-- ok
+UPDATE range_parted set a = 'ad' WHERE a = 'a';
+UPDATE range_parted set a = 'bd' WHERE a = 'b';
+:show_data;
+-- Update row movement from default to non-default partitions.
+-- ok
+UPDATE range_parted set a = 'a' WHERE a = 'ad';
+UPDATE range_parted set a = 'b' WHERE a = 'bd';
+:show_data;
+
+-- Cleanup: range_parted no longer needed.
+DROP TABLE range_parted;
+
+CREATE TABLE list_parted (
 	a text,
 	b int
-) partition by list (a);
-create table list_part1  partition of list_parted for values in ('a', 'b');
-create table list_default partition of list_parted default;
-insert into list_part1 values ('a', 1);
-insert into list_default values ('d', 10);
+) PARTITION BY list (a);
+CREATE TABLE list_part1  PARTITION OF list_parted for VALUES in ('a', 'b');
+CREATE TABLE list_default PARTITION OF list_parted default;
+INSERT into list_part1 VALUES ('a', 1);
+INSERT into list_default VALUES ('d', 10);
 
 -- fail
-update list_default set a = 'a' where a = 'd';
+UPDATE list_default set a = 'a' WHERE a = 'd';
 -- ok
-update list_default set a = 'x' where a = 'd';
+UPDATE list_default set a = 'x' WHERE a = 'd';
+
+DROP TABLE list_parted;
+
+--------------
+-- Some more update-partition-key test scenarios below. This time use list
+-- partitions.
+--------------
+
+-- Setup for list partitions
+CREATE TABLE list_parted (a numeric, b int, c int8) PARTITION BY list (a);
+CREATE TABLE sub_parted PARTITION OF list_parted for VALUES in (1) PARTITION BY list (b);
+
+CREATE TABLE sub_part1(b int, c int8, a numeric);
+ALTER TABLE sub_parted ATTACH PARTITION sub_part1 for VALUES in (1);
+CREATE TABLE sub_part2(b int, c int8, a numeric);
+ALTER TABLE sub_parted ATTACH PARTITION sub_part2 for VALUES in (2);
+
+CREATE TABLE list_part1(a numeric, b int, c int8);
+ALTER TABLE list_parted ATTACH PARTITION list_part1 for VALUES in (2,3);
+
+INSERT into list_parted VALUES (2,5,50);
+INSERT into list_parted VALUES (3,6,60);
+INSERT into sub_parted VALUES (1,1,60);
+INSERT into sub_parted VALUES (1,2,10);
+
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+UPDATE sub_parted set a = 2 WHERE c = 10;
+
+-- Test update-partition-key, where the unpruned partitions do not have their
+-- partition keys updated.
+SELECT tableoid::regclass::text, * FROM list_parted WHERE a = 2 ORDER BY 1;
+UPDATE list_parted set b = c + a WHERE a = 2;
+SELECT tableoid::regclass::text, * FROM list_parted WHERE a = 2 ORDER BY 1;
+
+
+-----------
+-- Tests for BR UPDATE triggers changing the partition key.
+-----------
+CREATE FUNCTION func_parted_mod_b() returns trigger as $$
+BEGIN
+   NEW.b = 2; -- This changes a partition key column.
+   return NEW;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+SELECT tableoid::regclass::text, * FROM list_parted ORDER BY 1, 2, 3, 4;
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1.
+UPDATE list_parted set c = 70 WHERE b = 1;
+SELECT tableoid::regclass::text, * FROM list_parted ORDER BY 1, 2, 3, 4;
+
+DROP TRIGGER parted_mod_b ON sub_part1;
+
+-- If a BR DELETE trigger prevents the DELETE from happening, we should also
+-- skip the INSERT when that DELETE is part of UPDATE => DELETE+INSERT.
+CREATE OR REPLACE FUNCTION func_parted_mod_b() returns trigger as $$
+BEGIN
+   raise notice 'Trigger: Got OLD row %, but returning NULL', OLD;
+   return NULL;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_skip_delete before delete on sub_part2
+   for each row execute procedure func_parted_mod_b();
+UPDATE list_parted set b = 1 WHERE c = 70;
+SELECT tableoid::regclass::text, * FROM list_parted ORDER BY 1, 2, 3, 4;
+-- Drop the trigger. Now the row should be moved.
+DROP TRIGGER trig_skip_delete ON sub_part2;
+UPDATE list_parted set b = 1 WHERE c = 70;
+SELECT tableoid::regclass::text, * FROM list_parted ORDER BY 1, 2, 3, 4;
+DROP FUNCTION func_parted_mod_b();
+
+-- UPDATE partition-key with a FROM clause. If the join produces multiple
+-- output rows for the same row to be modified, we should tuple-route the row
+-- only once; no extra rows should be inserted.
+CREATE TABLE non_parted (id int);
+INSERT into non_parted VALUES (1), (1), (1), (2), (2), (2), (3), (3), (3);
+UPDATE list_parted t1 set a = 2 FROM non_parted t2 WHERE t1.a = t2.id and a = 1;
+SELECT tableoid::regclass::text, * FROM list_parted ORDER BY 1, 2, 3, 4;
+DROP TABLE non_parted;
+
+-- Cleanup: list_parted no longer needed.
+DROP TABLE list_parted;
 
 -- create custom operator class and hash function, for the same reason
 -- explained in alter_table.sql
@@ -169,13 +584,12 @@ insert into hpart4 values (3, 4);
 
 -- fail
 update hpart1 set a = 3, b=4 where a = 1;
+-- ok, row movement
 update hash_parted set b = b - 1 where b = 1;
 -- ok
 update hash_parted set b = b + 8 where b = 1;
 
 -- cleanup
-drop table range_parted;
-drop table list_parted;
 drop table hash_parted;
 drop operator class custom_opclass using hash;
 drop function dummy_hashint4(a int4, seed int8);
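The BR trigger scenarios exercised in the tests above boil down to a small decision sequence. A hypothetical control-flow sketch of that behavior (illustrative standalone C; the outcome names and parameters are mine, this is not the patch's actual executor code):

```c
#include <stdbool.h>

/* Illustrative outcomes of an UPDATE on one row of a partition. */
typedef enum UpdOutcome
{
	UPD_IN_PLACE,				/* ordinary in-place UPDATE */
	UPD_MOVED,					/* row deleted here, inserted elsewhere */
	UPD_SKIPPED					/* BR DELETE trigger suppressed the row */
} UpdOutcome;

static UpdOutcome
update_row(bool new_tuple_fits_partition, bool br_delete_returned_null)
{
	if (new_tuple_fits_partition)
		return UPD_IN_PLACE;

	/*
	 * Partition constraint violated: the UPDATE becomes DELETE+INSERT.  If a
	 * BR DELETE trigger returns NULL, the DELETE is suppressed, and the
	 * INSERT into the new partition must be skipped as well (this is the
	 * trig_skip_delete scenario above).
	 */
	if (br_delete_returned_null)
		return UPD_SKIPPED;

	return UPD_MOVED;			/* route and insert into the new partition */
}
```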
#245Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: David Rowley (#243)
Re: [HACKERS] UPDATE of partition key

On 16 January 2018 at 09:17, David Rowley <david.rowley@2ndquadrant.com> wrote:

On 16 January 2018 at 01:09, Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Jan 14, 2018 at 6:57 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Even where partitions are present, in the usual case where there are
Instead of a bool array, we can even make it a Bitmapset. But I think
access would become slower as compared to array, particularly because
it is going to be a heavily used function.

It probably makes little difference -- the Bitmapset will be more
compact (which saves time) but involve function calls (which cost
time).

I'm not arguing in either direction, but you'd also want to factor in
how Bitmapsets only allocate words for the maximum stored member,
which might mean multiple realloc() calls resulting in palloc/memcpy
calls. The array would just be allocated in a single chunk, although
it would be more memory and would require a memset too, however,
that's likely much cheaper than the palloc() anyway.

Right, I agree. There is also the extra function call just to find out
whether a member is present or not. Overall, especially because the data
structure will be used heavily whenever it is set up, I think it's better
to make it an array. In the latest patch, I have retained it as an array.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
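The trade-off discussed above can be illustrated with a minimal sketch (hypothetical standalone C, not PostgreSQL's actual Bitmapset implementation): a flat bool array pays one up-front allocation and a single indexed load per lookup, while a bitmap sized up to its maximum member is more compact but needs a word/bit computation per lookup and may have to grow.

```c
#include <stdbool.h>
#include <stdint.h>

#define BITS_PER_WORD 64

/*
 * Flat bool array: one allocation of num_partitions bytes, and a single
 * indexed load per membership test.
 */
static bool
array_is_member(const bool *flags, int i)
{
	return flags[i];
}

/*
 * Bitmap allocated only up to the maximum stored member (as PostgreSQL's
 * Bitmapset is): more compact, but each test needs a word/bit computation,
 * and raising the maximum member forces reallocation.
 */
static bool
bitmap_is_member(const uint64_t *words, int nwords, int i)
{
	int			wordnum = i / BITS_PER_WORD;

	if (wordnum >= nwords)
		return false;			/* beyond the allocated words */
	return (words[wordnum] >> (i % BITS_PER_WORD)) & 1;
}
```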

#246Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Amit Khandekar (#244)
1 attachment(s)
Re: [HACKERS] UPDATE of partition key

On 16 January 2018 at 16:09, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

I have now embedded the above incremental patch changes into the main
patch (v37) , which is attached.

The patch had to be rebased over commit dca48d145e0e:
Remove useless lookup of root partitioned rel in ExecInitModifyTable().

In ExecInitModifyTable(), "rel" variable was needed only for INSERT.
And node->partitioned_rels is only set in UPDATE/DELETE cases, so the
extra logic of getting the root partitioned rel from
node->partitioned_rels was removed as part of that commit.

But now, for update-tuple-routing, we require rel for UPDATE also, so
we need to get the root partitioned rel. Rather than opening the
root table from node->partitioned_rels, we can re-use the
already-opened mtstate->rootResultInfo, which corresponds to the
head of partitioned_rels. I have renamed getASTriggerResultRelInfo()
to getTargetResultRelInfo(), and used it to get the root partitioned
table. The rename made sense because it has become a function for
general use, rather than one specific to trigger-related
functionality.

Attached rebased patch.
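For context, the per-subplan result-rel reuse the attached patch performs during update tuple routing relies on the per-subplan result rels forming a subsequence of the leaf partitions, both in the same canonical order, so a single advancing cursor matches them up. A rough standalone sketch (hypothetical names such as match_subplans; not the patch's code):

```c
/*
 * Walk the leaf-partition OIDs in canonical order and, with one advancing
 * cursor, record the leaf index at which each per-subplan result rel (a
 * sorted subsequence of the same OIDs) appears -- mirroring how
 * subplan_partition_offsets is filled during tuple-routing setup.
 */
static int
match_subplans(const unsigned int *leaf_oids, int num_leaf,
			   const unsigned int *subplan_oids, int num_subplan,
			   int *offsets)
{
	int			subplan_index = 0;
	int			i;

	for (i = 0; i < num_leaf; i++)
	{
		if (subplan_index < num_subplan &&
			subplan_oids[subplan_index] == leaf_oids[i])
			offsets[subplan_index++] = i;	/* reuse this result rel */
		/* otherwise a fresh result rel would be initialized for leaf i */
	}

	return subplan_index;		/* should end up equal to num_subplan */
}
```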

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

update-partition-key_v37_rebased.patch (application/octet-stream)
diff --git a/contrib/file_fdw/input/file_fdw.source b/contrib/file_fdw/input/file_fdw.source
index e6821d6..88cb5f2 100644
--- a/contrib/file_fdw/input/file_fdw.source
+++ b/contrib/file_fdw/input/file_fdw.source
@@ -178,6 +178,7 @@ SELECT tableoid::regclass, * FROM p1;
 SELECT tableoid::regclass, * FROM p2;
 INSERT INTO pt VALUES (1, 'xyzzy'); -- ERROR
 INSERT INTO pt VALUES (2, 'xyzzy');
+UPDATE pt set a = 1 where a = 2; -- ERROR
 SELECT tableoid::regclass, * FROM pt;
 SELECT tableoid::regclass, * FROM p1;
 SELECT tableoid::regclass, * FROM p2;
diff --git a/contrib/file_fdw/output/file_fdw.source b/contrib/file_fdw/output/file_fdw.source
index 709c43e..e07bb24 100644
--- a/contrib/file_fdw/output/file_fdw.source
+++ b/contrib/file_fdw/output/file_fdw.source
@@ -344,6 +344,8 @@ SELECT tableoid::regclass, * FROM p2;
 INSERT INTO pt VALUES (1, 'xyzzy'); -- ERROR
 ERROR:  cannot route inserted tuples to a foreign table
 INSERT INTO pt VALUES (2, 'xyzzy');
+UPDATE pt set a = 1 where a = 2; -- ERROR
+ERROR:  cannot route inserted tuples to a foreign table
 SELECT tableoid::regclass, * FROM pt;
  tableoid | a |   b   
 ----------+---+-------
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index b1167a4..6d97f26 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3005,6 +3005,11 @@ VALUES ('Albany', NULL, NULL, 'NY');
     foreign table partitions.
    </para>
 
+   <para>
+    Updating the partition key of a row might cause it to be moved to a
+    different partition, namely one whose partition constraint the new row
+    satisfies.
+   </para>
+
    <sect3 id="ddl-partitioning-declarative-example">
     <title>Example</title>
 
@@ -3302,9 +3307,22 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
      <listitem>
       <para>
-       An <command>UPDATE</command> that causes a row to move from one partition to
-       another fails, because the new value of the row fails to satisfy the
-       implicit partition constraint of the original partition.
+       When an <command>UPDATE</command> causes a row to move from one
+       partition to another, there is a chance that another concurrent
+       <command>UPDATE</command> or <command>DELETE</command> misses this row.
+       Suppose session 1 is performing an <command>UPDATE</command> on a
+       partition key, and meanwhile a concurrent session 2, for which this
+       row is visible, performs an <command>UPDATE</command> or
+       <command>DELETE</command> operation on this row. Session 2 can
+       silently miss the row if the row is deleted from the partition as a
+       result of session 1's activity.  In such a case, session 2's
+       <command>UPDATE</command>/<command>DELETE</command>, being unaware of
+       the row movement, concludes that the row has just been deleted, so
+       there is nothing to be done for it.  By contrast, in the usual case
+       where the table is not partitioned, or where there is no row movement,
+       session 2 would have identified the newly updated row and carried out
+       the <command>UPDATE</command>/<command>DELETE</command> on this new
+       row version.
       </para>
      </listitem>
 
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index c0d0f71..296e301 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -282,10 +282,16 @@ UPDATE <replaceable class="parameter">count</replaceable>
 
   <para>
    In the case of a partitioned table, updating a row might cause it to no
-   longer satisfy the partition constraint.  Since there is no provision to
-   move the row to the partition appropriate to the new value of its
-   partitioning key, an error will occur in this case.  This can also happen
-   when updating a partition directly.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If no such partition exists, an error will occur. The error
+   will also occur when updating a partition directly. Behind the scenes, the
+   row movement is actually a <command>DELETE</command> and
+   <command>INSERT</command> operation. However, a concurrent
+   <command>UPDATE</command> or <command>DELETE</command> on the same row
+   may then miss this row. For details, see
+   <xref linkend="ddl-partitioning-declarative-limitations"/>.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/trigger.sgml b/doc/src/sgml/trigger.sgml
index bf5d3f9..8f83e6a 100644
--- a/doc/src/sgml/trigger.sgml
+++ b/doc/src/sgml/trigger.sgml
@@ -154,6 +154,29 @@
    </para>
 
    <para>
+    If an <command>UPDATE</command> on a partitioned table causes a row to move
+    to another partition, it will be performed as a <command>DELETE</command>
+    from the original partition followed by an <command>INSERT</command> into
+    the new partition. In this case, all row-level <literal>BEFORE</literal>
+    <command>UPDATE</command> triggers and all row-level
+    <literal>BEFORE</literal> <command>DELETE</command> triggers are fired on
+    the original partition. Then all row-level <literal>BEFORE</literal>
+    <command>INSERT</command> triggers are fired on the destination partition.
+    The possibility of surprising outcomes should be considered when all these
+    triggers affect the row being moved. As far as <literal>AFTER ROW</literal>
+    triggers are concerned, <literal>AFTER</literal> <command>DELETE</command>
+    and <literal>AFTER</literal> <command>INSERT</command> triggers are
+    applied; but <literal>AFTER</literal> <command>UPDATE</command> triggers
+    are not applied because the <command>UPDATE</command> has been converted to
+    a <command>DELETE</command> and an <command>INSERT</command>. As far as
+    statement-level triggers are concerned, none of the
+    <command>DELETE</command> or <command>INSERT</command> triggers are fired,
+    even if row movement occurs; only the <command>UPDATE</command> triggers
+    defined on the target table used in the <command>UPDATE</command> statement
+    will be fired.
+   </para>
+
+   <para>
     Trigger functions invoked by per-statement triggers should always
     return <symbol>NULL</symbol>. Trigger functions invoked by per-row
     triggers can return a table row (a value of
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 6bfca2a..04a24c6 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -170,7 +170,6 @@ typedef struct CopyStateData
 	PartitionTupleRouting *partition_tuple_routing;
 
 	TransitionCaptureState *transition_capture;
-	TupleConversionMap **transition_tupconv_maps;
 
 	/*
 	 * These variables are used to reduce overhead in textual COPY FROM.
@@ -2481,19 +2480,7 @@ CopyFrom(CopyState cstate)
 		 * tuple).
 		 */
 		if (cstate->transition_capture != NULL)
-		{
-			int			i;
-
-			cstate->transition_tupconv_maps = (TupleConversionMap **)
-				palloc0(sizeof(TupleConversionMap *) * proute->num_partitions);
-			for (i = 0; i < proute->num_partitions; ++i)
-			{
-				cstate->transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(proute->partitions[i]->ri_RelationDesc),
-										   RelationGetDescr(cstate->rel),
-										   gettext_noop("could not convert row type"));
-			}
-		}
+			ExecSetupChildParentMapForLeaf(proute);
 	}
 
 	/*
@@ -2587,7 +2574,6 @@ CopyFrom(CopyState cstate)
 		if (cstate->partition_tuple_routing)
 		{
 			int			leaf_part_index;
-			TupleConversionMap *map;
 			PartitionTupleRouting *proute = cstate->partition_tuple_routing;
 
 			/*
@@ -2651,7 +2637,8 @@ CopyFrom(CopyState cstate)
 					 */
 					cstate->transition_capture->tcs_original_insert_tuple = NULL;
 					cstate->transition_capture->tcs_map =
-						cstate->transition_tupconv_maps[leaf_part_index];
+						TupConvMapForLeaf(proute, saved_resultRelInfo,
+										  leaf_part_index);
 				}
 				else
 				{
@@ -2668,23 +2655,10 @@ CopyFrom(CopyState cstate)
 			 * We might need to convert from the parent rowtype to the
 			 * partition rowtype.
 			 */
-			map = proute->partition_tupconv_maps[leaf_part_index];
-			if (map)
-			{
-				Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-				tuple = do_convert_tuple(tuple, map);
-
-				/*
-				 * We must use the partition's tuple descriptor from this
-				 * point on.  Use a dedicated slot from this point on until
-				 * we're finished dealing with the partition.
-				 */
-				slot = proute->partition_tuple_slot;
-				Assert(slot != NULL);
-				ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-				ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-			}
+			tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[leaf_part_index],
+											  tuple,
+											  proute->partition_tuple_slot,
+											  &slot);
 
 			tuple->t_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
 		}
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 1c488c3..e8af18e 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2854,8 +2854,13 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 	{
 		HeapTuple	trigtuple;
 
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
+		/*
+		 * Note: if the UPDATE is converted into a DELETE+INSERT as part of an
+		 * update-partition-key operation, then this function is also called
+		 * separately for DELETE and INSERT to capture transition table rows.
+		 * In such a case, either the old tuple or the new tuple can be NULL.
+		 */
+		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
 			trigtuple = GetTupleForTrigger(estate,
 										   NULL,
 										   relinfo,
@@ -5414,7 +5419,12 @@ AfterTriggerPendingOnRel(Oid relid)
  *	triggers actually need to be queued.  It is also called after each row,
  *	even if there are no triggers for that event, if there are any AFTER
  *	STATEMENT triggers for the statement which use transition tables, so that
- *	the transition tuplestores can be built.
+ *	the transition tuplestores can be built.  Furthermore, if the transition
+ *	capture is happening for UPDATEd rows being moved to another partition due
+ *	to the partition-key being changed, then this function is called once when
+ *	the row is deleted (to capture OLD row), and once when the row is inserted
+ *	into another partition (to capture NEW row).  This is done separately because
+ *	DELETE and INSERT happen on different tables.
  *
  *	Transition tuplestores are built now, rather than when events are pulled
  *	off of the queue because AFTER ROW triggers are allowed to select from the
@@ -5463,12 +5473,25 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		bool		update_new_table = transition_capture->tcs_update_new_table;
 		bool		insert_new_table = transition_capture->tcs_insert_new_table;;
 
-		if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_old_table))
+		/*
+		 * For INSERT events newtup should be non-NULL, for DELETE events
+		 * oldtup should be non-NULL, whereas for UPDATE events normally both
+		 * oldtup and newtup are non-NULL.  But for UPDATE events fired for
+		 * capturing transition tuples during UPDATE partition-key row
+		 * movement, oldtup is NULL when the event is for a row being inserted,
+		 * whereas newtup is NULL when the event is for a row being deleted.
+		 */
+		Assert(!(event == TRIGGER_EVENT_DELETE && delete_old_table &&
+				 oldtup == NULL));
+		Assert(!(event == TRIGGER_EVENT_INSERT && insert_new_table &&
+				 newtup == NULL));
+
+		if (oldtup != NULL &&
+			((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+			 (event == TRIGGER_EVENT_UPDATE && update_old_table)))
 		{
 			Tuplestorestate *old_tuplestore;
 
-			Assert(oldtup != NULL);
 			old_tuplestore = transition_capture->tcs_private->old_tuplestore;
 
 			if (map != NULL)
@@ -5481,12 +5504,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			else
 				tuplestore_puttuple(old_tuplestore, oldtup);
 		}
-		if ((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
-			(event == TRIGGER_EVENT_UPDATE && update_new_table))
+		if (newtup != NULL &&
+			((event == TRIGGER_EVENT_INSERT && insert_new_table) ||
+			(event == TRIGGER_EVENT_UPDATE && update_new_table)))
 		{
 			Tuplestorestate *new_tuplestore;
 
-			Assert(newtup != NULL);
 			new_tuplestore = transition_capture->tcs_private->new_tuplestore;
 
 			if (original_insert_tuple != NULL)
@@ -5502,11 +5525,18 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				tuplestore_puttuple(new_tuplestore, newtup);
 		}
 
-		/* If transition tables are the only reason we're here, return. */
+		/*
+		 * If transition tables are the only reason we're here, return. As
+		 * mentioned above, we can also be here during update tuple routing in
+		 * presence of transition tables, in which case this function is called
+		 * separately for oldtup and newtup, so we expect exactly one of them
+		 * to be NULL.
+		 */
 		if (trigdesc == NULL ||
 			(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
 			(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+			(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+			(event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL))))
 			return;
 	}
 
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 8c0d2df..a0a611c 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -54,7 +54,11 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 	List	   *leaf_parts;
 	ListCell   *cell;
 	int			i;
-	ResultRelInfo *leaf_part_rri;
+	ResultRelInfo *leaf_part_arr = NULL,
+				  *update_rri = NULL;
+	int			num_update_rri = 0,
+				update_rri_index = 0;
+	bool		is_update = false;
 	PartitionTupleRouting *proute;
 
 	/*
@@ -69,10 +73,56 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 	proute->num_partitions = list_length(leaf_parts);
 	proute->partitions = (ResultRelInfo **) palloc(proute->num_partitions *
 												   sizeof(ResultRelInfo *));
-	proute->partition_tupconv_maps =
+	proute->parent_child_tupconv_maps =
 		(TupleConversionMap **) palloc0(proute->num_partitions *
 										sizeof(TupleConversionMap *));
 
+	/* Initialization specific to update */
+	if (mtstate && mtstate->operation == CMD_UPDATE)
+	{
+		ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+
+		is_update = true;
+		update_rri = mtstate->resultRelInfo;
+		num_update_rri = list_length(node->plans);
+
+		/*
+		 * For updates, if the leaf partition is already present in the
+		 * per-subplan result rels, we re-use that rather than initialize a new
+		 * result rel. The per-subplan resultrels and the resultrels of the
+		 * leaf partitions are both in the same canonical order. So while going
+		 * through the leaf partition oids, we need to keep track of the next
+		 * per-subplan result rel to be looked for in the leaf partition
+		 * resultrels. So, update_rri_index should be set to the first
+		 * per-subplan result rel (i.e. 0), and then should be shifted as we
+		 * find them one by one while scanning the leaf partition oids. (It is
+		 * already set to 0 during initialization, above).
+		 */
+
+		/*
+		 * Prepare for generating the mapping from subplan result rels to leaf
+		 * partition position.
+		 */
+		proute->subplan_partition_offsets =
+			palloc(num_update_rri * sizeof(int));
+
+		/*
+		 * For UPDATEs, we require an additional tuple slot for storing
+		 * transient tuples that are converted to the root table descriptor.
+		 */
+		proute->root_tuple_slot = MakeTupleTableSlot();
+	}
+	else
+	{
+		/*
+		 * For inserts, we need to create all new result rels, so avoid
+		 * repeated pallocs by allocating memory for all the result rels in
+		 * bulk.
+		 */
+		leaf_part_arr = (ResultRelInfo *) palloc0(proute->num_partitions *
+												  sizeof(ResultRelInfo));
+	}
+
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
 	 * given partition's rowtype.  It is attached to the caller-specified node
@@ -81,38 +131,81 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 	 */
 	proute->partition_tuple_slot = MakeTupleTableSlot();
 
-	leaf_part_rri = (ResultRelInfo *) palloc0(proute->num_partitions *
-											  sizeof(ResultRelInfo));
 	i = 0;
 	foreach(cell, leaf_parts)
 	{
-		Relation	partrel;
+		ResultRelInfo *leaf_part_rri;
+		Relation	partrel = NULL;
 		TupleDesc	part_tupdesc;
+		Oid			leaf_oid = lfirst_oid(cell);
+
+		if (is_update)
+		{
+			/* Is this leaf partition present in the update resultrel? */
+			if (update_rri_index < num_update_rri &&
+				RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
+			{
+				leaf_part_rri = &update_rri[update_rri_index];
+				partrel = leaf_part_rri->ri_RelationDesc;
+
+				/*
+				 * This is required when we convert the partition's tuple to be
+				 * compatible with the root partitioned table's tuple
+				 * descriptor.  When generating the per-subplan UPDATE result
+				 * rels, this was not set.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+
+				/*
+				 * Save the position of this update rel in the leaf partitions
+				 * array
+				 */
+				proute->subplan_partition_offsets[update_rri_index] = i;
+
+				update_rri_index++;
+			}
+			else
+				leaf_part_rri = (ResultRelInfo *) palloc0(sizeof(ResultRelInfo));
+		}
+		else
+		{
+			/* For INSERTs, we already have an array of result rels allocated */
+			leaf_part_rri = &leaf_part_arr[i];
+		}
 
 		/*
-		 * We locked all the partitions above including the leaf partitions.
-		 * Note that each of the relations in proute->partitions are
-		 * eventually closed by the caller.
+		 * If we didn't open the partition rel, it means we haven't initialized
+		 * the result rel either.
 		 */
-		partrel = heap_open(lfirst_oid(cell), NoLock);
+		if (!partrel)
+		{
+			/*
+			 * We locked all the partitions above including the leaf
+			 * partitions. Note that each of the newly opened relations in
+			 * proute->partitions are eventually closed by the caller.
+			 */
+			partrel = heap_open(leaf_oid, NoLock);
+			InitResultRelInfo(leaf_part_rri,
+							  partrel,
+							  resultRTindex,
+							  rel,
+							  estate->es_instrument);
+		}
+
 		part_tupdesc = RelationGetDescr(partrel);
 
 		/*
 		 * Save a tuple conversion map to convert a tuple routed to this
 		 * partition from the parent's type to the partition's.
 		 */
-		proute->partition_tupconv_maps[i] =
+		proute->parent_child_tupconv_maps[i] =
 			convert_tuples_by_name(tupDesc, part_tupdesc,
 								   gettext_noop("could not convert row type"));
 
-		InitResultRelInfo(leaf_part_rri,
-						  partrel,
-						  resultRTindex,
-						  rel,
-						  estate->es_instrument);
-
 		/*
-		 * Verify result relation is a valid target for INSERT.
+		 * Verify result relation is a valid target for an INSERT.  An UPDATE
+		 * of a partition-key becomes a DELETE+INSERT operation, so this check
+		 * is still required when the operation is CMD_UPDATE.
 		 */
 		CheckValidResultRel(leaf_part_rri, CMD_INSERT);
 
@@ -132,10 +225,16 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 		estate->es_leaf_result_relations =
 			lappend(estate->es_leaf_result_relations, leaf_part_rri);
 
-		proute->partitions[i] = leaf_part_rri++;
+		proute->partitions[i] = leaf_part_rri;
 		i++;
 	}
 
+	/*
+	 * For UPDATE, we should have found all the per-subplan resultrels in the
+	 * leaf partitions.
+	 */
+	Assert(!is_update || update_rri_index == num_update_rri);
+
 	return proute;
 }
 
@@ -259,6 +358,98 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 }
 
 /*
+ * ExecSetupChildParentMapForLeaf -- Initialize the per-leaf-partition
+ * child-to-root tuple conversion map array.
+ *
+ * This map is required for capturing transition tuples when the target table
+ * is a partitioned table. For a tuple that is routed by an INSERT or UPDATE,
+ * we need to convert it from the leaf partition to the target table
+ * descriptor.
+ */
+void
+ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute)
+{
+	Assert(proute != NULL);
+
+	/*
+	 * These array elements get filled with maps on demand.  Initially, just
+	 * set all of them to NULL.
+	 */
+	proute->child_parent_tupconv_maps =
+		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) *
+										proute->num_partitions);
+
+	/* Likewise for this array; all of its values are initialized to false */
+	proute->child_parent_map_not_required =
+		(bool *) palloc0(sizeof(bool) * proute->num_partitions);
+}
+
+/*
+ * CreateTupConvMapForLeaf -- For a given leaf partition index, create a tuple
+ * conversion map, if not already allocated.
+ *
+ * This function should be called only after it is found that
+ * child_parent_map_not_required is false for the given partition.
+ */
+TupleConversionMap *
+CreateTupConvMapForLeaf(PartitionTupleRouting *proute,
+						ResultRelInfo *rootRelInfo, int leaf_index)
+{
+	ResultRelInfo **resultRelInfos = proute->partitions;
+	TupleConversionMap **map;
+
+	Assert(proute->child_parent_tupconv_maps != NULL);
+	map = proute->child_parent_tupconv_maps + leaf_index;
+
+	/*
+	 * Either the map is already allocated, or it is yet to be determined if it
+	 * is required.
+	 */
+	if (!*map)
+	{
+		*map =
+			convert_tuples_by_name(RelationGetDescr(resultRelInfos[leaf_index]->ri_RelationDesc),
+								   RelationGetDescr(rootRelInfo->ri_RelationDesc),
+								   gettext_noop("could not convert row type"));
+
+		/* Update the array element with the new info */
+		proute->child_parent_map_not_required[leaf_index] = (*map == NULL);
+	}
+	return *map;
+}
+
+/*
+ * ConvertPartitionTupleSlot -- convenience function for tuple conversion
+ * using 'map'.  The tuple, if converted, is stored in 'new_slot', and
+ * '*p_my_slot' is updated to point to 'new_slot'.  'new_slot' typically
+ * should be one of the dedicated partition tuple slots.  If map is NULL,
+ * 'p_my_slot' is left unchanged.
+ *
+ * Returns the converted tuple, unless map is NULL, in which case original
+ * tuple is returned unmodified.
+ */
+HeapTuple
+ConvertPartitionTupleSlot(TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot,
+						  TupleTableSlot **p_my_slot)
+{
+	if (!map)
+		return tuple;
+
+	tuple = do_convert_tuple(tuple, map);
+
+	/*
+	 * Change the partition tuple slot descriptor, as per converted tuple.
+	 */
+	*p_my_slot = new_slot;
+	Assert(new_slot != NULL);
+	ExecSetSlotDescriptor(new_slot, map->outdesc);
+	ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
+
+	return tuple;
+}
+
+/*
  * ExecCleanupTupleRouting -- Clean up objects allocated for partition tuple
  * routing.
  *
@@ -268,6 +459,7 @@ void
 ExecCleanupTupleRouting(PartitionTupleRouting * proute)
 {
 	int			i;
+	int			subplan_index;
 
 	/*
 	 * Remember, proute->partition_dispatch_info[0] corresponds to the root
@@ -284,15 +476,34 @@ ExecCleanupTupleRouting(PartitionTupleRouting * proute)
 		ExecDropSingleTupleTableSlot(pd->tupslot);
 	}
 
-	for (i = 0; i < proute->num_partitions; i++)
+	for (subplan_index = i = 0; i < proute->num_partitions; i++)
 	{
 		ResultRelInfo *resultRelInfo = proute->partitions[i];
 
+		/*
+		 * If this result rel is one of the UPDATE subplan result rels, let
+		 * ExecEndPlan() close it. For INSERT or COPY,
+		 * proute->subplan_partition_offsets will always be NULL. Note that the
+		 * subplan_partition_offsets array and the partitions array have the
+		 * partitions in the same order. So, while we iterate over partitions
+		 * array, we also iterate over the subplan_partition_offsets array in
+		 * order to get to know which of the result rels are present in the
+		 * UPDATE subplans.
+		 */
+		if (proute->subplan_partition_offsets &&
+			proute->subplan_partition_offsets[subplan_index] == i)
+		{
+			subplan_index++;
+			continue;
+		}
+
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
 
-	/* Release the standalone partition tuple descriptor, if any */
+	/* Release the standalone partition tuple descriptors, if any */
+	if (proute->root_tuple_slot)
+		ExecDropSingleTupleTableSlot(proute->root_tuple_slot);
 	if (proute->partition_tuple_slot)
 		ExecDropSingleTupleTableSlot(proute->partition_tuple_slot);
 }
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index c5eca1b..61e0959 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -62,7 +62,11 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 EState *estate,
 					 bool canSetTag,
 					 TupleTableSlot **returning);
-
+static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
+static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
+static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
+static inline TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
+													int whichplan);
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
  * target relation's rowtype
@@ -265,6 +269,7 @@ ExecInsert(ModifyTableState *mtstate,
 	Oid			newId;
 	List	   *recheckIndexes = NIL;
 	TupleTableSlot *result = NULL;
+	TransitionCaptureState *ar_insert_trig_tcs;
 
 	/*
 	 * get the heap tuple out of the tuple table slot, making sure we have a
@@ -282,7 +287,6 @@ ExecInsert(ModifyTableState *mtstate,
 	{
 		int			leaf_part_index;
 		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-		TupleConversionMap *map;
 
 		/*
 		 * Away we go ... If we end up not finding a partition after all,
@@ -331,8 +335,10 @@ ExecInsert(ModifyTableState *mtstate,
 				 * back to tuplestore format.
 				 */
 				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+
 				mtstate->mt_transition_capture->tcs_map =
-					mtstate->mt_transition_tupconv_maps[leaf_part_index];
+					TupConvMapForLeaf(proute, saved_resultRelInfo,
+									  leaf_part_index);
 			}
 			else
 			{
@@ -345,30 +351,20 @@ ExecInsert(ModifyTableState *mtstate,
 			}
 		}
 		if (mtstate->mt_oc_transition_capture != NULL)
+		{
 			mtstate->mt_oc_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[leaf_part_index];
+				TupConvMapForLeaf(proute, saved_resultRelInfo,
+								  leaf_part_index);
+		}
 
 		/*
 		 * We might need to convert from the parent rowtype to the partition
 		 * rowtype.
 		 */
-		map = proute->partition_tupconv_maps[leaf_part_index];
-		if (map)
-		{
-			Relation	partrel = resultRelInfo->ri_RelationDesc;
-
-			tuple = do_convert_tuple(tuple, map);
-
-			/*
-			 * We must use the partition's tuple descriptor from this point
-			 * on, until we're finished dealing with the partition. Use the
-			 * dedicated slot for that.
-			 */
-			slot = proute->partition_tuple_slot;
-			Assert(slot != NULL);
-			ExecSetSlotDescriptor(slot, RelationGetDescr(partrel));
-			ExecStoreTuple(tuple, slot, InvalidBuffer, true);
-		}
+		tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[leaf_part_index],
+										  tuple,
+										  proute->partition_tuple_slot,
+										  &slot);
 	}
 
 	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -449,6 +445,7 @@ ExecInsert(ModifyTableState *mtstate,
 	}
 	else
 	{
+		WCOKind		wco_kind;
 		/*
 		 * We always check the partition constraint, including when the tuple
 		 * got here via tuple-routing.  However we don't need to in the latter
@@ -466,14 +463,21 @@ ExecInsert(ModifyTableState *mtstate,
 		tuple->t_tableOid = RelationGetRelid(resultRelationDesc);
 
 		/*
-		 * Check any RLS INSERT WITH CHECK policies
+		 * Check any RLS WITH CHECK policies.
 		 *
+		 * Normally we should check INSERT policies. But if the insert is part
+		 * of update-row-movement, we should instead check UPDATE policies,
+		 * because we are executing policies defined on the target table, and
+		 * not those defined on the child partitions.
+		 */
+		wco_kind = (mtstate->operation == CMD_UPDATE) ?
+					WCO_RLS_UPDATE_CHECK : WCO_RLS_INSERT_CHECK;
+		/*
 		 * ExecWithCheckOptions() will skip any WCOs which are not of the kind
 		 * we are looking for at this point.
 		 */
 		if (resultRelInfo->ri_WithCheckOptions != NIL)
-			ExecWithCheckOptions(WCO_RLS_INSERT_CHECK,
-								 resultRelInfo, slot, estate);
+			ExecWithCheckOptions(wco_kind, resultRelInfo, slot, estate);
 
 		/*
 		 * No need though if the tuple has been routed, and a BR trigger
@@ -622,9 +626,33 @@ ExecInsert(ModifyTableState *mtstate,
 		setLastTid(&(tuple->t_self));
 	}
 
+	/*
+	 * If this INSERT is part of a partition-key-UPDATE and we are capturing
+	 * transition tuples, put this row into the transition NEW TABLE.
+	 * (Similarly, the deleted row is added to the OLD TABLE.)  We need to do
+	 * this separately for DELETE and INSERT because they happen on different
+	 * tables.
+	 */
+	ar_insert_trig_tcs = mtstate->mt_transition_capture;
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_new_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo, NULL,
+							 NULL,
+							 tuple,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * We've already captured the NEW TABLE row, so make sure any AR INSERT
+		 * trigger fired below doesn't capture it again.
+		 */
+		ar_insert_trig_tcs = NULL;
+	}
+
 	/* AFTER ROW INSERT Triggers */
 	ExecARInsertTriggers(estate, resultRelInfo, tuple, recheckIndexes,
-						 mtstate->mt_transition_capture);
+						 ar_insert_trig_tcs);
 
 	list_free(recheckIndexes);
 
@@ -678,6 +706,8 @@ ExecDelete(ModifyTableState *mtstate,
 		   TupleTableSlot *planSlot,
 		   EPQState *epqstate,
 		   EState *estate,
+		   bool *tupleDeleted,
+		   bool processReturning,
 		   bool canSetTag)
 {
 	ResultRelInfo *resultRelInfo;
@@ -685,6 +715,10 @@ ExecDelete(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	TupleTableSlot *slot = NULL;
+	TransitionCaptureState *ar_delete_trig_tcs;
+
+	if (tupleDeleted)
+		*tupleDeleted = false;
 
 	/*
 	 * get information on the (current) result relation
@@ -849,12 +883,40 @@ ldelete:;
 	if (canSetTag)
 		(estate->es_processed)++;
 
+	/* Tell the caller that the delete actually happened. */
+	if (tupleDeleted)
+		*tupleDeleted = true;
+
+	/*
+	 * In case this is part of update tuple routing, put this row into the
+	 * transition OLD TABLE, but only if we are capturing transition tuples.
+	 * We need to do this separately for DELETE and INSERT because they happen
+	 * on different tables.
+	 */
+	ar_delete_trig_tcs = mtstate->mt_transition_capture;
+	if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture
+		&& mtstate->mt_transition_capture->tcs_update_old_table)
+	{
+		ExecARUpdateTriggers(estate, resultRelInfo,
+							 tupleid,
+							 oldtuple,
+							 NULL,
+							 NULL,
+							 mtstate->mt_transition_capture);
+
+		/*
+		 * We've already captured the OLD TABLE row, so make sure any AR DELETE
+		 * trigger fired below doesn't capture it again.
+		 */
+		ar_delete_trig_tcs = NULL;
+	}
+
 	/* AFTER ROW DELETE Triggers */
 	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
-						 mtstate->mt_transition_capture);
+						 ar_delete_trig_tcs);
 
-	/* Process RETURNING if present */
-	if (resultRelInfo->ri_projectReturning)
+	/* Process RETURNING if present and if requested */
+	if (processReturning && resultRelInfo->ri_projectReturning)
 	{
 		/*
 		 * We have to put the target tuple into a slot, which means first we
@@ -947,6 +1009,7 @@ ExecUpdate(ModifyTableState *mtstate,
 	HTSU_Result result;
 	HeapUpdateFailureData hufd;
 	List	   *recheckIndexes = NIL;
+	TupleConversionMap *saved_tcs_map = NULL;
 
 	/*
 	 * abort the operation if not running transactions
@@ -1018,6 +1081,7 @@ ExecUpdate(ModifyTableState *mtstate,
 	else
 	{
 		LockTupleMode lockmode;
+		bool		partition_constraint_failed;
 
 		/*
 		 * Constraints might reference the tableoid column, so initialize
@@ -1033,22 +1097,142 @@ ExecUpdate(ModifyTableState *mtstate,
 		 * (We don't need to redo triggers, however.  If there are any BEFORE
 		 * triggers then trigger.c will have done heap_lock_tuple to lock the
 		 * correct tuple, so there's no need to do them again.)
-		 *
-		 * ExecWithCheckOptions() will skip any WCOs which are not of the kind
-		 * we are looking for at this point.
 		 */
 lreplace:;
-		if (resultRelInfo->ri_WithCheckOptions != NIL)
+
+		/*
+		 * If partition constraint fails, this row might get moved to another
+		 * partition, in which case we should check the RLS CHECK policy just
+		 * before inserting into the new partition, rather than doing it here.
+		 * This is because a trigger on that partition might again change the
+		 * row.  So skip the WCO checks if the partition constraint fails.
+		 */
+		partition_constraint_failed =
+			resultRelInfo->ri_PartitionCheck &&
+			!ExecPartitionCheck(resultRelInfo, slot, estate);
+
+		if (!partition_constraint_failed &&
+			resultRelInfo->ri_WithCheckOptions != NIL)
+		{
+			/*
+			 * ExecWithCheckOptions() will skip any WCOs which are not of the
+			 * kind we are looking for at this point.
+			 */
 			ExecWithCheckOptions(WCO_RLS_UPDATE_CHECK,
 								 resultRelInfo, slot, estate);
+		}
+
+		/*
+		 * If a partition check failed, try to move the row into the right
+		 * partition.
+		 */
+		if (partition_constraint_failed)
+		{
+			bool		tuple_deleted;
+			TupleTableSlot *ret_slot;
+			PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+			int			map_index;
+			TupleConversionMap *tupconv_map;
+
+			/*
+			 * When an UPDATE is run on a leaf partition, we will not have
+			 * partition tuple routing set up. In that case, fail with a
+			 * partition constraint violation error.
+			 */
+			if (proute == NULL)
+				ExecPartitionCheckEmitError(resultRelInfo, slot, estate);
+
+			/* Do the row movement. */
+
+			/*
+			 * Skip RETURNING processing for the DELETE; the rows to return
+			 * will come from the subsequent INSERT instead.
+			 */
+			ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
+					   &tuple_deleted, false, false);
+
+			/*
+			 * If the DELETE didn't happen for some reason (e.g. a trigger
+			 * prevented it, it was already deleted by self, or it was
+			 * concurrently deleted by another transaction), we should skip
+			 * the INSERT as well; otherwise, an UPDATE could increase the
+			 * total number of rows across all partitions, which is clearly
+			 * wrong.
+			 *
+			 * For a normal UPDATE, the case where the tuple has been the
+			 * subject of a concurrent UPDATE or DELETE would be handled by the
+			 * EvalPlanQual machinery, but for an UPDATE that we've translated
+			 * into a DELETE from this partition and an INSERT into some other
+			 * partition, that's not available, because CTID chains can't span
+			 * relation boundaries.  We mimic the semantics to a limited extent
+			 * by skipping the INSERT if the DELETE fails to find a tuple. This
+			 * ensures that two concurrent attempts to UPDATE the same tuple at
+			 * the same time can't turn one tuple into two, and that an UPDATE
+			 * of a just-deleted tuple can't resurrect it.
+			 */
+			if (!tuple_deleted)
+				return NULL;
+
+			/*
+			 * Updates set the transition capture map only when a new subplan
+			 * is chosen.  But for inserts, it is set for each row. So after
+			 * INSERT, we need to revert to the map created for UPDATE;
+			 * otherwise the next UPDATE will incorrectly use the one created
+			 * for INSERT.  So first save the one created for UPDATE.
+			 */
+			if (mtstate->mt_transition_capture)
+				saved_tcs_map = mtstate->mt_transition_capture->tcs_map;
+
+			/*
+			 * resultRelInfo is one of the per-subplan resultRelInfos.  So we
+			 * should convert the tuple into root's tuple descriptor, since
+			 * ExecInsert() starts the search from root.  The tuple conversion
+			 * map list is in the order of mtstate->resultRelInfo[], so to
+			 * retrieve the one for this resultRel, we need to know the
+			 * position of the resultRel in mtstate->resultRelInfo[].
+			 */
+			map_index = resultRelInfo - mtstate->resultRelInfo;
+			Assert(map_index >= 0 && map_index < mtstate->mt_nplans);
+			tupconv_map = tupconv_map_for_subplan(mtstate, map_index);
+			tuple = ConvertPartitionTupleSlot(tupconv_map,
+											  tuple,
+											  proute->root_tuple_slot,
+											  &slot);
+
+
+			/*
+			 * For ExecInsert(), make it look like we are inserting into the
+			 * root.
+			 */
+			Assert(mtstate->rootResultRelInfo != NULL);
+			estate->es_result_relation_info = mtstate->rootResultRelInfo;
+
+			ret_slot = ExecInsert(mtstate, slot, planSlot, NULL,
+								  ONCONFLICT_NONE, estate, canSetTag);
+
+			/*
+			 * Revert back the active result relation and the active transition
+			 * capture map that we changed above.
+			 */
+			estate->es_result_relation_info = resultRelInfo;
+			if (mtstate->mt_transition_capture)
+			{
+				mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
+				mtstate->mt_transition_capture->tcs_map = saved_tcs_map;
+			}
+			return ret_slot;
+		}
 
 		/*
 		 * Check the constraints of the tuple.  Note that we pass the same
 		 * slot for the orig_slot argument, because unlike ExecInsert(), no
 		 * tuple-routing is performed here, hence the slot remains unchanged.
+		 * We've already checked the partition constraint above; however, we
+		 * must still ensure the tuple passes all other constraints, so we will
+		 * call ExecConstraints() and have it validate all remaining checks.
 		 */
-		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
-			ExecConstraints(resultRelInfo, slot, estate, true);
+		if (resultRelationDesc->rd_att->constr)
+			ExecConstraints(resultRelInfo, slot, estate, false);
 
 		/*
 		 * replace the heap tuple
@@ -1418,17 +1602,20 @@ fireBSTriggers(ModifyTableState *node)
 }
 
 /*
- * Return the ResultRelInfo for which we will fire AFTER STATEMENT triggers.
- * This is also the relation into whose tuple format all captured transition
- * tuples must be converted.
+ * Return the target rel ResultRelInfo.
+ *
+ * This relation is the same as:
+ * - the relation for which we will fire AFTER STATEMENT triggers.
+ * - the relation into whose tuple format all captured transition tuples must
+ *   be converted.
+ * - the root partitioned table.
  */
 static ResultRelInfo *
-getASTriggerResultRelInfo(ModifyTableState *node)
+getTargetResultRelInfo(ModifyTableState *node)
 {
 	/*
-	 * If the node modifies a partitioned table, we must fire its triggers.
-	 * Note that in that case, node->resultRelInfo points to the first leaf
-	 * partition, not the root table.
+	 * Note that if the node modifies a partitioned table, node->resultRelInfo
+	 * points to the first leaf partition, not the root table.
 	 */
 	if (node->rootResultRelInfo != NULL)
 		return node->rootResultRelInfo;
@@ -1442,7 +1629,7 @@ getASTriggerResultRelInfo(ModifyTableState *node)
 static void
 fireASTriggers(ModifyTableState *node)
 {
-	ResultRelInfo *resultRelInfo = getASTriggerResultRelInfo(node);
+	ResultRelInfo *resultRelInfo = getTargetResultRelInfo(node);
 
 	switch (node->operation)
 	{
@@ -1475,8 +1662,7 @@ fireASTriggers(ModifyTableState *node)
 static void
 ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 {
-	ResultRelInfo *targetRelInfo = getASTriggerResultRelInfo(mtstate);
-	int			i;
+	ResultRelInfo *targetRelInfo = getTargetResultRelInfo(mtstate);
 
 	/* Check for transition tables on the directly targeted relation. */
 	mtstate->mt_transition_capture =
@@ -1499,62 +1685,140 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	if (mtstate->mt_transition_capture != NULL ||
 		mtstate->mt_oc_transition_capture != NULL)
 	{
-		int			numResultRelInfos;
-		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-		numResultRelInfos = (proute != NULL ?
-							 proute->num_partitions :
-							 mtstate->mt_nplans);
+		ExecSetupChildParentMapForTcs(mtstate);
 
 		/*
-		 * Build array of conversion maps from each child's TupleDesc to the
-		 * one used in the tuplestore.  The map pointers may be NULL when no
-		 * conversion is necessary, which is hopefully a common case for
-		 * partitions.
+		 * Install the conversion map for the first plan for UPDATE and DELETE
+		 * operations.  It will be advanced each time we switch to the next
+		 * plan.  (INSERT operations set it every time, so we need not update
+		 * mtstate->mt_oc_transition_capture here.)
 		 */
-		mtstate->mt_transition_tupconv_maps = (TupleConversionMap **)
-			palloc0(sizeof(TupleConversionMap *) * numResultRelInfos);
+		if (mtstate->mt_transition_capture && mtstate->operation != CMD_INSERT)
+			mtstate->mt_transition_capture->tcs_map =
+				tupconv_map_for_subplan(mtstate, 0);
+	}
+}
 
-		/* Choose the right set of partitions */
-		if (proute != NULL)
-		{
-			/*
-			 * For tuple routing among partitions, we need TupleDescs based on
-			 * the partition routing table.
-			 */
-			ResultRelInfo **resultRelInfos = proute->partitions;
+/*
+ * Initialize the child-to-root tuple conversion map array for UPDATE subplans.
+ *
+ * This map array is required to convert the tuple from the subplan result rel
+ * to the target table descriptor. This requirement arises for two independent
+ * scenarios:
+ * 1. For update-tuple-routing.
+ * 2. For capturing tuples in transition tables.
+ */
+static void
+ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
+{
+	ResultRelInfo *targetRelInfo = getTargetResultRelInfo(mtstate);
+	ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+	TupleDesc	outdesc;
+	int			numResultRelInfos = mtstate->mt_nplans;
+	int			i;
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i]->ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
-		}
-		else
-		{
-			/* Otherwise we need the ResultRelInfo for each subplan. */
-			ResultRelInfo *resultRelInfos = mtstate->resultRelInfo;
+	/*
+	 * First check if there is already a per-subplan array allocated. Even if
+	 * there is already a per-leaf map array, we won't require a per-subplan
+	 * one, since we will use the subplan offset array to convert the subplan
+	 * index to per-leaf index.
+	 */
+	if (mtstate->mt_per_subplan_tupconv_maps ||
+		(mtstate->mt_partition_tuple_routing &&
+		mtstate->mt_partition_tuple_routing->child_parent_tupconv_maps))
+		return;
 
-			for (i = 0; i < numResultRelInfos; ++i)
-			{
-				mtstate->mt_transition_tupconv_maps[i] =
-					convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
-										   RelationGetDescr(targetRelInfo->ri_RelationDesc),
-										   gettext_noop("could not convert row type"));
-			}
-		}
+	/*
+	 * Build array of conversion maps from each child's TupleDesc to the one
+	 * used in the target relation.  The map pointers may be NULL when
+	 * no conversion is necessary, which is hopefully a common case.
+	 */
 
+	/* Get tuple descriptor of the target rel. */
+	outdesc = RelationGetDescr(targetRelInfo->ri_RelationDesc);
+
+	mtstate->mt_per_subplan_tupconv_maps = (TupleConversionMap **)
+		palloc(sizeof(TupleConversionMap *) * numResultRelInfos);
+
+	for (i = 0; i < numResultRelInfos; ++i)
+	{
+		mtstate->mt_per_subplan_tupconv_maps[i] =
+			convert_tuples_by_name(RelationGetDescr(resultRelInfos[i].ri_RelationDesc),
+								   outdesc,
+								   gettext_noop("could not convert row type"));
+	}
+}
+
+/*
+ * Initialize the child-to-root tuple conversion map array required for
+ * capturing transition tuples.
+ *
+ * The map array can be indexed either by subplan index or by leaf-partition
+ * index.  For transition tables, we need subplan-indexed access to the map,
+ * and where tuple routing is present, we also require leaf-indexed access.
+ */
+static void
+ExecSetupChildParentMapForTcs(ModifyTableState *mtstate)
+{
+	PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+
+	/*
+	 * If partition tuple routing is set up, we will require partition-indexed
+	 * access. In that case, create the map array indexed by partition; we will
+	 * still be able to access the maps using a subplan index by converting the
+	 * subplan index to a partition index using 'subplan_partition_offsets'. If
+	 * tuple routing is not set up, we don't require partition-indexed
+	 * access. In that case, create just a subplan-indexed map.
+	 */
+	if (proute)
+	{
 		/*
-		 * Install the conversion map for the first plan for UPDATE and DELETE
-		 * operations.  It will be advanced each time we switch to the next
-		 * plan.  (INSERT operations set it every time, so we need not update
-		 * mtstate->mt_oc_transition_capture here.)
+		 * If a partition-indexed map array is to be created, the subplan map
+		 * array has to be NULL.  If the subplan map array is already created,
+		 * we won't be able to access the map using a partition index.
 		 */
-		if (mtstate->mt_transition_capture)
-			mtstate->mt_transition_capture->tcs_map =
-				mtstate->mt_transition_tupconv_maps[0];
+		Assert(mtstate->mt_per_subplan_tupconv_maps == NULL);
+
+		ExecSetupChildParentMapForLeaf(proute);
+	}
+	else
+		ExecSetupChildParentMapForSubplan(mtstate);
+}
+
+/*
+ * For a given subplan index, get the tuple conversion map.
+ */
+static inline TupleConversionMap *
+tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
+{
+	/*
+	 * If a partition-indexed tuple conversion map array is allocated, we
+	 * need to first translate the subplan index into a partition index.
+	 * Exactly *one* of the two arrays is allocated: if a partition-indexed
+	 * array is required, a subplan-indexed one is unnecessary, since we can
+	 * translate a subplan index into a partition index; and we create a
+	 * subplan-indexed array *only* if the partition-indexed array is not required.
+	 */
+	if (mtstate->mt_per_subplan_tupconv_maps == NULL)
+	{
+		int		leaf_index;
+		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+
+		/*
+		 * If the subplan-indexed array is NULL, things should have been
+		 * arranged to convert the subplan index to a partition index.
+		 */
+		Assert(proute && proute->subplan_partition_offsets != NULL);
+
+		leaf_index = proute->subplan_partition_offsets[whichplan];
+
+		return TupConvMapForLeaf(proute, getTargetResultRelInfo(mtstate),
+								 leaf_index);
+	}
+	else
+	{
+		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+		return mtstate->mt_per_subplan_tupconv_maps[whichplan];
 	}
 }
 
@@ -1661,15 +1925,13 @@ ExecModifyTable(PlanState *pstate)
 				/* Prepare to convert transition tuples from this child. */
 				if (node->mt_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
 					node->mt_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						tupconv_map_for_subplan(node, node->mt_whichplan);
 				}
 				if (node->mt_oc_transition_capture != NULL)
 				{
-					Assert(node->mt_transition_tupconv_maps != NULL);
 					node->mt_oc_transition_capture->tcs_map =
-						node->mt_transition_tupconv_maps[node->mt_whichplan];
+						tupconv_map_for_subplan(node, node->mt_whichplan);
 				}
 				continue;
 			}
@@ -1786,7 +2048,8 @@ ExecModifyTable(PlanState *pstate)
 				break;
 			case CMD_DELETE:
 				slot = ExecDelete(node, tupleid, oldtuple, planSlot,
-								  &node->mt_epqstate, estate, node->canSetTag);
+								  &node->mt_epqstate, estate,
+								  NULL, true, node->canSetTag);
 				break;
 			default:
 				elog(ERROR, "unknown operation");
@@ -1830,9 +2093,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ResultRelInfo *saved_resultRelInfo;
 	ResultRelInfo *resultRelInfo;
 	Plan	   *subplan;
+	int			firstVarno = 0;
+	Relation	firstResultRel = NULL;
 	ListCell   *l;
 	int			i;
 	Relation	rel;
+	bool		update_tuple_routing_needed = node->partColsUpdated;
 	PartitionTupleRouting *proute = NULL;
 	int			num_partitions = 0;
 
@@ -1907,6 +2173,16 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			resultRelInfo->ri_IndexRelationDescs == NULL)
 			ExecOpenIndices(resultRelInfo, mtstate->mt_onconflict != ONCONFLICT_NONE);
 
+		/*
+		 * If this is an UPDATE and a BEFORE UPDATE trigger is present, the
+		 * trigger itself might modify the partition-key values. So arrange for
+		 * tuple routing.
+		 */
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+			operation == CMD_UPDATE)
+			update_tuple_routing_needed = true;
+
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
 		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
@@ -1931,16 +2207,35 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 
 	estate->es_result_relation_info = saved_resultRelInfo;
 
-	/* Build state for INSERT tuple routing */
-	rel = mtstate->resultRelInfo->ri_RelationDesc;
-	if (operation == CMD_INSERT &&
-		rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	/* Get the target relation */
+	rel = (getTargetResultRelInfo(mtstate))->ri_RelationDesc;
+
+	/*
+	 * If it's not a partitioned table after all, UPDATE tuple routing should
+	 * not be attempted.
+	 */
+	if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		update_tuple_routing_needed = false;
+
+	/*
+	 * Build state for tuple routing if it's an INSERT or if it's an UPDATE of
+	 * partition key.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
+		(operation == CMD_INSERT || update_tuple_routing_needed))
 	{
 		proute = mtstate->mt_partition_tuple_routing =
 			ExecSetupPartitionTupleRouting(mtstate,
 										   rel, node->nominalRelation,
 										   estate);
 		num_partitions = proute->num_partitions;
+
+		/*
+		 * Below are required as reference objects for mapping partition
+		 * attno's in expressions such as WithCheckOptions and RETURNING.
+		 */
+		firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
+		firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	}
 
 	/*
@@ -1951,6 +2246,17 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		ExecSetupTransitionCaptureState(mtstate, estate);
 
 	/*
+	 * Construct mapping from each of the per-subplan partition attnos to the
+	 * root attno.  This is required when, during update row movement, the
+	 * tuple descriptor of a source partition does not match that of the root
+	 * partitioned table.  In such a case we need to convert tuples to the root
+	 * tuple descriptor, because the search for destination partition starts
+	 * from the root.  Skip this setup if it's not a partition key update.
+	 */
+	if (update_tuple_routing_needed)
+		ExecSetupChildParentMapForSubplan(mtstate);
+
+	/*
 	 * Initialize any WITH CHECK OPTION constraints if needed.
 	 */
 	resultRelInfo = mtstate->resultRelInfo;
@@ -1980,26 +2286,29 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * Build WITH CHECK OPTION constraints for each leaf partition rel. Note
 	 * that we didn't build the withCheckOptionList for each partition within
 	 * the planner, but simple translation of the varattnos for each partition
-	 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-	 * cases are handled above.
+	 * will suffice.  This only occurs for the INSERT case or for UPDATE row
+	 * movement. DELETEs and local UPDATEs are handled above.
 	 */
 	if (node->withCheckOptionLists != NIL && num_partitions > 0)
 	{
-		List	   *wcoList;
-		PlanState  *plan;
+		List	   *first_wcoList;
 
 		/*
 		 * In case of INSERT on partitioned tables, there is only one plan.
 		 * Likewise, there is only one WITH CHECK OPTIONS list, not one per
-		 * partition.  We make a copy of the WCO qual for each partition; note
-		 * that, if there are SubPlans in there, they all end up attached to
-		 * the one parent Plan node.
+		 * partition. Whereas for UPDATE, there are as many WCOs as there are
+		 * plans. In either case, use the WCO expression of the first
+		 * resultRelInfo as a reference to calculate attnos for the WCO
+		 * expression of each of the partitions. We make a copy of the WCO
+		 * qual for each partition. Note that, if there are SubPlans in there,
+		 * they all end up attached to the one parent Plan node.
 		 */
-		Assert(operation == CMD_INSERT &&
-			   list_length(node->withCheckOptionLists) == 1 &&
-			   mtstate->mt_nplans == 1);
-		wcoList = linitial(node->withCheckOptionLists);
-		plan = mtstate->mt_plans[0];
+		Assert(update_tuple_routing_needed ||
+			   (operation == CMD_INSERT &&
+				list_length(node->withCheckOptionLists) == 1 &&
+				mtstate->mt_nplans == 1));
+
+		first_wcoList = linitial(node->withCheckOptionLists);
 		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
@@ -2008,17 +2317,26 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			ListCell   *ll;
 
 			resultRelInfo = proute->partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the update
+			 * result rels, that result rel would already have WithCheckOptions
+			 * initialized.
+			 */
+			if (resultRelInfo->ri_WithCheckOptions)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			mapped_wcoList = map_partition_varattnos(wcoList,
-													 node->nominalRelation,
-													 partrel, rel, NULL);
+			mapped_wcoList = map_partition_varattnos(first_wcoList,
+													firstVarno,
+													partrel, firstResultRel,
+													NULL);
 			foreach(ll, mapped_wcoList)
 			{
 				WithCheckOption *wco = castNode(WithCheckOption, lfirst(ll));
 				ExprState  *wcoExpr = ExecInitQual(castNode(List, wco->qual),
-												   plan);
+												   &mtstate->ps);
 
 				wcoExprs = lappend(wcoExprs, wcoExpr);
 			}
@@ -2035,7 +2353,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	{
 		TupleTableSlot *slot;
 		ExprContext *econtext;
-		List	   *returningList;
+		List	   *firstReturningList;
 
 		/*
 		 * Initialize result tuple slot and assign its rowtype using the first
@@ -2071,22 +2389,35 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		 * Build a projection for each leaf partition rel.  Note that we
 		 * didn't build the returningList for each partition within the
 		 * planner, but simple translation of the varattnos for each partition
-		 * will suffice.  This only occurs for the INSERT case; UPDATE/DELETE
-		 * are handled above.
+		 * will suffice.  This only occurs for the INSERT case or for UPDATE
+		 * row movement. DELETEs and local UPDATEs are handled above.
 		 */
-		returningList = linitial(node->returningLists);
+		firstReturningList = linitial(node->returningLists);
 		for (i = 0; i < num_partitions; i++)
 		{
 			Relation	partrel;
 			List	   *rlist;
 
 			resultRelInfo = proute->partitions[i];
+
+			/*
+			 * If we are referring to a resultRelInfo from one of the update
+			 * result rels, that result rel would already have a returningList
+			 * built.
+			 */
+			if (resultRelInfo->ri_projectReturning)
+				continue;
+
 			partrel = resultRelInfo->ri_RelationDesc;
 
-			/* varno = node->nominalRelation */
-			rlist = map_partition_varattnos(returningList,
-											node->nominalRelation,
-											partrel, rel, NULL);
+			/*
+			 * Use the returning expression of the first resultRelInfo as a
+			 * reference to calculate attnos for the returning expression of
+			 * each of the partitions.
+			 */
+			rlist = map_partition_varattnos(firstReturningList,
+											firstVarno,
+											partrel, firstResultRel, NULL);
 			resultRelInfo->ri_projectReturning =
 				ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
 										resultRelInfo->ri_RelationDesc->rd_att);
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index ddbbc79..747e545 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -204,6 +204,7 @@ _copyModifyTable(const ModifyTable *from)
 	COPY_SCALAR_FIELD(canSetTag);
 	COPY_SCALAR_FIELD(nominalRelation);
 	COPY_NODE_FIELD(partitioned_rels);
+	COPY_SCALAR_FIELD(partColsUpdated);
 	COPY_NODE_FIELD(resultRelations);
 	COPY_SCALAR_FIELD(resultRelIndex);
 	COPY_SCALAR_FIELD(rootResultRelIndex);
@@ -2263,6 +2264,7 @@ _copyPartitionedChildRelInfo(const PartitionedChildRelInfo *from)
 
 	COPY_SCALAR_FIELD(parent_relid);
 	COPY_NODE_FIELD(child_rels);
+	COPY_SCALAR_FIELD(part_cols_updated);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 30ccc9c..99b554a 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -908,6 +908,7 @@ _equalPartitionedChildRelInfo(const PartitionedChildRelInfo *a, const Partitione
 {
 	COMPARE_SCALAR_FIELD(parent_relid);
 	COMPARE_NODE_FIELD(child_rels);
+	COMPARE_SCALAR_FIELD(part_cols_updated);
 
 	return true;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 5e72df1..b35bce3 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -372,6 +372,7 @@ _outModifyTable(StringInfo str, const ModifyTable *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(partColsUpdated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_INT_FIELD(resultRelIndex);
 	WRITE_INT_FIELD(rootResultRelIndex);
@@ -2105,6 +2106,7 @@ _outModifyTablePath(StringInfo str, const ModifyTablePath *node)
 	WRITE_BOOL_FIELD(canSetTag);
 	WRITE_UINT_FIELD(nominalRelation);
 	WRITE_NODE_FIELD(partitioned_rels);
+	WRITE_BOOL_FIELD(partColsUpdated);
 	WRITE_NODE_FIELD(resultRelations);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_NODE_FIELD(subroots);
@@ -2527,6 +2529,7 @@ _outPartitionedChildRelInfo(StringInfo str, const PartitionedChildRelInfo *node)
 
 	WRITE_UINT_FIELD(parent_relid);
 	WRITE_NODE_FIELD(child_rels);
+	WRITE_BOOL_FIELD(part_cols_updated);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 9925866..22d8b9d 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1568,6 +1568,7 @@ _readModifyTable(void)
 	READ_BOOL_FIELD(canSetTag);
 	READ_UINT_FIELD(nominalRelation);
 	READ_NODE_FIELD(partitioned_rels);
+	READ_BOOL_FIELD(partColsUpdated);
 	READ_NODE_FIELD(resultRelations);
 	READ_INT_FIELD(resultRelIndex);
 	READ_INT_FIELD(rootResultRelIndex);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index c5304b7..fd1a583 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1364,7 +1364,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 			case RTE_RELATION:
 				if (rte->relkind == RELKIND_PARTITIONED_TABLE)
 					partitioned_rels =
-						get_partitioned_child_rels(root, rel->relid);
+						get_partitioned_child_rels(root, rel->relid, NULL);
 				break;
 			case RTE_SUBQUERY:
 				build_partitioned_rels = true;
@@ -1403,7 +1403,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		{
 			List	   *cprels;
 
-			cprels = get_partitioned_child_rels(root, childrel->relid);
+			cprels = get_partitioned_child_rels(root, childrel->relid, NULL);
 			partitioned_rels = list_concat(partitioned_rels,
 										   list_copy(cprels));
 		}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e599283..86e7e74 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -279,6 +279,7 @@ static ProjectSet *make_project_set(List *tlist, Plan *subplan);
 static ModifyTable *make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool partColsUpdated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
@@ -2373,6 +2374,7 @@ create_modifytable_plan(PlannerInfo *root, ModifyTablePath *best_path)
 							best_path->canSetTag,
 							best_path->nominalRelation,
 							best_path->partitioned_rels,
+							best_path->partColsUpdated,
 							best_path->resultRelations,
 							subplans,
 							best_path->withCheckOptionLists,
@@ -6442,6 +6444,7 @@ static ModifyTable *
 make_modifytable(PlannerInfo *root,
 				 CmdType operation, bool canSetTag,
 				 Index nominalRelation, List *partitioned_rels,
+				 bool partColsUpdated,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam)
@@ -6468,6 +6471,7 @@ make_modifytable(PlannerInfo *root,
 	node->canSetTag = canSetTag;
 	node->nominalRelation = nominalRelation;
 	node->partitioned_rels = partitioned_rels;
+	node->partColsUpdated = partColsUpdated;
 	node->resultRelations = resultRelations;
 	node->resultRelIndex = -1;	/* will be set correctly in setrefs.c */
 	node->rootResultRelIndex = -1;	/* will be set correctly in setrefs.c */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 7b52dad..5387043 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1101,6 +1101,7 @@ inheritance_planner(PlannerInfo *root)
 	Query	   *parent_parse;
 	Bitmapset  *parent_relids = bms_make_singleton(top_parentRTindex);
 	PlannerInfo **parent_roots = NULL;
+	bool		partColsUpdated = false;
 
 	Assert(parse->commandType != CMD_INSERT);
 
@@ -1172,7 +1173,8 @@ inheritance_planner(PlannerInfo *root)
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		nominalRelation = top_parentRTindex;
-		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex);
+		partitioned_rels = get_partitioned_child_rels(root, top_parentRTindex,
+													  &partColsUpdated);
 		/* The root partitioned table is included as a child rel */
 		Assert(list_length(partitioned_rels) >= 1);
 	}
@@ -1512,6 +1514,7 @@ inheritance_planner(PlannerInfo *root)
 									 parse->canSetTag,
 									 nominalRelation,
 									 partitioned_rels,
+									 partColsUpdated,
 									 resultRelations,
 									 subpaths,
 									 subroots,
@@ -2123,6 +2126,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 										parse->canSetTag,
 										parse->resultRelation,
 										NIL,
+										false,
 										list_make1_int(parse->resultRelation),
 										list_make1(path),
 										list_make1(root),
@@ -6155,17 +6159,24 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 /*
  * get_partitioned_child_rels
  *		Returns a list of the RT indexes of the partitioned child relations
- *		with rti as the root parent RT index.
+ *		with rti as the root parent RT index.  Also sets *part_cols_updated
+ *		to true if any of the root rte's updated columns is used in the
+ *		partition key of the relation whose RTI is specified or of any of
+ *		its child relations.
  *
  * Note: This function might get called even for range table entries that
  * are not partitioned tables; in such a case, it will simply return NIL.
  */
 List *
-get_partitioned_child_rels(PlannerInfo *root, Index rti)
+get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   bool *part_cols_updated)
 {
 	List	   *result = NIL;
 	ListCell   *l;
 
+	if (part_cols_updated)
+		*part_cols_updated = false;
+
 	foreach(l, root->pcinfo_list)
 	{
 		PartitionedChildRelInfo *pc = lfirst_node(PartitionedChildRelInfo, l);
@@ -6173,6 +6184,8 @@ get_partitioned_child_rels(PlannerInfo *root, Index rti)
 		if (pc->parent_relid == rti)
 		{
 			result = pc->child_rels;
+			if (part_cols_updated)
+				*part_cols_updated = pc->part_cols_updated;
 			break;
 		}
 	}
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 7ef391f..e6b1534 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -105,7 +105,8 @@ static void expand_partitioned_rtentry(PlannerInfo *root,
 						   RangeTblEntry *parentrte,
 						   Index parentRTindex, Relation parentrel,
 						   PlanRowMark *top_parentrc, LOCKMODE lockmode,
-						   List **appinfos, List **partitioned_child_rels);
+						   List **appinfos, List **partitioned_child_rels,
+						   bool *part_cols_updated);
 static void expand_single_inheritance_child(PlannerInfo *root,
 								RangeTblEntry *parentrte,
 								Index parentRTindex, Relation parentrel,
@@ -1461,16 +1462,19 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	if (RelationGetPartitionDesc(oldrelation) != NULL)
 	{
 		List	   *partitioned_child_rels = NIL;
+		bool		part_cols_updated = false;
 
 		Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);
 
 		/*
 		 * If this table has partitions, recursively expand them in the order
-		 * in which they appear in the PartitionDesc.
+		 * in which they appear in the PartitionDesc.  While at it, also
+		 * note whether the partition key of any of the partitioned tables
+		 * is being updated.
 		 */
 		expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
 								   lockmode, &root->append_rel_list,
-								   &partitioned_child_rels);
+								   &partitioned_child_rels,
+								   &part_cols_updated);
 
 		/*
 		 * We keep a list of objects in root, each of which maps a root
@@ -1487,6 +1491,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 			pcinfo = makeNode(PartitionedChildRelInfo);
 			pcinfo->parent_relid = rti;
 			pcinfo->child_rels = partitioned_child_rels;
+			pcinfo->part_cols_updated = part_cols_updated;
 			root->pcinfo_list = lappend(root->pcinfo_list, pcinfo);
 		}
 	}
@@ -1563,7 +1568,8 @@ static void
 expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 						   Index parentRTindex, Relation parentrel,
 						   PlanRowMark *top_parentrc, LOCKMODE lockmode,
-						   List **appinfos, List **partitioned_child_rels)
+						   List **appinfos, List **partitioned_child_rels,
+						   bool *part_cols_updated)
 {
 	int			i;
 	RangeTblEntry *childrte;
@@ -1578,6 +1584,17 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 
 	Assert(parentrte->inh);
 
+	/*
+	 * Note down whether any partition key cols are being updated. Though it's
+	 * the root partitioned table's updatedCols we are interested in, we
+	 * instead use parentrte to get the updatedCols. This is convenient because
+	 * parentrte already has the root partrel's updatedCols translated to match
+	 * the attribute ordering of parentrel.
+	 */
+	if (!*part_cols_updated)
+		*part_cols_updated =
+			has_partition_attrs(parentrel, parentrte->updatedCols, NULL);
+
 	/* First expand the partitioned table itself. */
 	expand_single_inheritance_child(root, parentrte, parentRTindex, parentrel,
 									top_parentrc, parentrel,
@@ -1617,7 +1634,8 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 		if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 			expand_partitioned_rtentry(root, childrte, childRTindex,
 									   childrel, top_parentrc, lockmode,
-									   appinfos, partitioned_child_rels);
+									   appinfos, partitioned_child_rels,
+									   part_cols_updated);
 
 		/* Close child relation, but keep locks */
 		heap_close(childrel, NoLock);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index fa4b468..91295eb 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3274,6 +3274,8 @@ create_lockrows_path(PlannerInfo *root, RelOptInfo *rel,
  * 'partitioned_rels' is an integer list of RT indexes of non-leaf tables in
  *		the partition tree, if this is an UPDATE/DELETE to a partitioned table.
  *		Otherwise NIL.
+ * 'partColsUpdated' is true if any partitioning columns are being updated,
+ *		either from the target relation or a descendant partitioned table.
  * 'resultRelations' is an integer list of actual RT indexes of target rel(s)
  * 'subpaths' is a list of Path(s) producing source data (one per rel)
  * 'subroots' is a list of PlannerInfo structs (one per rel)
@@ -3287,6 +3289,7 @@ ModifyTablePath *
 create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool partColsUpdated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
@@ -3354,6 +3357,7 @@ create_modifytable_path(PlannerInfo *root, RelOptInfo *rel,
 	pathnode->canSetTag = canSetTag;
 	pathnode->nominalRelation = nominalRelation;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
+	pathnode->partColsUpdated = partColsUpdated;
 	pathnode->resultRelations = resultRelations;
 	pathnode->subpaths = subpaths;
 	pathnode->subroots = subroots;
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index b5df357..5aede76 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -62,11 +62,24 @@ typedef struct PartitionDispatchData *PartitionDispatch;
  *								for every leaf partition in the partition tree.
  * num_partitions				Number of leaf partitions in the partition tree
  *								(= 'partitions' array length)
- * partition_tupconv_maps		Array of TupleConversionMap objects with one
+ * parent_child_tupconv_maps	Array of TupleConversionMap objects with one
  *								entry for every leaf partition (required to
  *								convert input tuple based on the root table's
  *								rowtype to a leaf partition's rowtype after
  *								tuple routing is done)
+ * child_parent_tupconv_maps	Array of TupleConversionMap objects with one
+ *								entry for every leaf partition (required to
+ *								convert input tuple based on the leaf
+ *								partition's rowtype to the root table's rowtype
+ *								after tuple routing is done)
+ * child_parent_map_not_required  Array of bool. A true value means that a map
+ *								has been determined to be unnecessary for the
+ *								given partition. False means either that we
+ *								haven't yet checked whether a map is required,
+ *								or that one is required.
+ * subplan_partition_offsets	int array, ordered by UPDATE subplan. Each
+ *								element contains the index of that subplan's
+ *								partition in the 'partitions' array.
  * partition_tuple_slot			TupleTableSlot to be used to manipulate any
  *								given leaf partition's rowtype after that
  *								partition is chosen for insertion by
@@ -79,10 +92,25 @@ typedef struct PartitionTupleRouting
 	int			num_dispatch;
 	ResultRelInfo **partitions;
 	int			num_partitions;
-	TupleConversionMap **partition_tupconv_maps;
+	TupleConversionMap **parent_child_tupconv_maps;
+	TupleConversionMap **child_parent_tupconv_maps;
+	bool	   *child_parent_map_not_required;
+	int		   *subplan_partition_offsets;
 	TupleTableSlot *partition_tuple_slot;
+	TupleTableSlot *root_tuple_slot;
 } PartitionTupleRouting;
 
+/*
+ * TupConvMapForLeaf -- For a given leaf partition index, get the tuple
+ * conversion map.
+ *
+ * If it is already determined that the map is not required, return NULL;
+ * else create one if not already created.
+ */
+#define TupConvMapForLeaf(proute, rootRelInfo, leaf_index)					\
+	((proute)->child_parent_map_not_required[(leaf_index)] ?				\
+	NULL : CreateTupConvMapForLeaf((proute), (rootRelInfo), (leaf_index)))
+
 extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel, Index resultRTindex,
 							   EState *estate);
@@ -90,6 +118,13 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
+extern void ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute);
+extern TupleConversionMap *CreateTupConvMapForLeaf(PartitionTupleRouting *proute,
+						ResultRelInfo *rootRelInfo, int leaf_index);
+extern HeapTuple ConvertPartitionTupleSlot(TupleConversionMap *map,
+						  HeapTuple tuple,
+						  TupleTableSlot *new_slot,
+						  TupleTableSlot **p_my_slot);
 extern void ExecCleanupTupleRouting(PartitionTupleRouting *proute);
 
 #endif							/* EXECPARTITION_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 4bb5cb1..defd5cd 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -991,8 +991,8 @@ typedef struct ModifyTableState
 	/* controls transition table population for specified operation */
 	struct TransitionCaptureState *mt_oc_transition_capture;
 	/* controls transition table population for INSERT...ON CONFLICT UPDATE */
-	TupleConversionMap **mt_transition_tupconv_maps;
-	/* Per plan/partition tuple conversion */
+	TupleConversionMap **mt_per_subplan_tupconv_maps;
+	/* Per plan map for tuple conversion from child to root */
 } ModifyTableState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 74e9fb5..baf3c07 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -219,6 +219,7 @@ typedef struct ModifyTable
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		partColsUpdated;	/* some part key in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	int			resultRelIndex; /* index of first resultRel in plan's list */
 	int			rootResultRelIndex; /* index of the partitioned table root */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 71689b8..6bf68f3 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1674,6 +1674,7 @@ typedef struct ModifyTablePath
 	Index		nominalRelation;	/* Parent RT index for use of EXPLAIN */
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
+	bool		partColsUpdated;	/* some part key in hierarchy updated */
 	List	   *resultRelations;	/* integer list of RT indexes */
 	List	   *subpaths;		/* Path(s) producing source data */
 	List	   *subroots;		/* per-target-table PlannerInfos */
@@ -2124,6 +2125,8 @@ typedef struct PartitionedChildRelInfo
 
 	Index		parent_relid;
 	List	   *child_rels;
+	bool		part_cols_updated;	/* is the partition key of any of
+									 * the partitioned tables updated? */
 } PartitionedChildRelInfo;
 
 /*
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 725694f..ef7173f 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -242,6 +242,7 @@ extern ModifyTablePath *create_modifytable_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						CmdType operation, bool canSetTag,
 						Index nominalRelation, List *partitioned_rels,
+						bool partColsUpdated,
 						List *resultRelations, List *subpaths,
 						List *subroots,
 						List *withCheckOptionLists, List *returningLists,
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index 997b91f..29173d3 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -57,7 +57,8 @@ extern Expr *preprocess_phv_expression(PlannerInfo *root, Expr *expr);
 
 extern bool plan_cluster_use_sort(Oid tableOid, Oid indexOid);
 
-extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti);
+extern List *get_partitioned_child_rels(PlannerInfo *root, Index rti,
+						   bool *part_cols_updated);
 extern List *get_partitioned_child_rels_for_join(PlannerInfo *root,
 									Relids join_relids);
 
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index b69ceaa..95aa0e8 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -198,36 +198,479 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 
 DROP TABLE update_test;
 DROP TABLE upsert_test;
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+---------------------------
+-- UPDATE with row movement
+---------------------------
+-- When a partitioned table receives an UPDATE to the partition key and the
+-- new values no longer meet the partition's bound, the row must be moved to
+-- the correct partition for the new partition key (if one exists). We must
+-- also ensure that updatable views on partitioned tables properly enforce any
+-- WITH CHECK OPTION that is defined. The situation with triggers in this case
+-- also requires thorough testing as partition key updates causing row
+-- movement convert UPDATEs into DELETE+INSERT.
+CREATE TABLE range_parted (
 	a text,
-	b int
-) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
-create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-ERROR:  new row for relation "part_a_1_a_10" violates partition constraint
-DETAIL:  Failing row contains (b, 1).
-update range_parted set b = b - 1 where b = 10;
-ERROR:  new row for relation "part_b_10_b_20" violates partition constraint
-DETAIL:  Failing row contains (b, 9).
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
+) PARTITION BY RANGE (a, b);
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+CREATE TABLE part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+ALTER TABLE range_parted ATTACH PARTITION part_b_20_b_30 FOR VALUES FROM ('b', 20) TO ('b', 30);
+CREATE TABLE part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) PARTITION BY RANGE (c);
+CREATE TABLE part_b_1_b_10 PARTITION OF range_parted FOR VALUES FROM ('b', 1) TO ('b', 10);
+ALTER TABLE range_parted ATTACH PARTITION part_b_10_b_20 FOR VALUES FROM ('b', 10) TO ('b', 20);
+CREATE TABLE part_a_10_a_20 PARTITION OF range_parted FOR VALUES FROM ('a', 10) TO ('a', 20);
+CREATE TABLE part_a_1_a_10 PARTITION OF range_parted FOR VALUES FROM ('a', 1) TO ('a', 10);
+-- Check that partition-key UPDATE works sanely on a partitioned table that
+-- does not have any child partitions.
+UPDATE part_b_10_b_20 set b = b - 6;
+-- Create some more partitions following the above pattern of descending bound
+-- order, but let's make the situation a bit more complex by having the
+-- attribute numbers of the columns vary from their parent partition.
+CREATE TABLE part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) PARTITION BY range (abs(d));
+ALTER TABLE part_c_100_200 DROP COLUMN e, DROP COLUMN c, DROP COLUMN a;
+ALTER TABLE part_c_100_200 ADD COLUMN c numeric, ADD COLUMN e varchar, ADD COLUMN a text;
+ALTER TABLE part_c_100_200 DROP COLUMN b;
+ALTER TABLE part_c_100_200 ADD COLUMN b bigint;
+CREATE TABLE part_d_1_15 PARTITION OF part_c_100_200 FOR VALUES FROM (1) TO (15);
+CREATE TABLE part_d_15_20 PARTITION OF part_c_100_200 FOR VALUES FROM (15) TO (20);
+ALTER TABLE part_b_10_b_20 ATTACH PARTITION part_c_100_200 FOR VALUES FROM (100) TO (200);
+CREATE TABLE part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+ALTER TABLE part_b_10_b_20 ATTACH PARTITION part_c_1_100 FOR VALUES FROM (1) TO (100);
+\set init_range_parted 'truncate range_parted; insert into range_parted VALUES (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted ORDER BY 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+(6 rows)
+
+-- The subplans should appear in partition bound order
+EXPLAIN (costs off) UPDATE range_parted set c = c - 50 WHERE c > 97;
+             QUERY PLAN              
+-------------------------------------
+ Update on range_parted
+   Update on part_a_1_a_10
+   Update on part_a_10_a_20
+   Update on part_b_1_b_10
+   Update on part_c_1_100
+   Update on part_d_1_15
+   Update on part_d_15_20
+   Update on part_b_20_b_30
+   ->  Seq Scan on part_a_1_a_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_a_10_a_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_1_b_10
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_c_1_100
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_1_15
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_d_15_20
+         Filter: (c > '97'::numeric)
+   ->  Seq Scan on part_b_20_b_30
+         Filter: (c > '97'::numeric)
+(22 rows)
+
+-- fail, row movement happens only within the partition subtree.
+UPDATE part_c_100_200 set c = c - 20, d = c WHERE c = 105;
+ERROR:  new row for relation "part_c_100_200" violates partition constraint
+DETAIL:  Failing row contains (105, 85, null, b, 15).
+-- fail, no partition key update, so no attempt to move the tuple,
+-- but "a = 'a'" violates the partition constraint enforced by the root partition
+UPDATE part_b_10_b_20 set a = 'a';
+ERROR:  new row for relation "part_c_1_100" violates partition constraint
+DETAIL:  Failing row contains (null, 1, 96, 12, a).
+-- ok, partition key update, no constraint violation
+UPDATE range_parted set d = d - 10 WHERE d > 10;
+-- ok, no partition key update, no constraint violation
+UPDATE range_parted set e = d;
+-- No row found
+UPDATE part_c_1_100 set c = c + 20 WHERE c = 98;
+-- ok, row movement
+UPDATE part_b_10_b_20 set c = c + 20 returning c, b, a;
+  c  | b  | a 
+-----+----+---
+ 116 | 12 | b
+ 117 | 13 | b
+ 125 | 15 | b
+ 125 | 17 | b
+(4 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d | e 
+----------------+---+----+-----+---+---
+ part_a_10_a_20 | a | 10 | 200 | 1 | 1
+ part_a_1_a_10  | a |  1 |   1 | 1 | 1
+ part_d_1_15    | b | 12 | 116 | 1 | 1
+ part_d_1_15    | b | 13 | 117 | 2 | 2
+ part_d_1_15    | b | 15 | 125 | 6 | 6
+ part_d_1_15    | b | 17 | 125 | 9 | 9
+(6 rows)
+
+-- fail, row movement happens only within the partition subtree.
+UPDATE part_b_10_b_20 set b = b - 6 WHERE c > 116 returning *;
+ERROR:  new row for relation "part_d_1_15" violates partition constraint
+DETAIL:  Failing row contains (2, 117, 2, b, 7).
+-- ok, row movement, with subset of rows moved into different partition.
+UPDATE range_parted set b = b - 6 WHERE c > 116 returning a, b + c;
+ a | ?column? 
+---+----------
+ a |      204
+ b |      124
+ b |      134
+ b |      136
+(4 rows)
+
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_a_1_a_10 | a |  4 | 200 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+(6 rows)
+
+-- Common table needed for multiple test scenarios.
+CREATE TABLE mintab(c1 int);
+INSERT into mintab VALUES (120);
+-- update partition key using updatable view.
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 FROM mintab) WITH CHECK OPTION;
+-- ok
+UPDATE upview set c = 199 WHERE b = 4;
+-- fail, check option violation
+UPDATE upview set c = 120 WHERE b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (a, 4, 120, 1, 1).
+-- fail, row movement with check option violation
+UPDATE upview set a = 'b', b = 15, c = 120 WHERE b = 4;
+ERROR:  new row violates check option for view "upview"
+DETAIL:  Failing row contains (b, 15, 120, 1, 1).
+-- ok, row movement, check option passes
+UPDATE upview set a = 'b', b = 15 WHERE b = 4;
+:show_data;
+   partname    | a | b  |  c  | d | e 
+---------------+---+----+-----+---+---
+ part_a_1_a_10 | a |  1 |   1 | 1 | 1
+ part_b_1_b_10 | b |  7 | 117 | 2 | 2
+ part_b_1_b_10 | b |  9 | 125 | 6 | 6
+ part_d_1_15   | b | 11 | 125 | 9 | 9
+ part_d_1_15   | b | 12 | 116 | 1 | 1
+ part_d_1_15   | b | 15 | 199 | 1 | 1
+(6 rows)
+
+-- cleanup
+DROP VIEW upview;
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+UPDATE range_parted set c = 95 WHERE a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+ range_parted  | a | b  | c  | d  | e 
+---------------+---+----+----+----+---
+ (b,15,95,16,) | b | 15 | 95 | 16 | 
+ (b,17,95,19,) | b | 17 | 95 | 19 | 
+(2 rows)
+
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  95 | 16 | 
+ part_c_1_100   | b | 17 |  95 | 19 | 
+(6 rows)
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+CREATE FUNCTION trans_updatetrigfunc() RETURNS trigger LANGUAGE plpgsql AS
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' ORDER BY a) FROM old_table),
+                 (select string_agg(new_table::text, ', ' ORDER BY a) FROM new_table);
+    return null;
+  end;
+$$;
+CREATE TRIGGER trans_updatetrig
+  AFTER UPDATE ON range_parted REFERENCING OLD TABLE AS old_table NEW TABLE AS new_table
+  FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+UPDATE range_parted set c = (case when c = 96 then 110 else c + 1 end ) WHERE a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,110,1,), (b,13,98,2,), (b,15,106,16,), (b,17,106,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 13 |  98 |  2 | 
+ part_d_15_20   | b | 15 | 106 | 16 | 
+ part_d_15_20   | b | 17 | 106 | 19 | 
+ part_d_1_15    | b | 12 | 110 |  1 | 
+(6 rows)
+
+:init_range_parted;
+-- Enabling OLD TABLE capture for both DELETE and UPDATE statement triggers
+-- should not cause DELETEd rows to be captured twice. The same applies to
+-- INSERT triggers and inserted rows.
+CREATE TRIGGER trans_deletetrig
+  AFTER DELETE ON range_parted REFERENCING OLD TABLE AS old_table
+  FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+CREATE TRIGGER trans_inserttrig
+  AFTER INSERT ON range_parted REFERENCING NEW TABLE AS new_table
+  FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+UPDATE range_parted set c = c + 50 WHERE a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,12,96,1,), (b,13,97,2,), (b,15,105,16,), (b,17,105,19,), new table = (b,12,146,1,), (b,13,147,2,), (b,15,155,16,), (b,17,155,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_15_20   | b | 15 | 155 | 16 | 
+ part_d_15_20   | b | 17 | 155 | 19 | 
+ part_d_1_15    | b | 12 | 146 |  1 | 
+ part_d_1_15    | b | 13 | 147 |  2 | 
+(6 rows)
+
+DROP TRIGGER trans_deletetrig ON range_parted;
+DROP TRIGGER trans_inserttrig ON range_parted;
+-- Don't drop trans_updatetrig yet. It is required below.
+-- Test with transition tuple conversion happening for rows moved into the
+-- new partition. This requires a trigger that references a transition table
+-- (we already have trans_updatetrig). Usually no conversion is needed for
+-- inserted rows, because the original tuple is already compatible with the
+-- desired transition tuple format. But conversion does happen when there is
+-- a BR trigger, because the trigger can change the inserted row. So we need
+-- to install BR triggers on those child partitions into which rows are moved
+-- as part of update-row-movement.
+CREATE FUNCTION func_parted_mod_b() RETURNS trigger AS $$
+BEGIN
+   NEW.b = NEW.b + 1;
+   return NEW;
+END $$ language plpgsql;
+CREATE TRIGGER trig_c1_100 BEFORE UPDATE OR INSERT ON part_c_1_100
+   FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+CREATE TRIGGER trig_d1_15 BEFORE UPDATE OR INSERT ON part_d_1_15
+   FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+CREATE TRIGGER trig_d15_20 BEFORE UPDATE OR INSERT ON part_d_15_20
+   FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+:init_range_parted;
+UPDATE range_parted set c = (case when c = 96 then 110 else c + 1 end) WHERE a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,13,96,1,), (b,14,97,2,), (b,16,105,16,), (b,18,105,19,), new table = (b,15,110,1,), (b,15,98,2,), (b,17,106,16,), (b,19,106,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 15 |  98 |  2 | 
+ part_d_15_20   | b | 17 | 106 | 16 | 
+ part_d_15_20   | b | 19 | 106 | 19 | 
+ part_d_1_15    | b | 15 | 110 |  1 | 
+(6 rows)
+
+:init_range_parted;
+UPDATE range_parted set c = c + 50 WHERE a = 'b' and b > 10 and c >= 96;
+NOTICE:  trigger = trans_updatetrig, old table = (b,13,96,1,), (b,14,97,2,), (b,16,105,16,), (b,18,105,19,), new table = (b,15,146,1,), (b,16,147,2,), (b,17,155,16,), (b,19,155,19,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_d_15_20   | b | 17 | 155 | 16 | 
+ part_d_15_20   | b | 19 | 155 | 19 | 
+ part_d_1_15    | b | 15 | 146 |  1 | 
+ part_d_1_15    | b | 16 | 147 |  2 | 
+(6 rows)
+
+-- Case where per-partition tuple conversion map array is allocated, but the
+-- map is not required for the particular tuple that is routed, thanks to
+-- matching table attributes of the partition and the target table.
+:init_range_parted;
+UPDATE range_parted set b = 15 WHERE b = 1;
+NOTICE:  trigger = trans_updatetrig, old table = (a,1,1,1,), new table = (a,15,1,1,)
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_10_a_20 | a | 15 |   1 |  1 | 
+ part_c_1_100   | b | 13 |  96 |  1 | 
+ part_c_1_100   | b | 14 |  97 |  2 | 
+ part_d_15_20   | b | 16 | 105 | 16 | 
+ part_d_15_20   | b | 18 | 105 | 19 | 
+(6 rows)
+
+DROP TRIGGER trans_updatetrig ON range_parted;
+DROP TRIGGER trig_c1_100 ON part_c_1_100;
+DROP TRIGGER trig_d1_15 ON part_d_1_15;
+DROP TRIGGER trig_d15_20 ON part_d_15_20;
+DROP FUNCTION func_parted_mod_b();
+-- RLS policies with update-row-movement
+-----------------------------------------
+ALTER TABLE range_parted ENABLE ROW LEVEL SECURITY;
+CREATE USER regress_range_parted_user;
+GRANT ALL ON range_parted, mintab TO regress_range_parted_user;
+CREATE POLICY seeall ON range_parted AS PERMISSIVE FOR SELECT USING (true);
+CREATE POLICY policy_range_parted ON range_parted for UPDATE USING (true) WITH CHECK (c % 2 = 0);
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- This should fail with RLS violation error while moving row from
+-- part_a_10_a_20 to part_d_1_15, because we are setting 'c' to an odd number.
+UPDATE range_parted set a = 'b', c = 151 WHERE a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy for table "range_parted"
+RESET SESSION AUTHORIZATION;
+-- Create a trigger on part_d_1_15
+CREATE FUNCTION func_d_1_15() RETURNS trigger AS $$
+BEGIN
+   NEW.c = NEW.c + 1; -- Make even numbers odd, or vice versa
+   return NEW;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_d_1_15 BEFORE INSERT ON part_d_1_15
+   FOR EACH ROW EXECUTE PROCEDURE func_d_1_15();
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- Here, RLS checks should succeed while moving row from part_a_10_a_20 to
+-- part_d_1_15. Even though the UPDATE is setting 'c' to an odd number, the
+-- trigger at the destination partition again makes it an even number.
+UPDATE range_parted set a = 'b', c = 151 WHERE a = 'a' and c = 200;
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- This should fail with RLS violation error. Even though the UPDATE is setting
+-- 'c' to an even number, the trigger at the destination partition again makes
+-- it an odd number.
+UPDATE range_parted set a = 'b', c = 150 WHERE a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy for table "range_parted"
+-- Cleanup
+RESET SESSION AUTHORIZATION;
+DROP TRIGGER trig_d_1_15 ON part_d_1_15;
+DROP FUNCTION func_d_1_15();
+-- Policy expression contains SubPlan
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+CREATE POLICY policy_range_parted_subplan on range_parted AS RESTRICTIVE for UPDATE USING (true)
+    WITH CHECK ((SELECT range_parted.c <= c1 FROM mintab ));
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- fail, mintab has row with c1 = 120
+UPDATE range_parted set a = 'b', c = 122 WHERE a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy "policy_range_parted_subplan" for table "range_parted"
 -- ok
-update range_parted set b = b + 1 where b = 10;
+UPDATE range_parted set a = 'b', c = 120 WHERE a = 'a' and c = 200;
+-- RLS policy expression contains whole row.
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+CREATE POLICY policy_range_parted_wholerow on range_parted AS RESTRICTIVE for UPDATE USING (true)
+   WITH CHECK (range_parted = row('b', 10, 112, 1, NULL)::range_parted);
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- ok, should pass the RLS check
+UPDATE range_parted set a = 'b', c = 112 WHERE a = 'a' and c = 200;
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- fail, the whole row RLS check should fail
+UPDATE range_parted set a = 'b', c = 116 WHERE a = 'a' and c = 200;
+ERROR:  new row violates row-level security policy "policy_range_parted_wholerow" for table "range_parted"
+-- Cleanup
+RESET SESSION AUTHORIZATION;
+DROP POLICY policy_range_parted ON range_parted;
+DROP POLICY policy_range_parted_subplan ON range_parted;
+DROP POLICY policy_range_parted_wholerow ON range_parted;
+REVOKE ALL ON range_parted, mintab FROM regress_range_parted_user;
+DROP USER regress_range_parted_user;
+DROP TABLE mintab;
+-- statement triggers with update row movement
+---------------------------------------------------
+:init_range_parted;
+CREATE FUNCTION trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+CREATE TRIGGER parent_delete_trig
+  AFTER DELETE ON range_parted for each statement execute procedure trigfunc();
+CREATE TRIGGER parent_update_trig
+  AFTER UPDATE ON range_parted for each statement execute procedure trigfunc();
+CREATE TRIGGER parent_insert_trig
+  AFTER INSERT ON range_parted for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_c_1_100
+CREATE TRIGGER c1_delete_trig
+  AFTER DELETE ON part_c_1_100 for each statement execute procedure trigfunc();
+CREATE TRIGGER c1_update_trig
+  AFTER UPDATE ON part_c_1_100 for each statement execute procedure trigfunc();
+CREATE TRIGGER c1_insert_trig
+  AFTER INSERT ON part_c_1_100 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_1_15
+CREATE TRIGGER d1_delete_trig
+  AFTER DELETE ON part_d_1_15 for each statement execute procedure trigfunc();
+CREATE TRIGGER d1_update_trig
+  AFTER UPDATE ON part_d_1_15 for each statement execute procedure trigfunc();
+CREATE TRIGGER d1_insert_trig
+  AFTER INSERT ON part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+CREATE TRIGGER d15_delete_trig
+  AFTER DELETE ON part_d_15_20 for each statement execute procedure trigfunc();
+CREATE TRIGGER d15_update_trig
+  AFTER UPDATE ON part_d_15_20 for each statement execute procedure trigfunc();
+CREATE TRIGGER d15_insert_trig
+  AFTER INSERT ON part_d_15_20 for each statement execute procedure trigfunc();
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or
+-- insert statement triggers should be fired.
+UPDATE range_parted set c = c - 50 WHERE c > 97;
+NOTICE:  trigger = parent_update_trig fired on table range_parted during UPDATE
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 150 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_c_1_100   | b | 15 |  55 | 16 | 
+ part_c_1_100   | b | 17 |  55 | 19 | 
+(6 rows)
+
+DROP TRIGGER parent_delete_trig ON range_parted;
+DROP TRIGGER parent_update_trig ON range_parted;
+DROP TRIGGER parent_insert_trig ON range_parted;
+DROP TRIGGER c1_delete_trig ON part_c_1_100;
+DROP TRIGGER c1_update_trig ON part_c_1_100;
+DROP TRIGGER c1_insert_trig ON part_c_1_100;
+DROP TRIGGER d1_delete_trig ON part_d_1_15;
+DROP TRIGGER d1_update_trig ON part_d_1_15;
+DROP TRIGGER d1_insert_trig ON part_d_1_15;
+DROP TRIGGER d15_delete_trig ON part_d_15_20;
+DROP TRIGGER d15_update_trig ON part_d_15_20;
+DROP TRIGGER d15_insert_trig ON part_d_15_20;
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
-                                  Table "public.part_def"
- Column |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description 
---------+---------+-----------+----------+---------+----------+--------------+-------------
- a      | text    |           |          |         | extended |              | 
- b      | integer |           |          |         | plain    |              | 
+                                       Table "public.part_def"
+ Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description 
+--------+-------------------+-----------+----------+---------+----------+--------------+-------------
+ a      | text              |           |          |         | extended |              | 
+ b      | bigint            |           |          |         | plain    |              | 
+ c      | numeric           |           |          |         | main     |              | 
+ d      | integer           |           |          |         | plain    |              | 
+ e      | character varying |           |          |         | extended |              | 
 Partition of: range_parted DEFAULT
-Partition constraint: (NOT ((a IS NOT NULL) AND (b IS NOT NULL) AND (((a = 'a'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'a'::text) AND (b >= 10) AND (b < 20)) OR ((a = 'b'::text) AND (b >= 1) AND (b < 10)) OR ((a = 'b'::text) AND (b >= 10) AND (b < 20)))))
+Partition constraint: (NOT ((a IS NOT NULL) AND (b IS NOT NULL) AND (((a = 'a'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'a'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '1'::bigint) AND (b < '10'::bigint)) OR ((a = 'b'::text) AND (b >= '10'::bigint) AND (b < '20'::bigint)) OR ((a = 'b'::text) AND (b >= '20'::bigint) AND (b < '30'::bigint)))))
 
 insert into range_parted values ('c', 9);
 -- ok
@@ -235,21 +678,192 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 ERROR:  new row for relation "part_def" violates partition constraint
-DETAIL:  Failing row contains (a, 9).
-create table list_parted (
+DETAIL:  Failing row contains (a, 9, null, null, null).
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from non-default to default partition.
+-- fail, default partition is not under part_a_10_a_20;
+UPDATE part_a_10_a_20 set a = 'ad' WHERE a = 'a';
+ERROR:  new row for relation "part_a_10_a_20" violates partition constraint
+DETAIL:  Failing row contains (ad, 10, 200, 1, null).
+-- ok
+UPDATE range_parted set a = 'ad' WHERE a = 'a';
+UPDATE range_parted set a = 'bd' WHERE a = 'b';
+:show_data;
+ partname | a  | b  |  c  | d  | e 
+----------+----+----+-----+----+---
+ part_def | ad |  1 |   1 |  1 | 
+ part_def | ad | 10 | 200 |  1 | 
+ part_def | bd | 12 |  96 |  1 | 
+ part_def | bd | 13 |  97 |  2 | 
+ part_def | bd | 15 | 105 | 16 | 
+ part_def | bd | 17 | 105 | 19 | 
+ part_def | d  |  9 |     |    | 
+(7 rows)
+
+-- Update row movement from default to non-default partitions.
+-- ok
+UPDATE range_parted set a = 'a' WHERE a = 'ad';
+UPDATE range_parted set a = 'b' WHERE a = 'bd';
+:show_data;
+    partname    | a | b  |  c  | d  | e 
+----------------+---+----+-----+----+---
+ part_a_10_a_20 | a | 10 | 200 |  1 | 
+ part_a_1_a_10  | a |  1 |   1 |  1 | 
+ part_c_1_100   | b | 12 |  96 |  1 | 
+ part_c_1_100   | b | 13 |  97 |  2 | 
+ part_d_15_20   | b | 15 | 105 | 16 | 
+ part_d_15_20   | b | 17 | 105 | 19 | 
+ part_def       | d |  9 |     |    | 
+(7 rows)
+
+-- Cleanup: range_parted no longer needed.
+DROP TABLE range_parted;
+CREATE TABLE list_parted (
 	a text,
 	b int
-) partition by list (a);
-create table list_part1  partition of list_parted for values in ('a', 'b');
-create table list_default partition of list_parted default;
-insert into list_part1 values ('a', 1);
-insert into list_default values ('d', 10);
+) PARTITION BY list (a);
+CREATE TABLE list_part1  PARTITION OF list_parted for VALUES in ('a', 'b');
+CREATE TABLE list_default PARTITION OF list_parted default;
+INSERT into list_part1 VALUES ('a', 1);
+INSERT into list_default VALUES ('d', 10);
 -- fail
-update list_default set a = 'a' where a = 'd';
+UPDATE list_default set a = 'a' WHERE a = 'd';
 ERROR:  new row for relation "list_default" violates partition constraint
 DETAIL:  Failing row contains (a, 10).
 -- ok
-update list_default set a = 'x' where a = 'd';
+UPDATE list_default set a = 'x' WHERE a = 'd';
+DROP TABLE list_parted;
+--------------
+-- Some more update-partition-key test scenarios below. This time use list
+-- partitions.
+--------------
+-- Setup for list partitions
+CREATE TABLE list_parted (a numeric, b int, c int8) PARTITION BY list (a);
+CREATE TABLE sub_parted PARTITION OF list_parted for VALUES in (1) PARTITION BY list (b);
+CREATE TABLE sub_part1(b int, c int8, a numeric);
+ALTER TABLE sub_parted ATTACH PARTITION sub_part1 for VALUES in (1);
+CREATE TABLE sub_part2(b int, c int8, a numeric);
+ALTER TABLE sub_parted ATTACH PARTITION sub_part2 for VALUES in (2);
+CREATE TABLE list_part1(a numeric, b int, c int8);
+ALTER TABLE list_parted ATTACH PARTITION list_part1 for VALUES in (2,3);
+INSERT into list_parted VALUES (2,5,50);
+INSERT into list_parted VALUES (3,6,60);
+INSERT into sub_parted VALUES (1,1,60);
+INSERT into sub_parted VALUES (1,2,10);
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+UPDATE sub_parted set a = 2 WHERE c = 10;
+ERROR:  new row for relation "sub_part2" violates partition constraint
+DETAIL:  Failing row contains (2, 10, 2).
+-- Test update-partition-key, where the unpruned partitions do not have their
+-- partition keys updated.
+SELECT tableoid::regclass::text , * FROM list_parted WHERE a = 2 ORDER BY 1;
+  tableoid  | a | b | c  
+------------+---+---+----
+ list_part1 | 2 | 5 | 50
+(1 row)
+
+UPDATE list_parted set b = c + a WHERE a = 2;
+SELECT tableoid::regclass::text , * FROM list_parted WHERE a = 2 ORDER BY 1;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+(1 row)
+
+-----------
+-- Tests for BR UPDATE triggers changing the partition key.
+-----------
+CREATE FUNCTION func_parted_mod_b() returns trigger as $$
+BEGIN
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 60
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1.
+UPDATE list_parted set c = 70 WHERE b  = 1;
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+DROP TRIGGER parted_mod_b ON sub_part1;
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+CREATE OR REPLACE FUNCTION func_parted_mod_b() returns trigger as $$
+BEGIN
+   raise notice 'Trigger: Got OLD row %, but returning NULL', OLD;
+   return NULL;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_skip_delete before delete on sub_part2
+   for each row execute procedure func_parted_mod_b();
+UPDATE list_parted set b = 1 WHERE c = 70;
+NOTICE:  Trigger: Got OLD row (2,70,1), but returning NULL
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part2  | 1 |  2 | 10
+ sub_part2  | 1 |  2 | 70
+(4 rows)
+
+-- Drop the trigger. Now the row should be moved.
+DROP TRIGGER trig_skip_delete ON sub_part2;
+UPDATE list_parted set b = 1 WHERE c = 70;
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+ sub_part1  | 1 |  1 | 70
+ sub_part2  | 1 |  2 | 10
+(4 rows)
+
+DROP FUNCTION func_parted_mod_b();
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+CREATE TABLE non_parted (id int);
+INSERT into non_parted VALUES (1), (1), (1), (2), (2), (2), (3), (3), (3);
+UPDATE list_parted t1 set a = 2 FROM non_parted t2 WHERE t1.a = t2.id and a = 1;
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+  tableoid  | a | b  | c  
+------------+---+----+----
+ list_part1 | 2 |  1 | 70
+ list_part1 | 2 |  2 | 10
+ list_part1 | 2 | 52 | 50
+ list_part1 | 3 |  6 | 60
+(4 rows)
+
+DROP TABLE non_parted;
+-- Cleanup: list_parted no longer needed.
+DROP TABLE list_parted;
 -- create custom operator class and hash function, for the same reason
 -- explained in alter_table.sql
 create or replace function dummy_hashint4(a int4, seed int8) returns int8 as
@@ -271,14 +885,11 @@ insert into hpart4 values (3, 4);
 update hpart1 set a = 3, b=4 where a = 1;
 ERROR:  new row for relation "hpart1" violates partition constraint
 DETAIL:  Failing row contains (3, 4).
+-- ok, row movement
 update hash_parted set b = b - 1 where b = 1;
-ERROR:  new row for relation "hpart1" violates partition constraint
-DETAIL:  Failing row contains (1, 0).
 -- ok
 update hash_parted set b = b + 8 where b = 1;
 -- cleanup
-drop table range_parted;
-drop table list_parted;
 drop table hash_parted;
 drop operator class custom_opclass using hash;
 drop function dummy_hashint4(a int4, seed int8);
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index 0c70d64..7f49656 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -107,25 +107,338 @@ INSERT INTO upsert_test VALUES (1, 'Bat') ON CONFLICT(a)
 DROP TABLE update_test;
 DROP TABLE upsert_test;
 
--- update to a partition should check partition bound constraint for the new tuple
-create table range_parted (
+
+---------------------------
+-- UPDATE with row movement
+---------------------------
+
+-- When a partitioned table receives an UPDATE to the partitioned key and the
+-- new values no longer meet the partition's bound, the row must be moved to
+-- the correct partition for the new partition key (if one exists). We must
+-- also ensure that updatable views on partitioned tables properly enforce any
+-- WITH CHECK OPTION that is defined. The situation with triggers in this case
+-- also requires thorough testing as partition key updates causing row
+-- movement convert UPDATEs into DELETE+INSERT.
+
+CREATE TABLE range_parted (
 	a text,
-	b int
-) partition by range (a, b);
-create table part_a_1_a_10 partition of range_parted for values from ('a', 1) to ('a', 10);
-create table part_a_10_a_20 partition of range_parted for values from ('a', 10) to ('a', 20);
-create table part_b_1_b_10 partition of range_parted for values from ('b', 1) to ('b', 10);
-create table part_b_10_b_20 partition of range_parted for values from ('b', 10) to ('b', 20);
-insert into part_a_1_a_10 values ('a', 1);
-insert into part_b_10_b_20 values ('b', 10);
+	b bigint,
+	c numeric,
+	d int,
+	e varchar
+) PARTITION BY RANGE (a, b);
 
--- fail
-update part_a_1_a_10 set a = 'b' where a = 'a';
-update range_parted set b = b - 1 where b = 10;
+-- Create partitions intentionally in descending bound order, so as to test
+-- that update-row-movement works with the leaf partitions not in bound order.
+CREATE TABLE part_b_20_b_30 (e varchar, c numeric, a text, b bigint, d int);
+ALTER TABLE range_parted ATTACH PARTITION part_b_20_b_30 FOR VALUES FROM ('b', 20) TO ('b', 30);
+CREATE TABLE part_b_10_b_20 (e varchar, c numeric, a text, b bigint, d int) PARTITION BY RANGE (c);
+CREATE TABLE part_b_1_b_10 PARTITION OF range_parted FOR VALUES FROM ('b', 1) TO ('b', 10);
+ALTER TABLE range_parted ATTACH PARTITION part_b_10_b_20 FOR VALUES FROM ('b', 10) TO ('b', 20);
+CREATE TABLE part_a_10_a_20 PARTITION OF range_parted FOR VALUES FROM ('a', 10) TO ('a', 20);
+CREATE TABLE part_a_1_a_10 PARTITION OF range_parted FOR VALUES FROM ('a', 1) TO ('a', 10);
+
+-- Check that partition-key UPDATE works sanely on a partitioned table that
+-- does not have any child partitions.
+UPDATE part_b_10_b_20 set b = b - 6;
+
+-- Create some more partitions following the above pattern of descending bound
+-- order, but let's make the situation a bit more complex by having the
+-- attribute numbers of the columns vary from their parent partition.
+CREATE TABLE part_c_100_200 (e varchar, c numeric, a text, b bigint, d int) PARTITION BY range (abs(d));
+ALTER TABLE part_c_100_200 DROP COLUMN e, DROP COLUMN c, DROP COLUMN a;
+ALTER TABLE part_c_100_200 ADD COLUMN c numeric, ADD COLUMN e varchar, ADD COLUMN a text;
+ALTER TABLE part_c_100_200 DROP COLUMN b;
+ALTER TABLE part_c_100_200 ADD COLUMN b bigint;
+CREATE TABLE part_d_1_15 PARTITION OF part_c_100_200 FOR VALUES FROM (1) TO (15);
+CREATE TABLE part_d_15_20 PARTITION OF part_c_100_200 FOR VALUES FROM (15) TO (20);
+
+ALTER TABLE part_b_10_b_20 ATTACH PARTITION part_c_100_200 FOR VALUES FROM (100) TO (200);
+
+CREATE TABLE part_c_1_100 (e varchar, d int, c numeric, b bigint, a text);
+ALTER TABLE part_b_10_b_20 ATTACH PARTITION part_c_1_100 FOR VALUES FROM (1) TO (100);
+
+\set init_range_parted 'truncate range_parted; insert into range_parted VALUES (''a'', 1, 1, 1), (''a'', 10, 200, 1), (''b'', 12, 96, 1), (''b'', 13, 97, 2), (''b'', 15, 105, 16), (''b'', 17, 105, 19)'
+\set show_data 'select tableoid::regclass::text COLLATE "C" partname, * from range_parted ORDER BY 1, 2, 3, 4, 5, 6'
+:init_range_parted;
+:show_data;
+
+-- The order of subplans should be in bound order
+EXPLAIN (costs off) UPDATE range_parted set c = c - 50 WHERE c > 97;
+
+-- fail, row movement happens only within the partition subtree.
+UPDATE part_c_100_200 set c = c - 20, d = c WHERE c = 105;
+-- fail, no partition key update, so no attempt to move tuple,
+-- but "a = 'a'" violates the partition constraint enforced by the root partition
+UPDATE part_b_10_b_20 set a = 'a';
+-- ok, partition key update, no constraint violation
+UPDATE range_parted set d = d - 10 WHERE d > 10;
+-- ok, no partition key update, no constraint violation
+UPDATE range_parted set e = d;
+-- No row found
+UPDATE part_c_1_100 set c = c + 20 WHERE c = 98;
+-- ok, row movement
+UPDATE part_b_10_b_20 set c = c + 20 returning c, b, a;
+:show_data;
+
+-- fail, row movement happens only within the partition subtree.
+UPDATE part_b_10_b_20 set b = b - 6 WHERE c > 116 returning *;
+-- ok, row movement, with subset of rows moved into different partition.
+UPDATE range_parted set b = b - 6 WHERE c > 116 returning a, b + c;
+
+:show_data;
+
+-- Common table needed for multiple test scenarios.
+CREATE TABLE mintab(c1 int);
+INSERT into mintab VALUES (120);
+
+-- update partition key using updatable view.
+CREATE VIEW upview AS SELECT * FROM range_parted WHERE (select c > c1 FROM mintab) WITH CHECK OPTION;
+-- ok
+UPDATE upview set c = 199 WHERE b = 4;
+-- fail, check option violation
+UPDATE upview set c = 120 WHERE b = 4;
+-- fail, row movement with check option violation
+UPDATE upview set a = 'b', b = 15, c = 120 WHERE b = 4;
+-- ok, row movement, check option passes
+UPDATE upview set a = 'b', b = 15 WHERE b = 4;
+
+:show_data;
+
+-- cleanup
+DROP VIEW upview;
+
+-- RETURNING having whole-row vars.
+----------------------------------
+:init_range_parted;
+UPDATE range_parted set c = 95 WHERE a = 'b' and b > 10 and c > 100 returning (range_parted)  , *;
+:show_data;
+
+
+-- Transition tables with update row movement
+---------------------------------------------
+:init_range_parted;
+
+CREATE FUNCTION trans_updatetrigfunc() RETURNS trigger LANGUAGE plpgsql AS
+$$
+  begin
+    raise notice 'trigger = %, old table = %, new table = %',
+                 TG_NAME,
+                 (select string_agg(old_table::text, ', ' ORDER BY a) FROM old_table),
+                 (select string_agg(new_table::text, ', ' ORDER BY a) FROM new_table);
+    return null;
+  end;
+$$;
+
+CREATE TRIGGER trans_updatetrig
+  AFTER UPDATE ON range_parted REFERENCING OLD TABLE AS old_table NEW TABLE AS new_table
+  FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+
+UPDATE range_parted set c = (case when c = 96 then 110 else c + 1 end ) WHERE a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+
+-- Enabling OLD TABLE capture for both DELETE as well as UPDATE stmt triggers
+-- should not cause DELETEd rows to be captured twice. The same applies to
+-- INSERT triggers and inserted rows.
+CREATE TRIGGER trans_deletetrig
+  AFTER DELETE ON range_parted REFERENCING OLD TABLE AS old_table
+  FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+CREATE TRIGGER trans_inserttrig
+  AFTER INSERT ON range_parted REFERENCING NEW TABLE AS new_table
+  FOR EACH STATEMENT EXECUTE PROCEDURE trans_updatetrigfunc();
+UPDATE range_parted set c = c + 50 WHERE a = 'b' and b > 10 and c >= 96;
+:show_data;
+DROP TRIGGER trans_deletetrig ON range_parted;
+DROP TRIGGER trans_inserttrig ON range_parted;
+-- Don't drop trans_updatetrig yet. It is required below.
+
+-- Test with transition tuple conversion happening for rows moved into the
+-- new partition. This requires a trigger that references a transition table
+-- (we already have trans_updatetrig). For inserted rows, the conversion is
+-- usually not needed, because the original tuple is already compatible with
+-- the desired transition tuple format. But conversion happens when there is a
+-- BR trigger, because the trigger can change the inserted row. So we need to
+-- install BR triggers on those child partitions into which rows are moved as
+-- part of update-row-movement.
+CREATE FUNCTION func_parted_mod_b() RETURNS trigger AS $$
+BEGIN
+   NEW.b = NEW.b + 1;
+   return NEW;
+END $$ language plpgsql;
+CREATE TRIGGER trig_c1_100 BEFORE UPDATE OR INSERT ON part_c_1_100
+   FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+CREATE TRIGGER trig_d1_15 BEFORE UPDATE OR INSERT ON part_d_1_15
+   FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+CREATE TRIGGER trig_d15_20 BEFORE UPDATE OR INSERT ON part_d_15_20
+   FOR EACH ROW EXECUTE PROCEDURE func_parted_mod_b();
+:init_range_parted;
+UPDATE range_parted set c = (case when c = 96 then 110 else c + 1 end) WHERE a = 'b' and b > 10 and c >= 96;
+:show_data;
+:init_range_parted;
+UPDATE range_parted set c = c + 50 WHERE a = 'b' and b > 10 and c >= 96;
+:show_data;
+
+-- Case where per-partition tuple conversion map array is allocated, but the
+-- map is not required for the particular tuple that is routed, thanks to
+-- matching table attributes of the partition and the target table.
+:init_range_parted;
+UPDATE range_parted set b = 15 WHERE b = 1;
+:show_data;
+
+DROP TRIGGER trans_updatetrig ON range_parted;
+DROP TRIGGER trig_c1_100 ON part_c_1_100;
+DROP TRIGGER trig_d1_15 ON part_d_1_15;
+DROP TRIGGER trig_d15_20 ON part_d_15_20;
+DROP FUNCTION func_parted_mod_b();
+
+-- RLS policies with update-row-movement
+-----------------------------------------
+
+ALTER TABLE range_parted ENABLE ROW LEVEL SECURITY;
+CREATE USER regress_range_parted_user;
+GRANT ALL ON range_parted, mintab TO regress_range_parted_user;
+CREATE POLICY seeall ON range_parted AS PERMISSIVE FOR SELECT USING (true);
+CREATE POLICY policy_range_parted ON range_parted for UPDATE USING (true) WITH CHECK (c % 2 = 0);
+
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- This should fail with RLS violation error while moving row from
+-- part_a_10_a_20 to part_d_1_15, because we are setting 'c' to an odd number.
+UPDATE range_parted set a = 'b', c = 151 WHERE a = 'a' and c = 200;
+
+RESET SESSION AUTHORIZATION;
+-- Create a trigger on part_d_1_15
+CREATE FUNCTION func_d_1_15() RETURNS trigger AS $$
+BEGIN
+   NEW.c = NEW.c + 1; -- Make even numbers odd, or vice versa
+   return NEW;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_d_1_15 BEFORE INSERT ON part_d_1_15
+   FOR EACH ROW EXECUTE PROCEDURE func_d_1_15();
+
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+
+-- Here, RLS checks should succeed while moving row from part_a_10_a_20 to
+-- part_d_1_15. Even though the UPDATE is setting 'c' to an odd number, the
+-- trigger at the destination partition again makes it an even number.
+UPDATE range_parted set a = 'b', c = 151 WHERE a = 'a' and c = 200;
+
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- This should fail with RLS violation error. Even though the UPDATE is setting
+-- 'c' to an even number, the trigger at the destination partition again makes
+-- it an odd number.
+UPDATE range_parted set a = 'b', c = 150 WHERE a = 'a' and c = 200;
+
+-- Cleanup
+RESET SESSION AUTHORIZATION;
+DROP TRIGGER trig_d_1_15 ON part_d_1_15;
+DROP FUNCTION func_d_1_15();
+
+-- Policy expression contains SubPlan
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+CREATE POLICY policy_range_parted_subplan on range_parted AS RESTRICTIVE for UPDATE USING (true)
+    WITH CHECK ((SELECT range_parted.c <= c1 FROM mintab ));
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- fail, mintab has row with c1 = 120
+UPDATE range_parted set a = 'b', c = 122 WHERE a = 'a' and c = 200;
 -- ok
-update range_parted set b = b + 1 where b = 10;
+UPDATE range_parted set a = 'b', c = 120 WHERE a = 'a' and c = 200;
+
+-- RLS policy expression contains whole row.
+
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+CREATE POLICY policy_range_parted_wholerow on range_parted AS RESTRICTIVE for UPDATE USING (true)
+   WITH CHECK (range_parted = row('b', 10, 112, 1, NULL)::range_parted);
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- ok, should pass the RLS check
+UPDATE range_parted set a = 'b', c = 112 WHERE a = 'a' and c = 200;
+RESET SESSION AUTHORIZATION;
+:init_range_parted;
+SET SESSION AUTHORIZATION regress_range_parted_user;
+-- fail, the whole row RLS check should fail
+UPDATE range_parted set a = 'b', c = 116 WHERE a = 'a' and c = 200;
+
+-- Cleanup
+RESET SESSION AUTHORIZATION;
+DROP POLICY policy_range_parted ON range_parted;
+DROP POLICY policy_range_parted_subplan ON range_parted;
+DROP POLICY policy_range_parted_wholerow ON range_parted;
+REVOKE ALL ON range_parted, mintab FROM regress_range_parted_user;
+DROP USER regress_range_parted_user;
+DROP TABLE mintab;
+
+
+-- statement triggers with update row movement
+---------------------------------------------------
+
+:init_range_parted;
+
+CREATE FUNCTION trigfunc() returns trigger language plpgsql as
+$$
+  begin
+    raise notice 'trigger = % fired on table % during %',
+                 TG_NAME, TG_TABLE_NAME, TG_OP;
+    return null;
+  end;
+$$;
+-- Triggers on root partition
+CREATE TRIGGER parent_delete_trig
+  AFTER DELETE ON range_parted for each statement execute procedure trigfunc();
+CREATE TRIGGER parent_update_trig
+  AFTER UPDATE ON range_parted for each statement execute procedure trigfunc();
+CREATE TRIGGER parent_insert_trig
+  AFTER INSERT ON range_parted for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_c_1_100
+CREATE TRIGGER c1_delete_trig
+  AFTER DELETE ON part_c_1_100 for each statement execute procedure trigfunc();
+CREATE TRIGGER c1_update_trig
+  AFTER UPDATE ON part_c_1_100 for each statement execute procedure trigfunc();
+CREATE TRIGGER c1_insert_trig
+  AFTER INSERT ON part_c_1_100 for each statement execute procedure trigfunc();
+
+-- Triggers on leaf partition part_d_1_15
+CREATE TRIGGER d1_delete_trig
+  AFTER DELETE ON part_d_1_15 for each statement execute procedure trigfunc();
+CREATE TRIGGER d1_update_trig
+  AFTER UPDATE ON part_d_1_15 for each statement execute procedure trigfunc();
+CREATE TRIGGER d1_insert_trig
+  AFTER INSERT ON part_d_1_15 for each statement execute procedure trigfunc();
+-- Triggers on leaf partition part_d_15_20
+CREATE TRIGGER d15_delete_trig
+  AFTER DELETE ON part_d_15_20 for each statement execute procedure trigfunc();
+CREATE TRIGGER d15_update_trig
+  AFTER UPDATE ON part_d_15_20 for each statement execute procedure trigfunc();
+CREATE TRIGGER d15_insert_trig
+  AFTER INSERT ON part_d_15_20 for each statement execute procedure trigfunc();
+
+-- Move all rows from part_c_100_200 to part_c_1_100. None of the delete or
+-- insert statement triggers should be fired.
+UPDATE range_parted set c = c - 50 WHERE c > 97;
+:show_data;
+
+DROP TRIGGER parent_delete_trig ON range_parted;
+DROP TRIGGER parent_update_trig ON range_parted;
+DROP TRIGGER parent_insert_trig ON range_parted;
+DROP TRIGGER c1_delete_trig ON part_c_1_100;
+DROP TRIGGER c1_update_trig ON part_c_1_100;
+DROP TRIGGER c1_insert_trig ON part_c_1_100;
+DROP TRIGGER d1_delete_trig ON part_d_1_15;
+DROP TRIGGER d1_update_trig ON part_d_1_15;
+DROP TRIGGER d1_insert_trig ON part_d_1_15;
+DROP TRIGGER d15_delete_trig ON part_d_15_20;
+DROP TRIGGER d15_update_trig ON part_d_15_20;
+DROP TRIGGER d15_insert_trig ON part_d_15_20;
+
 
 -- Creating default partition for range
+:init_range_parted;
 create table part_def partition of range_parted default;
 \d+ part_def
 insert into range_parted values ('c', 9);
@@ -134,19 +447,121 @@ update part_def set a = 'd' where a = 'c';
 -- fail
 update part_def set a = 'a' where a = 'd';
 
-create table list_parted (
+:show_data;
+
+-- Update row movement from non-default to default partition.
+-- fail, the default partition is not under part_a_10_a_20
+UPDATE part_a_10_a_20 set a = 'ad' WHERE a = 'a';
+-- ok
+UPDATE range_parted set a = 'ad' WHERE a = 'a';
+UPDATE range_parted set a = 'bd' WHERE a = 'b';
+:show_data;
+-- Update row movement from default to non-default partitions.
+-- ok
+UPDATE range_parted set a = 'a' WHERE a = 'ad';
+UPDATE range_parted set a = 'b' WHERE a = 'bd';
+:show_data;
+
+-- Cleanup: range_parted no longer needed.
+DROP TABLE range_parted;
+
+CREATE TABLE list_parted (
 	a text,
 	b int
-) partition by list (a);
-create table list_part1  partition of list_parted for values in ('a', 'b');
-create table list_default partition of list_parted default;
-insert into list_part1 values ('a', 1);
-insert into list_default values ('d', 10);
+) PARTITION BY list (a);
+CREATE TABLE list_part1  PARTITION OF list_parted for VALUES in ('a', 'b');
+CREATE TABLE list_default PARTITION OF list_parted default;
+INSERT into list_part1 VALUES ('a', 1);
+INSERT into list_default VALUES ('d', 10);
 
 -- fail
-update list_default set a = 'a' where a = 'd';
+UPDATE list_default set a = 'a' WHERE a = 'd';
 -- ok
-update list_default set a = 'x' where a = 'd';
+UPDATE list_default set a = 'x' WHERE a = 'd';
+
+DROP TABLE list_parted;
+
+--------------
+-- Some more update-partition-key test scenarios below. This time use list
+-- partitions.
+--------------
+
+-- Setup for list partitions
+CREATE TABLE list_parted (a numeric, b int, c int8) PARTITION BY list (a);
+CREATE TABLE sub_parted PARTITION OF list_parted for VALUES in (1) PARTITION BY list (b);
+
+CREATE TABLE sub_part1(b int, c int8, a numeric);
+ALTER TABLE sub_parted ATTACH PARTITION sub_part1 for VALUES in (1);
+CREATE TABLE sub_part2(b int, c int8, a numeric);
+ALTER TABLE sub_parted ATTACH PARTITION sub_part2 for VALUES in (2);
+
+CREATE TABLE list_part1(a numeric, b int, c int8);
+ALTER TABLE list_parted ATTACH PARTITION list_part1 for VALUES in (2,3);
+
+INSERT into list_parted VALUES (2,5,50);
+INSERT into list_parted VALUES (3,6,60);
+INSERT into sub_parted VALUES (1,1,60);
+INSERT into sub_parted VALUES (1,2,10);
+
+-- Test partition constraint violation when intermediate ancestor is used and
+-- constraint is inherited from upper root.
+UPDATE sub_parted set a = 2 WHERE c = 10;
+
+-- Test update-partition-key, where the unpruned partitions do not have their
+-- partition keys updated.
+SELECT tableoid::regclass::text , * FROM list_parted WHERE a = 2 ORDER BY 1;
+UPDATE list_parted set b = c + a WHERE a = 2;
+SELECT tableoid::regclass::text , * FROM list_parted WHERE a = 2 ORDER BY 1;
+
+
+-----------
+-- Tests for BR UPDATE triggers changing the partition key.
+-----------
+CREATE FUNCTION func_parted_mod_b() returns trigger as $$
+BEGIN
+   NEW.b = 2; -- This is changing partition key column.
+   return NEW;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER parted_mod_b before update on sub_part1
+   for each row execute procedure func_parted_mod_b();
+
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+
+-- This should do the tuple routing even though there is no explicit
+-- partition-key update, because there is a trigger on sub_part1.
+UPDATE list_parted set c = 70 WHERE b = 1;
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+
+DROP TRIGGER parted_mod_b ON sub_part1;
+
+-- If BR DELETE trigger prevented DELETE from happening, we should also skip
+-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
+CREATE OR REPLACE FUNCTION func_parted_mod_b() returns trigger as $$
+BEGIN
+   raise notice 'Trigger: Got OLD row %, but returning NULL', OLD;
+   return NULL;
+END $$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_skip_delete before delete on sub_part2
+   for each row execute procedure func_parted_mod_b();
+UPDATE list_parted set b = 1 WHERE c = 70;
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+-- Drop the trigger. Now the row should be moved.
+DROP TRIGGER trig_skip_delete ON sub_part2;
+UPDATE list_parted set b = 1 WHERE c = 70;
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+DROP FUNCTION func_parted_mod_b();
+
+-- UPDATE partition-key with FROM clause. If join produces multiple output
+-- rows for the same row to be modified, we should tuple-route the row only once.
+-- There should not be any rows inserted.
+CREATE TABLE non_parted (id int);
+INSERT into non_parted VALUES (1), (1), (1), (2), (2), (2), (3), (3), (3);
+UPDATE list_parted t1 set a = 2 FROM non_parted t2 WHERE t1.a = t2.id and a = 1;
+SELECT tableoid::regclass::text , * FROM list_parted ORDER BY 1, 2, 3, 4;
+DROP TABLE non_parted;
+
+-- Cleanup: list_parted no longer needed.
+DROP TABLE list_parted;
 
 -- create custom operator class and hash function, for the same reason
 -- explained in alter_table.sql
@@ -169,13 +584,12 @@ insert into hpart4 values (3, 4);
 
 -- fail
 update hpart1 set a = 3, b=4 where a = 1;
+-- ok, row movement
 update hash_parted set b = b - 1 where b = 1;
 -- ok
 update hash_parted set b = b + 8 where b = 1;
 
 -- cleanup
-drop table range_parted;
-drop table list_parted;
 drop table hash_parted;
 drop operator class custom_opclass using hash;
 drop function dummy_hashint4(a int4, seed int8);
#247Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#246)
Re: [HACKERS] UPDATE of partition key

On Fri, Jan 19, 2018 at 4:37 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Attached rebased patch.

Committed with a bunch of mostly-cosmetic revisions. I removed the
macro you added, which has a multiple evaluation hazard, and just put
that logic back into the function. I don't think it's likely to
matter for performance, and this way is safer. I removed an inline
keyword from another static function as well; better to let the
compiler decide what to do. I rearranged a few things to shorten some
long lines, too. Aside from that I think all of the changes I made
were to comments and documentation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#248Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#247)
Re: [HACKERS] UPDATE of partition key

Robert Haas <robertmhaas@gmail.com> writes:

Committed with a bunch of mostly-cosmetic revisions.

Buildfarm member skink has been unhappy since this patch went in.
Running the regression tests under valgrind easily reproduces the
failure. Now, I might be wrong about which of the patches committed
on Friday caused the unhappiness, but the valgrind backtrace sure
looks like it's to do with partition routing:

==00:00:05:49.683 17549== Invalid read of size 4
==00:00:05:49.683 17549== at 0x62A8BA: ExecCleanupTupleRouting (execPartition.c:483)
==00:00:05:49.683 17549== by 0x6483AA: ExecEndModifyTable (nodeModifyTable.c:2682)
==00:00:05:49.683 17549== by 0x627139: standard_ExecutorEnd (execMain.c:1604)
==00:00:05:49.683 17549== by 0x7780AF: ProcessQuery (pquery.c:206)
==00:00:05:49.683 17549== by 0x7782E4: PortalRunMulti (pquery.c:1286)
==00:00:05:49.683 17549== by 0x778AAF: PortalRun (pquery.c:799)
==00:00:05:49.683 17549== by 0x774E4C: exec_simple_query (postgres.c:1120)
==00:00:05:49.683 17549== by 0x776C17: PostgresMain (postgres.c:4143)
==00:00:05:49.683 17549== by 0x6FA419: PostmasterMain (postmaster.c:4412)
==00:00:05:49.683 17549== by 0x66E51F: main (main.c:228)
==00:00:05:49.683 17549== Address 0xe25e298 is 2,088 bytes inside a block of size 32,768 alloc'd
==00:00:05:49.683 17549== at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
==00:00:05:49.683 17549== by 0x89EB15: AllocSetAlloc (aset.c:945)
==00:00:05:49.683 17549== by 0x8A7577: palloc (mcxt.c:848)
==00:00:05:49.683 17549== by 0x671969: new_list (list.c:68)
==00:00:05:49.683 17549== by 0x672859: lappend_oid (list.c:169)
==00:00:05:49.683 17549== by 0x55330E: find_inheritance_children (pg_inherits.c:144)
==00:00:05:49.683 17549== by 0x553447: find_all_inheritors (pg_inherits.c:203)
==00:00:05:49.683 17549== by 0x62AC76: ExecSetupPartitionTupleRouting (execPartition.c:68)
==00:00:05:49.683 17549== by 0x64949D: ExecInitModifyTable (nodeModifyTable.c:2232)
==00:00:05:49.683 17549== by 0x62BBE8: ExecInitNode (execProcnode.c:174)
==00:00:05:49.683 17549== by 0x627B53: standard_ExecutorStart (execMain.c:1043)
==00:00:05:49.683 17549== by 0x778046: ProcessQuery (pquery.c:156)

(This is my local result, but skink's log looks about the same.)

regards, tom lane

#249Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#248)
Re: [HACKERS] UPDATE of partition key

On Sun, Jan 21, 2018 at 1:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Committed with a bunch of mostly-cosmetic revisions.

Buildfarm member skink has been unhappy since this patch went in.
Running the regression tests under valgrind easily reproduces the
failure. Now, I might be wrong about which of the patches committed
on Friday caused the unhappiness, but the valgrind backtrace sure
looks like it's to do with partition routing:

Yeah, that must be the fault of this patch. We assign to
proute->subplan_partition_offsets[update_rri_index] from
update_rri_index = 0 .. num_update_rri, and there's an Assert() at the
bottom of this function that checks this, so probably this is indexing
off the end of the array. I bet the issue happens when we find all of
the UPDATE result rels while there are still partitions left; then,
subplan_index will be equal to the length of the
proute->subplan_partition_offsets array and we'll be indexing just off
the end.
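In code form, the overrun described here looks roughly like the following. This is a simplified, hypothetical mirror of the ExecCleanupTupleRouting loop (field and function names condensed), shown with the bounds check that prevents reading past the end of the offsets array:

```c
#include <stddef.h>

/*
 * Simplified, hypothetical mirror of the ExecCleanupTupleRouting loop:
 * walk every leaf partition index and advance subplan_index whenever
 * the current partition also appears in the (ascending) offsets array
 * of partitions that double as UPDATE subplan result rels.
 */
static int
match_subplan_partitions(const int *offsets, int noffsets, int npartitions)
{
	int			subplan_index = 0;
	int			i;

	for (i = 0; i < npartitions; i++)
	{
		/*
		 * The "subplan_index < noffsets" check is the crucial guard:
		 * without it, once every subplan has been matched while
		 * partitions remain, offsets[subplan_index] reads one element
		 * past the end of the array -- an invalid read of the kind
		 * valgrind reported.
		 */
		if (offsets != NULL &&
			subplan_index < noffsets &&
			offsets[subplan_index] == i)
			subplan_index++;
	}
	return subplan_index;
}
```

With offsets = {0, 2} over five partitions, both subplans are matched by i = 2, and the guard keeps iterations i = 3 and i = 4 from indexing off the end.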

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#250Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#249)
1 attachment(s)
Re: [HACKERS] UPDATE of partition key

On 22 January 2018 at 02:40, Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Jan 21, 2018 at 1:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Committed with a bunch of mostly-cosmetic revisions.

Buildfarm member skink has been unhappy since this patch went in.
Running the regression tests under valgrind easily reproduces the
failure. Now, I might be wrong about which of the patches committed
on Friday caused the unhappiness, but the valgrind backtrace sure
looks like it's to do with partition routing:

Yeah, that must be the fault of this patch. We assign to
proute->subplan_partition_offsets[update_rri_index] from
update_rri_index = 0 .. num_update_rri, and there's an Assert() at the
bottom of this function that checks this, so probably this is indexing
off the end of the array. I bet the issue happens when we find all of
the UPDATE result rels while there are still partitions left; then,
subplan_index will be equal to the length of the
proute->subplan_partition_offsets array and we'll be indexing just off
the end.

Yes, right, that's what is happening. It is not happening on an Assert
though (there is no assert in that function). It is happening when we
try to access the array here:

if (proute->subplan_partition_offsets &&
proute->subplan_partition_offsets[subplan_index] == i)

Attached is a fix, where I have introduced another field
PartitionTupleRouting.num_subplan_partition_offsets, so that above,
we can add another condition (subplan_index <
proute->num_subplan_partition_offsets) in order to stop accessing the
array once we are done with all the offset array elements.

Ran the update.sql test with valgrind enabled on my laptop, and the
valgrind output now does not show errors.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

fix_valgrind_issue.patchapplication/octet-stream; name=fix_valgrind_issue.patchDownload
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 89b7bb4..106a96d 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -87,6 +87,7 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 		num_update_rri = list_length(node->plans);
 		proute->subplan_partition_offsets =
 			palloc(num_update_rri * sizeof(int));
+		proute->num_subplan_partition_offsets = num_update_rri;
 
 		/*
 		 * We need an additional tuple slot for storing transient tuples that
@@ -481,6 +482,7 @@ ExecCleanupTupleRouting(PartitionTupleRouting *proute)
 		 * result rels are present in the UPDATE subplans.
 		 */
 		if (proute->subplan_partition_offsets &&
+			subplan_index < proute->num_subplan_partition_offsets &&
 			proute->subplan_partition_offsets[subplan_index] == i)
 		{
 			subplan_index++;
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 6c2f8d4..828e1b0 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1812,7 +1812,8 @@ tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
 		 * If subplan-indexed array is NULL, things should have been arranged
 		 * to convert the subplan index to partition index.
 		 */
-		Assert(proute && proute->subplan_partition_offsets != NULL);
+		Assert(proute && proute->subplan_partition_offsets != NULL &&
+			   whichplan < proute->num_subplan_partition_offsets);
 
 		leaf_index = proute->subplan_partition_offsets[whichplan];
 
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 18e0812..77c39e3c 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -80,6 +80,7 @@ typedef struct PartitionDispatchData *PartitionDispatch;
  * subplan_partition_offsets	Integer array ordered by UPDATE subplans. Each
  *								element of this array has the index into the
  *								corresponding partition in partitions array.
+ * num_subplan_partition_offsets  Length of 'subplan_partition_offsets' array
  * partition_tuple_slot			TupleTableSlot to be used to manipulate any
  *								given leaf partition's rowtype after that
  *								partition is chosen for insertion by
@@ -96,6 +97,7 @@ typedef struct PartitionTupleRouting
 	TupleConversionMap **child_parent_tupconv_maps;
 	bool	   *child_parent_map_not_required;
 	int		   *subplan_partition_offsets;
+	int		   num_subplan_partition_offsets;
 	TupleTableSlot *partition_tuple_slot;
 	TupleTableSlot *root_tuple_slot;
 } PartitionTupleRouting;
#251Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#250)
Re: [HACKERS] UPDATE of partition key

On Mon, Jan 22, 2018 at 2:44 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Yes, right, that's what is happening. It is not happening on an Assert
though (there is no assert in that function). It is happening when we
try to access the array here :

if (proute->subplan_partition_offsets &&
proute->subplan_partition_offsets[subplan_index] == i)

Attached is a fix, where I have introduced another field
PartitionTupleRouting.num_subplan_partition_offsets, so that above,
we can add another condition (subplan_index <
proute->num_subplan_partition_offsets) in order to stop accessing the
array once we are done with all the offset array elements.

Ran the update.sql test with valgrind enabled on my laptop, and the
valgrind output now does not show errors.

Tom, do you want to double-check that this fixes it for you?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#252Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#251)
Re: [HACKERS] UPDATE of partition key

Robert Haas <robertmhaas@gmail.com> writes:

Tom, do you want to double-check that this fixes it for you?

I can confirm that a valgrind run succeeded for me with the patch
in place.

regards, tom lane

#253Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#252)
Re: [HACKERS] UPDATE of partition key

On Mon, Jan 22, 2018 at 9:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Tom, do you want to double-check that this fixes it for you?

I can confirm that a valgrind run succeeded for me with the patch
in place.

Committed. Sorry for the delay.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#254Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Robert Haas (#253)
Re: [HACKERS] UPDATE of partition key

On Thu, Jan 25, 2018 at 10:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jan 22, 2018 at 9:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Tom, do you want to double-check that this fixes it for you?

I can confirm that a valgrind run succeeded for me with the patch
in place.

Committed. Sorry for the delay.

FYI I'm planning to look into adding a valgrind check to the
commitfest CI thing I run so we can catch these earlier without
committer involvement. It's super slow because of all those pesky
regression tests so I'll probably need to improve the scheduling logic
a bit to make it useful first (prioritising new patches or something,
since otherwise it'll take up to multiple days to get around to
valgrind-testing any given patch...).

--
Thomas Munro
http://www.enterprisedb.com