Restrict concurrent update/delete with UPDATE of partition key
Hi All,
Attaching a POC patch that throws an error in the case of a concurrent update
to a tuple that has already been deleted due to an UPDATE of the partition key[1].
In a normal update, the new tuple is linked to the old one via its ctid, forming
a chain of tuple versions, but an UPDATE of the partition key[1] moves the tuple
from one partition to another partition table, which breaks that chain.
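(For context, a simplified sketch of how that chain is maintained by the
existing heap code -- illustrative only, not necessarily the exact lines in
heapam.c:)

    /* heap_update(): the old tuple version points at the new one */
    oldtup.t_data->t_ctid = heaptup->t_self;

    /* heap_delete(): a plain DELETE ends the chain by pointing the tuple at itself */
    tp.t_data->t_ctid = tp.t_self;

A ctid can only address a location within the same relation, so a tuple that
has been moved to a different partition cannot be reached through such a chain.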
Consider the following[2] concurrent update case, where one session tries to
update a row that is locked by another session whose concurrent update moves
the tuple to another partition.
create table foo (a int2, b text) partition by list (a);
create table foo1 partition of foo for values IN (1);
create table foo2 partition of foo for values IN (2);
insert into foo values(1, 'ABC');
----------- session 1 -----------
postgres=# begin;
BEGIN
postgres=# update foo set a=2 where a=1;
UPDATE 1
----------- session 2 -----------
postgres=# update foo set b='EFG' where a=1;
….. wait state ……
----------- session 1 -----------
postgres=# commit;
COMMIT
----------- session 2 -----------
UPDATE 0
This UPDATE 0 is the problematic part; see Greg Stark's message[3], which
explains why we need an error here.
To throw an error we need an indicator that the targeted row has already been
moved to another partition, and for that, setting ctid.ip_blkid to
InvalidBlockNumber looks like a viable option for now.
The attached patch incorporates the following logic suggested by Amit
Kapila[4]:
"We can pass a flag say row_moved (or require_row_movement) to heap_delete
which will in turn set InvalidBlockId in ctid instead of setting it to
self. Then the
ExecUpdate needs to check for the same and return an error when heap_update is
not successful (result != HeapTupleMayBeUpdated)."
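To make the idea concrete, here is a minimal sketch of the two sides of that
logic -- the marking in heap_delete() and the caller-side check -- using the
same macros as the attached patch (see the patch below for the exact
placement):

    /* heap_delete(): when the DELETE is part of a tuple movement to another
     * partition (row_moved), mark the old tuple's ctid with an invalid block
     * number instead of the usual self-pointer. */
    if (row_moved)
        BlockIdSet(&((tp.t_data->t_ctid).ip_blkid), InvalidBlockNumber);

    /* Caller (e.g. ExecUpdate/ExecDelete): on HeapTupleUpdated, detect the
     * marker in the returned HeapUpdateFailureData and report an error
     * instead of silently doing nothing. */
    if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("tuple to be updated was already moved to another partition due to concurrent update")));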
1] /messages/by-id/CAJ3gD9do9o2ccQ7j7+tSgiE1REY65XRiMb=yJO3u3QhyP8EEPQ@mail.gmail.com
2] With /messages/by-id/CAJ3gD9fzD4jBpv+zXqZYnW=h9JXUFG9E7NGdA9gR_JJbOj=Q5A@mail.gmail.com patch applied.
3] /messages/by-id/CAM-w4HPis7rbnwi+oXjnouqMSRAC5DeVcMdxEXTMfDos1kaYPQ@mail.gmail.com
4] /messages/by-id/CAA4eK1KEZQ+CyXbBzfn1jFHoEfa_OemDLhLyy7xfD1QUZLo1DQ@mail.gmail.com
Regards,
Amul
Attachments:
0001-POC-Invalidate-ip_blkid-v1.patch (application/octet-stream)
From db6277763ce360a1f6883891b40953f285b315e9 Mon Sep 17 00:00:00 2001
From: Amul Sul <sulamul@gmail.com>
Date: Wed, 27 Sep 2017 15:56:03 +0530
Subject: [PATCH] POC Invalidate ip_blkid v1
---
src/backend/access/heap/heapam.c | 11 +++++++++--
src/backend/commands/trigger.c | 5 +++++
src/backend/executor/execMain.c | 4 ++++
src/backend/executor/nodeLockRows.c | 5 +++++
src/backend/executor/nodeModifyTable.c | 21 +++++++++++++++++----
src/include/access/heapam.h | 2 +-
6 files changed, 41 insertions(+), 7 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d03f544d26..494feb2dc7 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3027,7 +3027,7 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
HTSU_Result
heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd)
+ HeapUpdateFailureData *hufd, bool row_moved)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
@@ -3295,6 +3295,13 @@ l1:
/* Make sure there is no forward chain link in t_ctid */
tp.t_data->t_ctid = tp.t_self;
+ /*
+ * Sets a block identifier to the InvalidBlockNumber to indicate such an
+ * update being moved tuple to an another partition.
+ */
+ if (row_moved)
+ BlockIdSet(&((tp.t_data->t_ctid).ip_blkid), InvalidBlockNumber);
+
MarkBufferDirty(buffer);
/*
@@ -3420,7 +3427,7 @@ simple_heap_delete(Relation relation, ItemPointer tid)
result = heap_delete(relation, tid,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &hufd);
+ &hufd, false);
switch (result)
{
case HeapTupleSelfUpdated:
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 873156b1f3..c1e6bca795 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -3071,6 +3071,11 @@ ltrmark:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to an another partition due to concurrent update")));
+
if (!ItemPointerEquals(&hufd.ctid, &tuple.t_self))
{
/* it was updated, so look at the updated version */
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index d48da8e603..e6de8d4e30 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2703,6 +2703,10 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to an another partition due to concurrent update")));
/* Should not encounter speculative tuple on recheck */
Assert(!HeapTupleHeaderIsSpeculative(tuple.t_data));
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 93895600a5..1b388e6fbd 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -218,6 +218,11 @@ lnext:
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to an another partition due to concurrent update")));
+
if (ItemPointerEquals(&hufd.ctid, &tuple.t_self))
{
/* Tuple was deleted, so don't return it */
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 03bf01c808..60e8e8f48a 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -760,7 +760,8 @@ ExecDelete(ModifyTableState *mtstate,
EState *estate,
bool *delete_skipped,
bool process_returning,
- bool canSetTag)
+ bool canSetTag,
+ bool row_moved)
{
ResultRelInfo *resultRelInfo;
Relation resultRelationDesc;
@@ -851,7 +852,8 @@ ldelete:;
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
- &hufd);
+ &hufd,
+ row_moved);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -897,6 +899,11 @@ ldelete:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to an another partition due to concurrent update")));
+
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
TupleTableSlot *epqslot;
@@ -1182,7 +1189,7 @@ lreplace:;
* from INSERT.
*/
ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
- &delete_skipped, false, false);
+ &delete_skipped, false, false, true);
/*
* For some reason if DELETE didn't happen (for e.g. trigger
@@ -1293,6 +1300,11 @@ lreplace:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to an another partition due to concurrent update")));
+
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
TupleTableSlot *epqslot;
@@ -1312,6 +1324,7 @@ lreplace:;
goto lreplace;
}
}
+
/* tuple already deleted; nothing to do */
return NULL;
@@ -2051,7 +2064,7 @@ ExecModifyTable(PlanState *pstate)
case CMD_DELETE:
slot = ExecDelete(node, tupleid, oldtuple, planSlot,
&node->mt_epqstate, estate,
- NULL, true, node->canSetTag);
+ NULL, true, node->canSetTag, false);
break;
default:
elog(ERROR, "unknown operation");
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4e41024e92..76f56cfc94 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -156,7 +156,7 @@ extern void heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
CommandId cid, int options, BulkInsertState bistate);
extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd);
+ HeapUpdateFailureData *hufd, bool row_moved);
extern void heap_finish_speculative(Relation relation, HeapTuple tuple);
extern void heap_abort_speculative(Relation relation, HeapTuple tuple);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
--
2.14.1
On Wed, Sep 27, 2017 at 7:07 AM, amul sul <sulamul@gmail.com> wrote:
Attaching POC patch that throws an error in the case of a concurrent update
to an already deleted tuple due to UPDATE of partition key[1]. In a normal update new tuple is linked to the old one via ctid forming
a chain of tuple versions but UPDATE of partition key[1] move tuple
from one partition to an another partition table which breaks that chain.
This patch needs a rebase. It has one whitespace-only hunk that
should possibly be excluded.
I think the basic idea of this is sound. Either you or Amit need to
document the behavior in the user-facing documentation, and it needs
tests that hit every single one of the new error checks you've added
(obviously, the tests will only work in combination with Amit's
patch). The isolation tester should be sufficient to write such tests.
It needs some more extensive comments as well. The fact that we're
assigning a meaning to ip_blkid -> InvalidBlockNumber is a big deal,
and should at least be documented in itemptr.h in the comments for the
ItemPointerData structure.
I suspect ExecOnConflictUpdate needs an update for the
HeapTupleUpdated case similar to what you've done elsewhere.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Nov 11, 2017 at 1:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Sep 27, 2017 at 7:07 AM, amul sul <sulamul@gmail.com> wrote:
Attaching POC patch that throws an error in the case of a concurrent update
to an already deleted tuple due to UPDATE of partition key[1]. In a normal update new tuple is linked to the old one via ctid forming
a chain of tuple versions but UPDATE of partition key[1] move tuple
from one partition to an another partition table which breaks that chain. This patch needs a rebase. It has one whitespace-only hunk that
should possibly be excluded.
Thanks for looking at the patch.
I think the basic idea of this is sound. Either you or Amit need to
document the behavior in the user-facing documentation, and it needs
tests that hit every single one of the new error checks you've added
(obviously, the tests will only work in combination with Amit's
patch). The isolation should be sufficient to write such tests. It needs some more extensive comments as well. The fact that we're
assigning a meaning to ip_blkid -> InvalidBlockNumber is a big deal,
and should at least be documented in itemptr.h in the comments for the
ItemPointerData structure. I suspect ExecOnConflictUpdate needs an update for the
HeapTupleUpdated case similar to what you've done elsewhere.
The UPDATE of partition key v25 patch[1] has documented this concurrent scenario;
I need to rework that documentation paragraph to include this behaviour, and will
discuss it with Amit.
The attached 0001 patch includes the error check in 8 functions; out of those 8 I
was able to build isolation tests for 4 of them -- ExecUpdate, ExecDelete,
GetTupleForTrigger & ExecLockRows.
The remaining ones are EvalPlanQualFetch, ExecOnConflictUpdate,
RelationFindReplTupleByIndex & RelationFindReplTupleSeq. Note that the check in
RelationFindReplTupleByIndex & RelationFindReplTupleSeq will emit a LOG, not an
ERROR.
Any help/suggestions on building tests for these functions would be much appreciated.
1] /messages/by-id/CAJ3gD9f4Um99sOJmVSEPj783VWw5GbXkgU3OWcYZJv=ipjTkAw@mail.gmail.com
Regards,
Amul
Attachments:
0001-POC-Invalidate-ip_blkid-v2.patch (application/octet-stream)
From 5c0b8b16c6def437bdee17ce17bef69f6bfc46d5 Mon Sep 17 00:00:00 2001
From: Amul Sul <sulamul@gmail.com>
Date: Thu, 23 Nov 2017 16:29:48 +0530
Subject: [PATCH 1/2] POC Invalidate ip_blkid v2
v2: Updated w.r.t Robert review comments[2]
- Updated couple of comment of heap_delete argument and ItemPointerData
- Added same concurrent update error logic in ExecOnConflictUpdate,
RelationFindReplTupleByIndex and RelationFindReplTupleSeq
v1: Initial version -- as per Amit Kapila's suggestions[1]
- When tuple is being moved to another partition then ip_blkid in the
tuple header mark to InvalidBlockNumber.
-------------
References:
-------------
1] https://postgr.es/m/CAA4eK1KEZQ%2BCyXbBzfn1jFHoEfa_OemDLhLyy7xfD1QUZLo1DQ%40mail.gmail.com
2] https://postgr.es/m/CA%2BTgmoYY98AEjh7RDtuzaLC--_0smCozXRu6bFmZTaX5Ne%3DB5Q%40mail.gmail.com
---
src/backend/access/heap/heapam.c | 13 +++++++++++--
src/backend/commands/trigger.c | 5 +++++
src/backend/executor/execMain.c | 4 ++++
src/backend/executor/execReplication.c | 8 ++++++++
src/backend/executor/nodeLockRows.c | 5 +++++
src/backend/executor/nodeModifyTable.c | 25 +++++++++++++++++++++----
src/include/access/heapam.h | 2 +-
src/include/storage/itemptr.h | 4 +++-
8 files changed, 58 insertions(+), 8 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 3acef279f4..0363e21408 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3014,6 +3014,8 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
* crosscheck - if not InvalidSnapshot, also check tuple against this
* wait - true if should wait for any conflicting update to commit/abort
* hufd - output parameter, filled in failure cases (see below)
+ * row_moved - true iff the tuple is being moved to another partition
+ * table due to an update of partition key. Otherwise, false.
*
* Normal, successful return value is HeapTupleMayBeUpdated, which
* actually means we did delete it. Failure return codes are
@@ -3029,7 +3031,7 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
HTSU_Result
heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd)
+ HeapUpdateFailureData *hufd, bool row_moved)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
@@ -3297,6 +3299,13 @@ l1:
/* Make sure there is no forward chain link in t_ctid */
tp.t_data->t_ctid = tp.t_self;
+ /*
+ * Sets a block identifier to the InvalidBlockNumber to indicate such an
+ * update being moved tuple to an another partition.
+ */
+ if (row_moved)
+ BlockIdSet(&((tp.t_data->t_ctid).ip_blkid), InvalidBlockNumber);
+
MarkBufferDirty(buffer);
/*
@@ -3422,7 +3431,7 @@ simple_heap_delete(Relation relation, ItemPointer tid)
result = heap_delete(relation, tid,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &hufd);
+ &hufd, false);
switch (result)
{
case HeapTupleSelfUpdated:
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 73ec87218b..7cd63e6d46 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -3071,6 +3071,11 @@ ltrmark:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to an another partition due to concurrent update")));
+
if (!ItemPointerEquals(&hufd.ctid, &tuple.t_self))
{
/* it was updated, so look at the updated version */
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 5ec92d5d01..e9dec76d38 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2709,6 +2709,10 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to an another partition due to concurrent update")));
/* Should not encounter speculative tuple on recheck */
Assert(!HeapTupleHeaderIsSpeculative(tuple.t_data));
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index e11f7cb9b2..6847898a34 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -194,6 +194,10 @@ retry:
ereport(LOG,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("concurrent update, retrying")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(LOG,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to an another partition due to concurrent update, retrying")));
goto retry;
case HeapTupleInvisible:
elog(ERROR, "attempted to lock invisible tuple");
@@ -352,6 +356,10 @@ retry:
ereport(LOG,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("concurrent update, retrying")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(LOG,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to an another partition due to concurrent update, retrying")));
goto retry;
case HeapTupleInvisible:
elog(ERROR, "attempted to lock invisible tuple");
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 93895600a5..1b388e6fbd 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -218,6 +218,11 @@ lnext:
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to an another partition due to concurrent update")));
+
if (ItemPointerEquals(&hufd.ctid, &tuple.t_self))
{
/* Tuple was deleted, so don't return it */
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index a0d8259663..4153fb0eea 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -771,7 +771,8 @@ ExecDelete(ModifyTableState *mtstate,
EState *estate,
bool *tuple_deleted,
bool process_returning,
- bool canSetTag)
+ bool canSetTag,
+ bool row_moved)
{
ResultRelInfo *resultRelInfo;
Relation resultRelationDesc;
@@ -864,7 +865,8 @@ ldelete:;
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
- &hufd);
+ &hufd,
+ row_moved);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -910,6 +912,11 @@ ldelete:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to an another partition due to concurrent update")));
+
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
TupleTableSlot *epqslot;
@@ -1194,7 +1201,7 @@ lreplace:;
* from INSERT.
*/
ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
- &tuple_deleted, false, false);
+ &tuple_deleted, false, false, true);
/*
* For some reason if DELETE didn't happen (e.g. trigger prevented
@@ -1311,6 +1318,11 @@ lreplace:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to an another partition due to concurrent update")));
+
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
TupleTableSlot *epqslot;
@@ -1330,6 +1342,7 @@ lreplace:;
goto lreplace;
}
}
+
/* tuple already deleted; nothing to do */
return NULL;
@@ -1480,6 +1493,10 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to an another partition due to concurrent update")));
/*
* Tell caller to try again from the very start.
@@ -2068,7 +2085,7 @@ ExecModifyTable(PlanState *pstate)
case CMD_DELETE:
slot = ExecDelete(node, tupleid, oldtuple, planSlot,
&node->mt_epqstate, estate,
- NULL, true, node->canSetTag);
+ NULL, true, node->canSetTag, false);
break;
default:
elog(ERROR, "unknown operation");
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4e41024e92..76f56cfc94 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -156,7 +156,7 @@ extern void heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
CommandId cid, int options, BulkInsertState bistate);
extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd);
+ HeapUpdateFailureData *hufd, bool row_moved);
extern void heap_finish_speculative(Relation relation, HeapTuple tuple);
extern void heap_abort_speculative(Relation relation, HeapTuple tuple);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
diff --git a/src/include/storage/itemptr.h b/src/include/storage/itemptr.h
index 8f8e22444a..2cd02ab811 100644
--- a/src/include/storage/itemptr.h
+++ b/src/include/storage/itemptr.h
@@ -23,7 +23,9 @@
* This is a pointer to an item within a disk page of a known file
* (for example, a cross-link from an index to its parent table).
* blkid tells us which block, posid tells us which entry in the linp
- * (ItemIdData) array we want.
+ * (ItemIdData) array we want. blkid is marked InvalidBlockNumber when
+ * a tuple is moved to another partition relation due to an update of
+ * partition key.
*
* Note: because there is an item pointer in each tuple header and index
* tuple header on disk, it's very important not to waste space with
--
2.14.1
0002-isolation-tests-v1.patch (application/octet-stream)
From a50723ddec935dc3182e641aa7fe84e78fb2d857 Mon Sep 17 00:00:00 2001
From: Amul Sul <sulamul@gmail.com>
Date: Thu, 23 Nov 2017 16:30:03 +0530
Subject: [PATCH 2/2] isolation tests v1
v1:
Added isolation tests to hit an error in the following functions:
1. ExecUpdate -> specs/partition-key-update-1
2. ExecDelete -> specs/partition-key-update-1
3. GetTupleForTrigger -> specs/partition-key-update-2
4. ExecLockRows -> specs/partition-key-update-3
TODOs:
Tests for the following function yet to add.
1. EvalPlanQualFetch
2. ExecOnConflictUpdate
3. RelationFindReplTupleByIndex
4. RelationFindReplTupleSeq
---
.../isolation/expected/partition-key-update-1.out | 35 +++++++++++++++++++
.../isolation/expected/partition-key-update-2.out | 18 ++++++++++
.../isolation/expected/partition-key-update-3.out | 8 +++++
src/test/isolation/isolation_schedule | 3 ++
.../isolation/specs/partition-key-update-1.spec | 37 ++++++++++++++++++++
.../isolation/specs/partition-key-update-2.spec | 39 ++++++++++++++++++++++
.../isolation/specs/partition-key-update-3.spec | 30 +++++++++++++++++
7 files changed, 170 insertions(+)
create mode 100644 src/test/isolation/expected/partition-key-update-1.out
create mode 100644 src/test/isolation/expected/partition-key-update-2.out
create mode 100644 src/test/isolation/expected/partition-key-update-3.out
create mode 100644 src/test/isolation/specs/partition-key-update-1.spec
create mode 100644 src/test/isolation/specs/partition-key-update-2.spec
create mode 100644 src/test/isolation/specs/partition-key-update-3.spec
diff --git a/src/test/isolation/expected/partition-key-update-1.out b/src/test/isolation/expected/partition-key-update-1.out
new file mode 100644
index 0000000000..27820ea900
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-1.out
@@ -0,0 +1,35 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u s1c s2u
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+step s2u: UPDATE foo SET b='EFG' WHERE a=1;
+
+starting permutation: s1u s2u s1c
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s2u: UPDATE foo SET b='EFG' WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2u: <... completed>
+error in steps s1c s2u: ERROR: tuple to be updated was already moved to an another partition due to concurrent update
+
+starting permutation: s2u s1u s1c
+step s2u: UPDATE foo SET b='EFG' WHERE a=1;
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+
+starting permutation: s1u s1c s2d
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+step s2d: DELETE FROM foo WHERE a=1;
+
+starting permutation: s1u s2d s1c
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s2d: DELETE FROM foo WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2d: <... completed>
+error in steps s1c s2d: ERROR: tuple to be updated was already moved to an another partition due to concurrent update
+
+starting permutation: s2d s1u s1c
+step s2d: DELETE FROM foo WHERE a=1;
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-2.out b/src/test/isolation/expected/partition-key-update-2.out
new file mode 100644
index 0000000000..afe9415ea4
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-2.out
@@ -0,0 +1,18 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u s1c s2u
+step s1u: UPDATE foo SET b='EFG' WHERE a=1;
+step s1c: COMMIT;
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1;
+
+starting permutation: s1u s2u s1c
+step s1u: UPDATE foo SET b='EFG' WHERE a=1;
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2u: <... completed>
+error in steps s1c s2u: ERROR: tuple to be updated was already moved to an another partition due to concurrent update
+
+starting permutation: s2u s1u s1c
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1;
+step s1u: UPDATE foo SET b='EFG' WHERE a=1;
+step s1c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-3.out b/src/test/isolation/expected/partition-key-update-3.out
new file mode 100644
index 0000000000..63714a0cf6
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-3.out
@@ -0,0 +1,8 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u3 s2i s1c
+step s1u3: UPDATE foo_r SET a=11 WHERE a=7 AND b = 'ABC';
+step s2i: INSERT INTO bar VALUES(7); <waiting ...>
+step s1c: COMMIT;
+step s2i: <... completed>
+error in steps s1c s2i: ERROR: tuple to be locked was already moved to an another partition due to concurrent update
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 32c965b2a0..e9a94996b4 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -62,3 +62,6 @@ test: sequence-ddl
test: async-notify
test: vacuum-reltuples
test: timeouts
+test: partition-key-update-1
+test: partition-key-update-2
+test: partition-key-update-3
diff --git a/src/test/isolation/specs/partition-key-update-1.spec b/src/test/isolation/specs/partition-key-update-1.spec
new file mode 100644
index 0000000000..db76c9a9b5
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-1.spec
@@ -0,0 +1,37 @@
+# Concurrency error from ExecUpdate and ExecDelete.
+
+# Throw an error to indicate that the targeted row has been already moved to
+# another partition in the case of concurrency where a session trying to
+# update/delete a row that's locked for a concurrent update by the another
+# session cause tuple movement to the another partition due update of partition
+# key.
+
+setup
+{
+ CREATE TABLE foo (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'ABC');
+}
+
+teardown
+{
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN; }
+step "s1u" { UPDATE foo SET a=2 WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2u" { UPDATE foo SET b='EFG' WHERE a=1; }
+step "s2d" { DELETE FROM foo WHERE a=1; }
+
+permutation "s1u" "s1c" "s2u"
+permutation "s1u" "s2u" "s1c"
+permutation "s2u" "s1u" "s1c"
+
+permutation "s1u" "s1c" "s2d"
+permutation "s1u" "s2d" "s1c"
+permutation "s2d" "s1u" "s1c"
diff --git a/src/test/isolation/specs/partition-key-update-2.spec b/src/test/isolation/specs/partition-key-update-2.spec
new file mode 100644
index 0000000000..b09e76ce21
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-2.spec
@@ -0,0 +1,39 @@
+# Concurrency error from GetTupleForTrigger
+
+# Like partition-key-update-1.spec, throw an error where a session trying to
+# update a row that has been moved to another partition due to a concurrent
+# update by other seesion.
+
+setup
+{
+ CREATE TABLE foo (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'ABC');
+ CREATE FUNCTION func_foo_mod_a() RETURNS TRIGGER AS $$
+ BEGIN
+ NEW.a = 2; -- This is changing partition key column.
+ RETURN NEW;
+ END $$ LANGUAGE PLPGSQL;
+ CREATE TRIGGER foo_mod_a BEFORE UPDATE ON foo1
+ FOR EACH ROW EXECUTE PROCEDURE func_foo_mod_a();
+}
+
+teardown
+{
+ DROP TRIGGER foo_mod_a ON foo1;
+ DROP FUNCTION func_foo_mod_a();
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN; }
+step "s1u" { UPDATE foo SET b='EFG' WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2u" { UPDATE foo SET b='XYZ' WHERE a=1; }
+
+permutation "s1u" "s1c" "s2u"
+permutation "s1u" "s2u" "s1c"
+permutation "s2u" "s1u" "s1c"
diff --git a/src/test/isolation/specs/partition-key-update-3.spec b/src/test/isolation/specs/partition-key-update-3.spec
new file mode 100644
index 0000000000..c1f547d9ba
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-3.spec
@@ -0,0 +1,30 @@
+# Concurrency error from ExecLockRows
+
+# Like partition-key-update-1.spec, throw an error where a session trying to
+# lock a row that has been moved to another partition due to a concurrent
+# update by other seesion.
+
+setup
+{
+ CREATE TABLE foo_r (a int, b text) PARTITION BY RANGE(a);
+ CREATE TABLE foo_r1 PARTITION OF foo_r FOR VALUES FROM (1) TO (10);
+ CREATE TABLE foo_r2 PARTITION OF foo_r FOR VALUES FROM (10) TO (20);
+ INSERT INTO foo_r VALUES(7, 'ABC');
+ CREATE UNIQUE INDEX foo_r1_a_unique ON foo_r1 (a);
+ CREATE TABLE bar (a int REFERENCES foo_r1(a));
+}
+
+teardown
+{
+ DROP TABLE bar, foo_r;
+}
+
+session "s1"
+setup { BEGIN; }
+step "s1u3" { UPDATE foo_r SET a=11 WHERE a=7 AND b = 'ABC'; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2i" { INSERT INTO bar VALUES(7); }
+
+permutation "s1u3" "s2i" "s1c"
--
2.14.1
On Thu, Nov 23, 2017 at 6:48 AM, amul sul <sulamul@gmail.com> wrote:
And remaining are EvalPlanQualFetch, ExecOnConflictUpdate,
RelationFindReplTupleByIndex & RelationFindReplTupleSeq. Note that check in
RelationFindReplTupleByIndex & RelationFindReplTupleSeq will have LOG not an
ERROR.
The first one is going to come up when you have, for example, two
concurrent updates targeting the same row, and the second one when you
have an ON CONFLICT UPDATE clause. I guess the latter two are
probably related to logical replication, and maybe not easy to test
via an automated regression test.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
A typo in all the messages the patch adds:
"to an another" -> "to another"
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Nov 23, 2017 at 5:18 PM, amul sul <sulamul@gmail.com> wrote:
On Sat, Nov 11, 2017 at 1:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Sep 27, 2017 at 7:07 AM, amul sul <sulamul@gmail.com> wrote:
[...]
Few comments:
1.
@@ -1480,6 +1493,10 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to an another
partition due to concurrent update")));
Why do you think we need this check in the OnConflictUpdate path? I
think we don't need it here because we are going to relinquish this version
of the tuple and will start again and might fetch some other row
version. Also, we don't support Insert .. On Conflict Update with
partitioned tables, see [1], which is also an indication that at the
very least we don't need it now.
2.
@@ -2709,6 +2709,10 @@ EvalPlanQualFetch(EState *estate, Relation
relation, int lockmode,
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to an another
partition due to concurrent update")));
..
..
+++ b/src/backend/executor/nodeLockRows.c
@@ -218,6 +218,11 @@ lnext:
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to an another partition
due to concurrent update")));
+
At some places after heap_lock_tuple the error message says "tuple to
be updated .." and other places it says "tuple to be locked ..". Can
we use the same message consistently? I think it would be better to
use the second one.
3.
}
+
/* tuple already deleted; nothing to do */
return NULL;
Spurious whitespace.
4. There is no need to use *POC* in the name of the patch. I think
this is no longer a POC patch.
[1]: /messages/by-id/7ff1e8ec-dc39-96b1-7f47-ff5965dceeac@lab.ntt.co.jp
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Nov 25, 2017 at 11:39 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Nov 23, 2017 at 5:18 PM, amul sul <sulamul@gmail.com> wrote:
On Sat, Nov 11, 2017 at 1:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Sep 27, 2017 at 7:07 AM, amul sul <sulamul@gmail.com> wrote:
[...]
Few comments:
Thanks for looking at the patch, please find my comments inline:
1.
[...]
Why do you think we need this check in the OnConflictUpdate path? I
think we don't need it here because we are going to relinquish this version
of the tuple and will start again and might fetch some other row
version. Also, we don't support Insert .. On Conflict Update with
partitioned tables, see [1], which is also an indication that at the
very least we don't need it now.
Agreed, even though this case is never going to arise anytime soon,
shouldn't we have a check for an invalid block id? IMHO, we should have
this check and either an error report or an assert, thoughts?
2.
[...]
At some places after heap_lock_tuple the error message says "tuple to
be updated .." and other places it says "tuple to be locked ..". Can
we use the same message consistently? I think it would be better to
use the second one.
Okay, will use "tuple to be locked"
3.
}
+
/* tuple already deleted; nothing to do */
return NULL;
Spurious whitespace.
Sorry about that, will fix this.
4. There is no need to use *POC* in the name of the patch. I think
this is no more a POC patch.
Understood.
Regards,
Amul
On Fri, Nov 24, 2017 at 9:37 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
A typo in all the messages the patch adds:
"to an another" -> "to another"
Thanks for looking into the patch, will fix in the next version.
Regards,
Amul
On Tue, Nov 28, 2017 at 5:58 PM, amul sul <sulamul@gmail.com> wrote:
On Sat, Nov 25, 2017 at 11:39 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Nov 23, 2017 at 5:18 PM, amul sul <sulamul@gmail.com> wrote:
On Sat, Nov 11, 2017 at 1:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Sep 27, 2017 at 7:07 AM, amul sul <sulamul@gmail.com> wrote:
[...]
Agreed, even though this case is never going to arise anytime soon,
shouldn't we have a check for an invalid block id? IMHO, we should have
this check and either an error report or an assert, thoughts?
I feel adding code which can't be hit (even if it is error handling)
is not a good idea. I think having an Assert should be okay, but
please write comments to explain the reason for adding an Assert.
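For example, a minimal sketch of what such an Assert might look like (an
illustration of the idea only, not necessarily the exact code that goes into
the next patch version):

    /*
     * INSERT .. ON CONFLICT DO UPDATE is not yet supported on partitioned
     * tables, so we should never find here a tuple that was moved to another
     * partition; assert that its ctid block number is still valid.
     */
    Assert(BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))));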
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Nov 29, 2017 at 7:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Nov 28, 2017 at 5:58 PM, amul sul <sulamul@gmail.com> wrote:
On Sat, Nov 25, 2017 at 11:39 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Nov 23, 2017 at 5:18 PM, amul sul <sulamul@gmail.com> wrote:
On Sat, Nov 11, 2017 at 1:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Sep 27, 2017 at 7:07 AM, amul sul <sulamul@gmail.com> wrote:
[...]
I feel adding code which can't be hit (even if it is error handling)
is not a good idea. I think having an Assert should be okay, but
please write comments to explain the reason for adding an Assert.
Agreed; updated in the attached patch. Patch 0001 also includes your
previous review comment[1] and the typo correction suggested by Alvaro[2].
Patch 0002 is still missing tests for the EvalPlanQualFetch() function. I think we
could skip those because the direct/indirect callers of EvalPlanQualFetch() --
GetTupleForTrigger, ExecDelete, ExecUpdate & ExecLockRows -- get the required test
coverage in the attached patch.
1] /messages/by-id/CAA4eK1LQS6TmsGaEwR9HgF-9TZTHxrdAELuX6wOZBDbbjOfDjQ@mail.gmail.com
2] /messages/by-id/20171124160756.eyljpmpfzwd6jmnr@alvherre.pgsql
Regards,
Amul
Attachments:
0002-isolation-tests-v2.patch (application/octet-stream)
From 60977e7ff464a6abf2b1ba549fe8fe1c7ad5337a Mon Sep 17 00:00:00 2001
From: Amul Sul <sulamul@gmail.com>
Date: Wed, 29 Nov 2017 15:20:31 +0530
Subject: [PATCH 2/2] isolation tests v2
v2:
- Error message changed.
- Can't add isolation test[1] for
RelationFindReplTupleByIndex & RelationFindReplTupleSeq
- In ExecOnConflictUpdate, the error report is converted to assert
check.
v1:
Added isolation tests to hit an error in the following functions:
1. ExecUpdate -> specs/partition-key-update-1
2. ExecDelete -> specs/partition-key-update-1
3. GetTupleForTrigger -> specs/partition-key-update-2
4. ExecLockRows -> specs/partition-key-update-3
------------
TODOs:
------------
Tests for the following function yet to add.
1. EvalPlanQualFetch
------------
References:
------------
1] https://postgr.es/m/CA+TgmoYsMRo2PHFTGUFifv4ZSCZ9LNJASbOyb=9it2=UA4j4vw@mail.gmail.com
---
.../isolation/expected/partition-key-update-1.out | 35 +++++++++++++++++++
.../isolation/expected/partition-key-update-2.out | 18 ++++++++++
.../isolation/expected/partition-key-update-3.out | 8 +++++
src/test/isolation/isolation_schedule | 3 ++
.../isolation/specs/partition-key-update-1.spec | 37 ++++++++++++++++++++
.../isolation/specs/partition-key-update-2.spec | 39 ++++++++++++++++++++++
.../isolation/specs/partition-key-update-3.spec | 30 +++++++++++++++++
7 files changed, 170 insertions(+)
create mode 100644 src/test/isolation/expected/partition-key-update-1.out
create mode 100644 src/test/isolation/expected/partition-key-update-2.out
create mode 100644 src/test/isolation/expected/partition-key-update-3.out
create mode 100644 src/test/isolation/specs/partition-key-update-1.spec
create mode 100644 src/test/isolation/specs/partition-key-update-2.spec
create mode 100644 src/test/isolation/specs/partition-key-update-3.spec
diff --git a/src/test/isolation/expected/partition-key-update-1.out b/src/test/isolation/expected/partition-key-update-1.out
new file mode 100644
index 0000000000..c33960a0d2
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-1.out
@@ -0,0 +1,35 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u s1c s2u
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+step s2u: UPDATE foo SET b='EFG' WHERE a=1;
+
+starting permutation: s1u s2u s1c
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s2u: UPDATE foo SET b='EFG' WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2u: <... completed>
+error in steps s1c s2u: ERROR: tuple to be locked was already moved to another partition due to concurrent update
+
+starting permutation: s2u s1u s1c
+step s2u: UPDATE foo SET b='EFG' WHERE a=1;
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+
+starting permutation: s1u s1c s2d
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+step s2d: DELETE FROM foo WHERE a=1;
+
+starting permutation: s1u s2d s1c
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s2d: DELETE FROM foo WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2d: <... completed>
+error in steps s1c s2d: ERROR: tuple to be locked was already moved to another partition due to concurrent update
+
+starting permutation: s2d s1u s1c
+step s2d: DELETE FROM foo WHERE a=1;
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-2.out b/src/test/isolation/expected/partition-key-update-2.out
new file mode 100644
index 0000000000..195ec4cedf
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-2.out
@@ -0,0 +1,18 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u s1c s2u
+step s1u: UPDATE foo SET b='EFG' WHERE a=1;
+step s1c: COMMIT;
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1;
+
+starting permutation: s1u s2u s1c
+step s1u: UPDATE foo SET b='EFG' WHERE a=1;
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2u: <... completed>
+error in steps s1c s2u: ERROR: tuple to be locked was already moved to another partition due to concurrent update
+
+starting permutation: s2u s1u s1c
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1;
+step s1u: UPDATE foo SET b='EFG' WHERE a=1;
+step s1c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-3.out b/src/test/isolation/expected/partition-key-update-3.out
new file mode 100644
index 0000000000..1922bdce46
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-3.out
@@ -0,0 +1,8 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u3 s2i s1c
+step s1u3: UPDATE foo_r SET a=11 WHERE a=7 AND b = 'ABC';
+step s2i: INSERT INTO bar VALUES(7); <waiting ...>
+step s1c: COMMIT;
+step s2i: <... completed>
+error in steps s1c s2i: ERROR: tuple to be locked was already moved to another partition due to concurrent update
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 32c965b2a0..e9a94996b4 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -62,3 +62,6 @@ test: sequence-ddl
test: async-notify
test: vacuum-reltuples
test: timeouts
+test: partition-key-update-1
+test: partition-key-update-2
+test: partition-key-update-3
diff --git a/src/test/isolation/specs/partition-key-update-1.spec b/src/test/isolation/specs/partition-key-update-1.spec
new file mode 100644
index 0000000000..db76c9a9b5
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-1.spec
@@ -0,0 +1,37 @@
+# Concurrency error from ExecUpdate and ExecDelete.
+
+# Throw an error to indicate that the targeted row has been already moved to
+# another partition in the case of concurrency where a session trying to
+# update/delete a row that's locked for a concurrent update by the another
+# session cause tuple movement to the another partition due update of partition
+# key.
+
+setup
+{
+ CREATE TABLE foo (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'ABC');
+}
+
+teardown
+{
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN; }
+step "s1u" { UPDATE foo SET a=2 WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2u" { UPDATE foo SET b='EFG' WHERE a=1; }
+step "s2d" { DELETE FROM foo WHERE a=1; }
+
+permutation "s1u" "s1c" "s2u"
+permutation "s1u" "s2u" "s1c"
+permutation "s2u" "s1u" "s1c"
+
+permutation "s1u" "s1c" "s2d"
+permutation "s1u" "s2d" "s1c"
+permutation "s2d" "s1u" "s1c"
diff --git a/src/test/isolation/specs/partition-key-update-2.spec b/src/test/isolation/specs/partition-key-update-2.spec
new file mode 100644
index 0000000000..b09e76ce21
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-2.spec
@@ -0,0 +1,39 @@
+# Concurrency error from GetTupleForTrigger
+
+# Like partition-key-update-1.spec, throw an error where a session trying to
+# update a row that has been moved to another partition due to a concurrent
+# update by other seesion.
+
+setup
+{
+ CREATE TABLE foo (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'ABC');
+ CREATE FUNCTION func_foo_mod_a() RETURNS TRIGGER AS $$
+ BEGIN
+ NEW.a = 2; -- This is changing partition key column.
+ RETURN NEW;
+ END $$ LANGUAGE PLPGSQL;
+ CREATE TRIGGER foo_mod_a BEFORE UPDATE ON foo1
+ FOR EACH ROW EXECUTE PROCEDURE func_foo_mod_a();
+}
+
+teardown
+{
+ DROP TRIGGER foo_mod_a ON foo1;
+ DROP FUNCTION func_foo_mod_a();
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN; }
+step "s1u" { UPDATE foo SET b='EFG' WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2u" { UPDATE foo SET b='XYZ' WHERE a=1; }
+
+permutation "s1u" "s1c" "s2u"
+permutation "s1u" "s2u" "s1c"
+permutation "s2u" "s1u" "s1c"
diff --git a/src/test/isolation/specs/partition-key-update-3.spec b/src/test/isolation/specs/partition-key-update-3.spec
new file mode 100644
index 0000000000..c1f547d9ba
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-3.spec
@@ -0,0 +1,30 @@
+# Concurrency error from ExecLockRows
+
+# Like partition-key-update-1.spec, throw an error where a session trying to
+# lock a row that has been moved to another partition due to a concurrent
+# update by other seesion.
+
+setup
+{
+ CREATE TABLE foo_r (a int, b text) PARTITION BY RANGE(a);
+ CREATE TABLE foo_r1 PARTITION OF foo_r FOR VALUES FROM (1) TO (10);
+ CREATE TABLE foo_r2 PARTITION OF foo_r FOR VALUES FROM (10) TO (20);
+ INSERT INTO foo_r VALUES(7, 'ABC');
+ CREATE UNIQUE INDEX foo_r1_a_unique ON foo_r1 (a);
+ CREATE TABLE bar (a int REFERENCES foo_r1(a));
+}
+
+teardown
+{
+ DROP TABLE bar, foo_r;
+}
+
+session "s1"
+setup { BEGIN; }
+step "s1u3" { UPDATE foo_r SET a=11 WHERE a=7 AND b = 'ABC'; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2i" { INSERT INTO bar VALUES(7); }
+
+permutation "s1u3" "s2i" "s1c"
--
2.14.1
0001-Invalidate-ip_blkid-v3.patch (application/octet-stream)
From e373a950edcbacd50e8f0d8a2bbf27674190f910 Mon Sep 17 00:00:00 2001
From: Amul Sul <sulamul@gmail.com>
Date: Wed, 29 Nov 2017 15:20:21 +0530
Subject: [PATCH 1/2] Invalidate ip_blkid v3
v3: Update w.r.t Amit Kapila's[3] & Alvaro Herrera[4] comments
- typo in all error message and comment : "to an another" -> "to another"
- error message change : "tuple to be updated" -> "tuple to be locked"
- In ExecOnConflictUpdate(), error report converted into assert &
comments added.
v2: Updated w.r.t Robert review comments[2]
- Updated couple of comment of heap_delete argument and ItemPointerData
- Added same concurrent update error logic in ExecOnConflictUpdate,
RelationFindReplTupleByIndex and RelationFindReplTupleSeq
v1: Initial version -- as per Amit Kapila's suggestions[1]
- When tuple is being moved to another partition then ip_blkid in the
tuple header mark to InvalidBlockNumber.
-------------
References:
-------------
1] https://postgr.es/m/CAA4eK1KEZQ%2BCyXbBzfn1jFHoEfa_OemDLhLyy7xfD1QUZLo1DQ%40mail.gmail.com
2] https://postgr.es/m/CA%2BTgmoYY98AEjh7RDtuzaLC--_0smCozXRu6bFmZTaX5Ne%3DB5Q%40mail.gmail.com
3] https://postgr.es/m/CAA4eK1LQS6TmsGaEwR9HgF-9TZTHxrdAELuX6wOZBDbbjOfDjQ@mail.gmail.com
4] https://postgr.es/m/20171124160756.eyljpmpfzwd6jmnr@alvherre.pgsql
---
src/backend/access/heap/heapam.c | 13 +++++++++++--
src/backend/commands/trigger.c | 5 +++++
src/backend/executor/execMain.c | 4 ++++
src/backend/executor/execReplication.c | 8 ++++++++
src/backend/executor/nodeLockRows.c | 5 +++++
src/backend/executor/nodeModifyTable.c | 28 ++++++++++++++++++++++++----
src/include/access/heapam.h | 2 +-
src/include/storage/itemptr.h | 4 +++-
8 files changed, 61 insertions(+), 8 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 3acef279f4..3925e3b86b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3014,6 +3014,8 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
* crosscheck - if not InvalidSnapshot, also check tuple against this
* wait - true if should wait for any conflicting update to commit/abort
* hufd - output parameter, filled in failure cases (see below)
+ * row_moved - true iff the tuple is being moved to another partition
+ * table due to an update of partition key. Otherwise, false.
*
* Normal, successful return value is HeapTupleMayBeUpdated, which
* actually means we did delete it. Failure return codes are
@@ -3029,7 +3031,7 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
HTSU_Result
heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd)
+ HeapUpdateFailureData *hufd, bool row_moved)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
@@ -3297,6 +3299,13 @@ l1:
/* Make sure there is no forward chain link in t_ctid */
tp.t_data->t_ctid = tp.t_self;
+ /*
+ * Sets a block identifier to the InvalidBlockNumber to indicate such an
+ * update being moved tuple to another partition.
+ */
+ if (row_moved)
+ BlockIdSet(&((tp.t_data->t_ctid).ip_blkid), InvalidBlockNumber);
+
MarkBufferDirty(buffer);
/*
@@ -3422,7 +3431,7 @@ simple_heap_delete(Relation relation, ItemPointer tid)
result = heap_delete(relation, tid,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &hufd);
+ &hufd, false);
switch (result)
{
case HeapTupleSelfUpdated:
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 73ec87218b..97da2addac 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -3071,6 +3071,11 @@ ltrmark:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
if (!ItemPointerEquals(&hufd.ctid, &tuple.t_self))
{
/* it was updated, so look at the updated version */
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 5ec92d5d01..6719004ed4 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2709,6 +2709,10 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
/* Should not encounter speculative tuple on recheck */
Assert(!HeapTupleHeaderIsSpeculative(tuple.t_data));
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index e11f7cb9b2..2d965597b0 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -194,6 +194,10 @@ retry:
ereport(LOG,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("concurrent update, retrying")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(LOG,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update, retrying")));
goto retry;
case HeapTupleInvisible:
elog(ERROR, "attempted to lock invisible tuple");
@@ -352,6 +356,10 @@ retry:
ereport(LOG,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("concurrent update, retrying")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(LOG,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update, retrying")));
goto retry;
case HeapTupleInvisible:
elog(ERROR, "attempted to lock invisible tuple");
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 93895600a5..6459653ba0 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -218,6 +218,11 @@ lnext:
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
if (ItemPointerEquals(&hufd.ctid, &tuple.t_self))
{
/* Tuple was deleted, so don't return it */
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 4d747cf3fb..20f349fe1d 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -771,7 +771,8 @@ ExecDelete(ModifyTableState *mtstate,
EState *estate,
bool *tuple_deleted,
bool process_returning,
- bool canSetTag)
+ bool canSetTag,
+ bool row_moved)
{
ResultRelInfo *resultRelInfo;
Relation resultRelationDesc;
@@ -864,7 +865,8 @@ ldelete:;
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
- &hufd);
+ &hufd,
+ row_moved);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -910,6 +912,11 @@ ldelete:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
TupleTableSlot *epqslot;
@@ -1194,7 +1201,7 @@ lreplace:;
* from INSERT.
*/
ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
- &tuple_deleted, false, false);
+ &tuple_deleted, false, false, true);
/*
* For some reason if DELETE didn't happen (e.g. trigger prevented
@@ -1311,6 +1318,11 @@ lreplace:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
TupleTableSlot *epqslot;
@@ -1481,6 +1493,14 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ /*
+ * As long as we don't support the UPDATE of INSERT ON CONFLICT for
+ * a partitioned table, we shouldn't reach a case where the tuple to
+ * be locked has been moved to another partition due to a concurrent
+ * update of the partition key.
+ */
+ Assert(BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))));
+
/*
* Tell caller to try again from the very start.
*
@@ -2069,7 +2089,7 @@ ExecModifyTable(PlanState *pstate)
case CMD_DELETE:
slot = ExecDelete(node, tupleid, oldtuple, planSlot,
&node->mt_epqstate, estate,
- NULL, true, node->canSetTag);
+ NULL, true, node->canSetTag, false);
break;
default:
elog(ERROR, "unknown operation");
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4e41024e92..76f56cfc94 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -156,7 +156,7 @@ extern void heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
CommandId cid, int options, BulkInsertState bistate);
extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd);
+ HeapUpdateFailureData *hufd, bool row_moved);
extern void heap_finish_speculative(Relation relation, HeapTuple tuple);
extern void heap_abort_speculative(Relation relation, HeapTuple tuple);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
diff --git a/src/include/storage/itemptr.h b/src/include/storage/itemptr.h
index 8f8e22444a..2cd02ab811 100644
--- a/src/include/storage/itemptr.h
+++ b/src/include/storage/itemptr.h
@@ -23,7 +23,9 @@
* This is a pointer to an item within a disk page of a known file
* (for example, a cross-link from an index to its parent table).
* blkid tells us which block, posid tells us which entry in the linp
- * (ItemIdData) array we want.
+ * (ItemIdData) array we want. blkid is marked InvalidBlockNumber when
+ * a tuple is moved to another partition relation due to an update of
+ * partition key.
*
* Note: because there is an item pointer in each tuple header and index
* tuple header on disk, it's very important not to waste space with
--
2.14.1
Amul,
* amul sul (sulamul@gmail.com) wrote:
Agree, updated in the attached patch. Patch 0001 also includes your
previous review comment[1] and typo correction suggested by Alvaro[2].
Looks like this needs to be rebased (though the failures aren't too bad,
from what I'm seeing), so going to mark this back to Waiting For Author.
Hopefully this also helps to wake this thread up a bit and get another
review of it.
Thanks!
Stephen
On Thu, Jan 11, 2018 at 8:06 PM, Stephen Frost <sfrost@snowman.net> wrote:
Amul,
* amul sul (sulamul@gmail.com) wrote:
Agree, updated in the attached patch. Patch 0001 also includes your
previous review comment[1] and typo correction suggested by Alvaro[2].
Looks like this needs to be rebased (though the failures aren't too bad,
from what I'm seeing), so going to mark this back to Waiting For Author.
Hopefully this also helps to wake this thread up a bit and get another
review of it.
Thanks for looking at this thread; attached herewith is an updated patch rebased on
the 'UPDATE of partition key v35' patch[1].
Regards,
Amul
1] /messages/by-id/CAJ3gD9dixkkMzNnnP1CaZ1H17-U17ok_sVbjZZo+wnB=rJH6yg@mail.gmail.com
Attachments:
0001-Invalidate-ip_blkid-v4.patch (application/octet-stream)
From c0ac767e5befc6b7e6a8d606307843f8d68d5d67 Mon Sep 17 00:00:00 2001
From: Amul Sul <sulamul@gmail.com>
Date: Fri, 12 Jan 2018 11:19:37 +0530
Subject: [PATCH 1/2] Invalidate ip_blkid v4
v4: Rebased on "UPDATE of partition key v35" patch[5].
v3: Update w.r.t. Amit Kapila's[3] & Alvaro Herrera's[4] comments
- typo in all error messages and comments: "to an another" -> "to another"
- error message change: "tuple to be updated" -> "tuple to be locked"
- In ExecOnConflictUpdate(), the error report was converted into an assert and
comments were added.
v2: Updated w.r.t. Robert's review comments[2]
- Updated a couple of comments about the heap_delete argument and ItemPointerData
- Added the same concurrent-update error logic in ExecOnConflictUpdate,
RelationFindReplTupleByIndex and RelationFindReplTupleSeq
v1: Initial version -- as per Amit Kapila's suggestions[1]
- When a tuple is being moved to another partition, ip_blkid in the
tuple header is marked InvalidBlockNumber.
-------------
References:
-------------
1] https://postgr.es/m/CAA4eK1KEZQ%2BCyXbBzfn1jFHoEfa_OemDLhLyy7xfD1QUZLo1DQ%40mail.gmail.com
2] https://postgr.es/m/CA%2BTgmoYY98AEjh7RDtuzaLC--_0smCozXRu6bFmZTaX5Ne%3DB5Q%40mail.gmail.com
3] https://postgr.es/m/CAA4eK1LQS6TmsGaEwR9HgF-9TZTHxrdAELuX6wOZBDbbjOfDjQ@mail.gmail.com
4] https://postgr.es/m/20171124160756.eyljpmpfzwd6jmnr@alvherre.pgsql
5] https://postgr.es/m/CAJ3gD9dixkkMzNnnP1CaZ1H17-U17ok_sVbjZZo+wnB=rJH6yg@mail.gmail.com
---
src/backend/access/heap/heapam.c | 13 +++++++++++--
src/backend/commands/trigger.c | 5 +++++
src/backend/executor/execMain.c | 4 ++++
src/backend/executor/execReplication.c | 8 ++++++++
src/backend/executor/nodeLockRows.c | 5 +++++
src/backend/executor/nodeModifyTable.c | 28 ++++++++++++++++++++++++----
src/include/access/heapam.h | 2 +-
src/include/storage/itemptr.h | 4 +++-
8 files changed, 61 insertions(+), 8 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index dbc8f2d6c7..afd5d79497 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3014,6 +3014,8 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
* crosscheck - if not InvalidSnapshot, also check tuple against this
* wait - true if should wait for any conflicting update to commit/abort
* hufd - output parameter, filled in failure cases (see below)
+ * row_moved - true iff the tuple is being moved to another partition
+ * table due to an update of partition key. Otherwise, false.
*
* Normal, successful return value is HeapTupleMayBeUpdated, which
* actually means we did delete it. Failure return codes are
@@ -3029,7 +3031,7 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
HTSU_Result
heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd)
+ HeapUpdateFailureData *hufd, bool row_moved)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
@@ -3297,6 +3299,13 @@ l1:
/* Make sure there is no forward chain link in t_ctid */
tp.t_data->t_ctid = tp.t_self;
+ /*
+ * Set the block identifier to InvalidBlockNumber to indicate that this
+ * update moved the tuple to another partition.
+ */
+ if (row_moved)
+ BlockIdSet(&((tp.t_data->t_ctid).ip_blkid), InvalidBlockNumber);
+
MarkBufferDirty(buffer);
/*
@@ -3422,7 +3431,7 @@ simple_heap_delete(Relation relation, ItemPointer tid)
result = heap_delete(relation, tid,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &hufd);
+ &hufd, false);
switch (result)
{
case HeapTupleSelfUpdated:
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index e8af18e254..f943666c40 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -3071,6 +3071,11 @@ ltrmark:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
if (!ItemPointerEquals(&hufd.ctid, &tuple.t_self))
{
/* it was updated, so look at the updated version */
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 16822e962a..4115604011 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2709,6 +2709,10 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
/* Should not encounter speculative tuple on recheck */
Assert(!HeapTupleHeaderIsSpeculative(tuple.t_data));
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 32891abbdf..9016d8fb11 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -194,6 +194,10 @@ retry:
ereport(LOG,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("concurrent update, retrying")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(LOG,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update, retrying")));
goto retry;
case HeapTupleInvisible:
elog(ERROR, "attempted to lock invisible tuple");
@@ -352,6 +356,10 @@ retry:
ereport(LOG,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("concurrent update, retrying")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(LOG,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update, retrying")));
goto retry;
case HeapTupleInvisible:
elog(ERROR, "attempted to lock invisible tuple");
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 7961b4be6a..b07b7092de 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -218,6 +218,11 @@ lnext:
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
if (ItemPointerEquals(&hufd.ctid, &tuple.t_self))
{
/* Tuple was deleted, so don't return it */
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index e9c0b23172..7f3cbaa00e 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -708,7 +708,8 @@ ExecDelete(ModifyTableState *mtstate,
EState *estate,
bool *tupleDeleted,
bool processReturning,
- bool canSetTag)
+ bool canSetTag,
+ bool row_moved)
{
ResultRelInfo *resultRelInfo;
Relation resultRelationDesc;
@@ -799,7 +800,8 @@ ldelete:;
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
- &hufd);
+ &hufd,
+ row_moved);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -845,6 +847,11 @@ ldelete:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
TupleTableSlot *epqslot;
@@ -1149,7 +1156,7 @@ lreplace:;
* from INSERT.
*/
ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
- &tuple_deleted, false, false);
+ &tuple_deleted, false, false, true);
/*
* For some reason if DELETE didn't happen (e.g. trigger prevented
@@ -1292,6 +1299,11 @@ lreplace:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
TupleTableSlot *epqslot;
@@ -1462,6 +1474,14 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ /*
+ * As long as we don't support the UPDATE of INSERT ON CONFLICT for
+ * a partitioned table, we shouldn't reach a case where the tuple to
+ * be locked has been moved to another partition due to a concurrent
+ * update of the partition key.
+ */
+ Assert(BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))));
+
/*
* Tell caller to try again from the very start.
*
@@ -2055,7 +2075,7 @@ ExecModifyTable(PlanState *pstate)
case CMD_DELETE:
slot = ExecDelete(node, tupleid, oldtuple, planSlot,
&node->mt_epqstate, estate,
- NULL, true, node->canSetTag);
+ NULL, true, node->canSetTag, false);
break;
default:
elog(ERROR, "unknown operation");
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4c0256b18a..44a211a740 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -156,7 +156,7 @@ extern void heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
CommandId cid, int options, BulkInsertState bistate);
extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd);
+ HeapUpdateFailureData *hufd, bool row_moved);
extern void heap_finish_speculative(Relation relation, HeapTuple tuple);
extern void heap_abort_speculative(Relation relation, HeapTuple tuple);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
diff --git a/src/include/storage/itemptr.h b/src/include/storage/itemptr.h
index 6c9ed3696b..79dceb414f 100644
--- a/src/include/storage/itemptr.h
+++ b/src/include/storage/itemptr.h
@@ -23,7 +23,9 @@
* This is a pointer to an item within a disk page of a known file
* (for example, a cross-link from an index to its parent table).
* blkid tells us which block, posid tells us which entry in the linp
- * (ItemIdData) array we want.
+ * (ItemIdData) array we want. blkid is marked InvalidBlockNumber when
+ * a tuple is moved to another partition relation due to an update of
+ * partition key.
*
* Note: because there is an item pointer in each tuple header and index
* tuple header on disk, it's very important not to waste space with
--
2.14.1
0002-isolation-tests-v3.patch (application/octet-stream)
From df66ff9304da898ffe804c934e1c2fc3a8829a54 Mon Sep 17 00:00:00 2001
From: Amul Sul <sulamul@gmail.com>
Date: Fri, 12 Jan 2018 11:30:40 +0530
Subject: [PATCH 2/2] isolation tests v3
v3:
- Rebase on "UPDATE of partition key v35" patch[2] and
latest maste head[3].
v2:
- Error message changed.
- Can't add isolation test[1] for
RelationFindReplTupleByIndex & RelationFindReplTupleSeq
- In ExecOnConflictUpdate, the error report is converted to assert
check.
v1:
Added isolation tests to hit an error in the following functions:
1. ExecUpdate -> specs/partition-key-update-1
2. ExecDelete -> specs/partition-key-update-1
3. GetTupleForTrigger -> specs/partition-key-update-2
4. ExecLockRows -> specs/partition-key-update-3
------------
TODOs:
------------
Tests for the following function are yet to be added.
1. EvalPlanQualFetch
------------
References:
------------
1] https://postgr.es/m/CA+TgmoYsMRo2PHFTGUFifv4ZSCZ9LNJASbOyb=9it2=UA4j4vw@mail.gmail.com
2] https://postgr.es/m/CAJ3gD9dixkkMzNnnP1CaZ1H17-U17ok_sVbjZZo+wnB=rJH6yg@mail.gmail.com
3] Commit id bdb70c12b3a2e69eec6e51411df60d9f43ecc841
---
.../isolation/expected/partition-key-update-1.out | 35 +++++++++++++++++++
.../isolation/expected/partition-key-update-2.out | 18 ++++++++++
.../isolation/expected/partition-key-update-3.out | 8 +++++
src/test/isolation/isolation_schedule | 3 ++
.../isolation/specs/partition-key-update-1.spec | 37 ++++++++++++++++++++
.../isolation/specs/partition-key-update-2.spec | 39 ++++++++++++++++++++++
.../isolation/specs/partition-key-update-3.spec | 30 +++++++++++++++++
7 files changed, 170 insertions(+)
create mode 100644 src/test/isolation/expected/partition-key-update-1.out
create mode 100644 src/test/isolation/expected/partition-key-update-2.out
create mode 100644 src/test/isolation/expected/partition-key-update-3.out
create mode 100644 src/test/isolation/specs/partition-key-update-1.spec
create mode 100644 src/test/isolation/specs/partition-key-update-2.spec
create mode 100644 src/test/isolation/specs/partition-key-update-3.spec
diff --git a/src/test/isolation/expected/partition-key-update-1.out b/src/test/isolation/expected/partition-key-update-1.out
new file mode 100644
index 0000000000..c33960a0d2
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-1.out
@@ -0,0 +1,35 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u s1c s2u
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+step s2u: UPDATE foo SET b='EFG' WHERE a=1;
+
+starting permutation: s1u s2u s1c
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s2u: UPDATE foo SET b='EFG' WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2u: <... completed>
+error in steps s1c s2u: ERROR: tuple to be locked was already moved to another partition due to concurrent update
+
+starting permutation: s2u s1u s1c
+step s2u: UPDATE foo SET b='EFG' WHERE a=1;
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+
+starting permutation: s1u s1c s2d
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+step s2d: DELETE FROM foo WHERE a=1;
+
+starting permutation: s1u s2d s1c
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s2d: DELETE FROM foo WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2d: <... completed>
+error in steps s1c s2d: ERROR: tuple to be locked was already moved to another partition due to concurrent update
+
+starting permutation: s2d s1u s1c
+step s2d: DELETE FROM foo WHERE a=1;
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-2.out b/src/test/isolation/expected/partition-key-update-2.out
new file mode 100644
index 0000000000..195ec4cedf
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-2.out
@@ -0,0 +1,18 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u s1c s2u
+step s1u: UPDATE foo SET b='EFG' WHERE a=1;
+step s1c: COMMIT;
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1;
+
+starting permutation: s1u s2u s1c
+step s1u: UPDATE foo SET b='EFG' WHERE a=1;
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2u: <... completed>
+error in steps s1c s2u: ERROR: tuple to be locked was already moved to another partition due to concurrent update
+
+starting permutation: s2u s1u s1c
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1;
+step s1u: UPDATE foo SET b='EFG' WHERE a=1;
+step s1c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-3.out b/src/test/isolation/expected/partition-key-update-3.out
new file mode 100644
index 0000000000..1922bdce46
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-3.out
@@ -0,0 +1,8 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u3 s2i s1c
+step s1u3: UPDATE foo_r SET a=11 WHERE a=7 AND b = 'ABC';
+step s2i: INSERT INTO bar VALUES(7); <waiting ...>
+step s1c: COMMIT;
+step s2i: <... completed>
+error in steps s1c s2i: ERROR: tuple to be locked was already moved to another partition due to concurrent update
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index befe676816..3545b1b758 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -65,3 +65,6 @@ test: async-notify
test: vacuum-reltuples
test: timeouts
test: vacuum-concurrent-drop
+test: partition-key-update-1
+test: partition-key-update-2
+test: partition-key-update-3
diff --git a/src/test/isolation/specs/partition-key-update-1.spec b/src/test/isolation/specs/partition-key-update-1.spec
new file mode 100644
index 0000000000..db76c9a9b5
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-1.spec
@@ -0,0 +1,37 @@
+# Concurrency error from ExecUpdate and ExecDelete.
+
+# Throw an error to indicate that the targeted row has already been moved to
+# another partition, for the concurrent case where a session trying to
+# update/delete a row that is locked for a concurrent update by another
+# session causes tuple movement to another partition due to an update of the
+# partition key.
+
+setup
+{
+ CREATE TABLE foo (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'ABC');
+}
+
+teardown
+{
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN; }
+step "s1u" { UPDATE foo SET a=2 WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2u" { UPDATE foo SET b='EFG' WHERE a=1; }
+step "s2d" { DELETE FROM foo WHERE a=1; }
+
+permutation "s1u" "s1c" "s2u"
+permutation "s1u" "s2u" "s1c"
+permutation "s2u" "s1u" "s1c"
+
+permutation "s1u" "s1c" "s2d"
+permutation "s1u" "s2d" "s1c"
+permutation "s2d" "s1u" "s1c"
diff --git a/src/test/isolation/specs/partition-key-update-2.spec b/src/test/isolation/specs/partition-key-update-2.spec
new file mode 100644
index 0000000000..b09e76ce21
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-2.spec
@@ -0,0 +1,39 @@
+# Concurrency error from GetTupleForTrigger
+
+# Like partition-key-update-1.spec, throw an error when a session tries to
+# update a row that has been moved to another partition due to a concurrent
+# update by another session.
+
+setup
+{
+ CREATE TABLE foo (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'ABC');
+ CREATE FUNCTION func_foo_mod_a() RETURNS TRIGGER AS $$
+ BEGIN
+ NEW.a = 2; -- This is changing partition key column.
+ RETURN NEW;
+ END $$ LANGUAGE PLPGSQL;
+ CREATE TRIGGER foo_mod_a BEFORE UPDATE ON foo1
+ FOR EACH ROW EXECUTE PROCEDURE func_foo_mod_a();
+}
+
+teardown
+{
+ DROP TRIGGER foo_mod_a ON foo1;
+ DROP FUNCTION func_foo_mod_a();
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN; }
+step "s1u" { UPDATE foo SET b='EFG' WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2u" { UPDATE foo SET b='XYZ' WHERE a=1; }
+
+permutation "s1u" "s1c" "s2u"
+permutation "s1u" "s2u" "s1c"
+permutation "s2u" "s1u" "s1c"
diff --git a/src/test/isolation/specs/partition-key-update-3.spec b/src/test/isolation/specs/partition-key-update-3.spec
new file mode 100644
index 0000000000..c1f547d9ba
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-3.spec
@@ -0,0 +1,30 @@
+# Concurrency error from ExecLockRows
+
+# Like partition-key-update-1.spec, throw an error when a session tries to
+# lock a row that has been moved to another partition due to a concurrent
+# update by another session.
+
+setup
+{
+ CREATE TABLE foo_r (a int, b text) PARTITION BY RANGE(a);
+ CREATE TABLE foo_r1 PARTITION OF foo_r FOR VALUES FROM (1) TO (10);
+ CREATE TABLE foo_r2 PARTITION OF foo_r FOR VALUES FROM (10) TO (20);
+ INSERT INTO foo_r VALUES(7, 'ABC');
+ CREATE UNIQUE INDEX foo_r1_a_unique ON foo_r1 (a);
+ CREATE TABLE bar (a int REFERENCES foo_r1(a));
+}
+
+teardown
+{
+ DROP TABLE bar, foo_r;
+}
+
+session "s1"
+setup { BEGIN; }
+step "s1u3" { UPDATE foo_r SET a=11 WHERE a=7 AND b = 'ABC'; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2i" { INSERT INTO bar VALUES(7); }
+
+permutation "s1u3" "s2i" "s1c"
--
2.14.1
On Fri, Jan 12, 2018 at 11:43 AM, amul sul <sulamul@gmail.com> wrote:
Thanks for looking at this thread; attached herewith is an updated patch rebased on
the 'UPDATE of partition key v35' patch[1].
ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
- &tuple_deleted, false, false);
+ &tuple_deleted, false, false, true);
/*
* For some reason if DELETE didn't happen (e.g. trigger prevented
@@ -1292,6 +1299,11 @@ lreplace:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition
due to concurrent update")));
I had asked to change the "tuple to be updated .." message only after the
heap_lock_tuple calls, not in nodeModifyTable.c, so please revert the
message in nodeModifyTable.c.
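To make the distinction concrete, here is a sketch assembled from the patch hunks (not new code): call sites that fail while merely locking a row, such as GetTupleForTrigger or nodeLockRows.c, keep the "tuple to be locked" wording, while ExecUpdate/ExecDelete in nodeModifyTable.c should report on the row they were about to modify:

    /* After a failed heap_lock_tuple(): the row was only being locked. */
    if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("tuple to be locked was already moved to another partition due to concurrent update")));

    /* In ExecUpdate()/ExecDelete(), after heap_update()/heap_delete() fails. */
    if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("tuple to be updated was already moved to another partition due to concurrent update")));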
Have you verified the changes in execReplication.c in some way? It is
not clear to me how you ensure that the special value
(InvalidBlockNumber) gets set in the CTID for a delete operation performed
via logical replication. The logical replication worker uses the function
ExecSimpleRelationDelete to perform the delete, and there is no way for it
to pass the correct value of row_moved to heap_delete. Am I missing
something due to which we don't need to do this?
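For readers following along, a minimal sketch of the path being described (illustrative only; the function name below is made up, and it simply mirrors what simple_heap_delete() does with this patch's heap_delete signature):

    #include "postgres.h"
    #include "access/heapam.h"
    #include "access/xact.h"
    #include "utils/rel.h"

    /*
     * Illustrative only: a delete issued on behalf of the logical
     * replication worker reaches heap_delete() like this, with row_moved
     * hard-coded to false, so ip_blkid is never set to InvalidBlockNumber
     * on this path.
     */
    static void
    replication_delete_path_sketch(Relation rel, ItemPointer tid)
    {
        HeapUpdateFailureData hufd;

        (void) heap_delete(rel, tid,
                           GetCurrentCommandId(true), InvalidSnapshot,
                           true /* wait for commit */ ,
                           &hufd, false /* row_moved */ );
    }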
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Jan 23, 2018 at 7:01 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Jan 12, 2018 at 11:43 AM, amul sul <sulamul@gmail.com> wrote:
[....]
I had asked to change the "tuple to be updated .." message only after the
heap_lock_tuple calls, not in nodeModifyTable.c, so please revert the
message in nodeModifyTable.c.
Understood, fixed in the attached version; sorry I'd missed it.
Have you verified the changes in execReplication.c in some way? It is
not clear to me how you ensure that the special value
(InvalidBlockNumber) gets set in the CTID for a delete operation performed
via logical replication. The logical replication worker uses the function
ExecSimpleRelationDelete to perform the delete, and there is no way for it
to pass the correct value of row_moved to heap_delete. Am I missing
something due to which we don't need to do this?
You are correct: from ExecSimpleRelationDelete, heap_delete will always
receive row_moved = false, so InvalidBlockNumber will never be set.
I couldn't find any test case that hits the changes in execReplication.c. I am
not sure what we are supposed to do here, nor how much worse having only a LOG
message really is.
What do you think: should we add an assert like in EvalPlanQualFetch() here, or
is the current LOG message fine?
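Whichever way that decision goes, all of these call sites repeat the same test; a small convenience helper (purely hypothetical, not part of the attached patch) would keep the chosen behaviour consistent:

    /*
     * Hypothetical helper (not in the patch): does the ctid returned in
     * HeapUpdateFailureData say the old row was moved to another partition?
     */
    #define HufdCtidIndicatesMovedRow(hufd) \
        (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd).ctid.ip_blkid))))

    /* Example call site, e.g. in nodeLockRows.c: */
    if (HufdCtidIndicatesMovedRow(hufd))
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("tuple to be locked was already moved to another partition due to concurrent update")));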
Thanks for the review.
Regards,
Amul Sul
Attachments:
0001-Invalidate-ip_blkid-v5-wip.patch (application/octet-stream)
From 96f4340ddaf77bf7a4171d2b36af3fa8014089c3 Mon Sep 17 00:00:00 2001
From: Amul Sul <sulamul@gmail.com>
Date: Wed, 24 Jan 2018 10:28:15 +0530
Subject: [PATCH 1/2] Invalidate ip_blkid v5-wip
v5-wip: Update w.r.t. Amit Kapila's comments[6].
- Reverted the error message in nodeModifyTable.c from 'tuple to be locked'
to 'tuple to be updated'.
- TODO:
1. Yet to make a decision on having LOG/ELOG/ASSERT in
RelationFindReplTupleByIndex() and RelationFindReplTupleSeq().
v4: Rebased on "UPDATE of partition key v35" patch[5].
v3: Update w.r.t. Amit Kapila's[3] & Alvaro Herrera's[4] comments
- typo in all error messages and comments: "to an another" -> "to another"
- error message change: "tuple to be updated" -> "tuple to be locked"
- In ExecOnConflictUpdate(), the error report was converted into an assert and
comments were added.
v2: Updated w.r.t. Robert's review comments[2]
- Updated a couple of comments about the heap_delete argument and ItemPointerData
- Added the same concurrent-update error logic in ExecOnConflictUpdate,
RelationFindReplTupleByIndex and RelationFindReplTupleSeq
v1: Initial version -- as per Amit Kapila's suggestions[1]
- When a tuple is being moved to another partition, ip_blkid in the
tuple header is marked InvalidBlockNumber.
-------------
References:
-------------
1] https://postgr.es/m/CAA4eK1KEZQ%2BCyXbBzfn1jFHoEfa_OemDLhLyy7xfD1QUZLo1DQ%40mail.gmail.com
2] https://postgr.es/m/CA%2BTgmoYY98AEjh7RDtuzaLC--_0smCozXRu6bFmZTaX5Ne%3DB5Q%40mail.gmail.com
3] https://postgr.es/m/CAA4eK1LQS6TmsGaEwR9HgF-9TZTHxrdAELuX6wOZBDbbjOfDjQ@mail.gmail.com
4] https://postgr.es/m/20171124160756.eyljpmpfzwd6jmnr@alvherre.pgsql
5] https://postgr.es/m/CAJ3gD9dixkkMzNnnP1CaZ1H17-U17ok_sVbjZZo+wnB=rJH6yg@mail.gmail.com
6] https://postgr.es/m/CAA4eK1LHVnNWYF53F1gUGx6CTxuvznozvU-Lr-dfE=Qeu1gEcg@mail.gmail.com
---
src/backend/access/heap/heapam.c | 13 +++++++++++--
src/backend/commands/trigger.c | 5 +++++
src/backend/executor/execMain.c | 4 ++++
src/backend/executor/execReplication.c | 8 ++++++++
src/backend/executor/nodeLockRows.c | 5 +++++
src/backend/executor/nodeModifyTable.c | 28 ++++++++++++++++++++++++----
src/include/access/heapam.h | 2 +-
src/include/storage/itemptr.h | 4 +++-
8 files changed, 61 insertions(+), 8 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index be263850cd..f93d450416 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3017,6 +3017,8 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
* crosscheck - if not InvalidSnapshot, also check tuple against this
* wait - true if should wait for any conflicting update to commit/abort
* hufd - output parameter, filled in failure cases (see below)
+ * row_moved - true iff the tuple is being moved to another partition
+ * table due to an update of partition key. Otherwise, false.
*
* Normal, successful return value is HeapTupleMayBeUpdated, which
* actually means we did delete it. Failure return codes are
@@ -3032,7 +3034,7 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
HTSU_Result
heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd)
+ HeapUpdateFailureData *hufd, bool row_moved)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
@@ -3300,6 +3302,13 @@ l1:
/* Make sure there is no forward chain link in t_ctid */
tp.t_data->t_ctid = tp.t_self;
+ /*
+ * Set the block identifier to InvalidBlockNumber to indicate that this
+ * update moved the tuple to another partition.
+ */
+ if (row_moved)
+ BlockIdSet(&((tp.t_data->t_ctid).ip_blkid), InvalidBlockNumber);
+
MarkBufferDirty(buffer);
/*
@@ -3425,7 +3434,7 @@ simple_heap_delete(Relation relation, ItemPointer tid)
result = heap_delete(relation, tid,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &hufd);
+ &hufd, false);
switch (result)
{
case HeapTupleSelfUpdated:
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 160d941c00..a770531e14 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -3071,6 +3071,11 @@ ltrmark:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
if (!ItemPointerEquals(&hufd.ctid, &tuple.t_self))
{
/* it was updated, so look at the updated version */
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 410921cc40..98e198f0b7 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2709,6 +2709,10 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
/* Should not encounter speculative tuple on recheck */
Assert(!HeapTupleHeaderIsSpeculative(tuple.t_data));
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 32891abbdf..9016d8fb11 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -194,6 +194,10 @@ retry:
ereport(LOG,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("concurrent update, retrying")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(LOG,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update, retrying")));
goto retry;
case HeapTupleInvisible:
elog(ERROR, "attempted to lock invisible tuple");
@@ -352,6 +356,10 @@ retry:
ereport(LOG,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("concurrent update, retrying")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(LOG,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update, retrying")));
goto retry;
case HeapTupleInvisible:
elog(ERROR, "attempted to lock invisible tuple");
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 7961b4be6a..b07b7092de 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -218,6 +218,11 @@ lnext:
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
if (ItemPointerEquals(&hufd.ctid, &tuple.t_self))
{
/* Tuple was deleted, so don't return it */
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 6c2f8d4ec0..f45e0accb4 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -711,7 +711,8 @@ ExecDelete(ModifyTableState *mtstate,
EState *estate,
bool *tupleDeleted,
bool processReturning,
- bool canSetTag)
+ bool canSetTag,
+ bool row_moved)
{
ResultRelInfo *resultRelInfo;
Relation resultRelationDesc;
@@ -802,7 +803,8 @@ ldelete:;
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
- &hufd);
+ &hufd,
+ row_moved);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -848,6 +850,11 @@ ldelete:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to another partition due to concurrent update")));
+
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
TupleTableSlot *epqslot;
@@ -1150,7 +1157,7 @@ lreplace:;
* processing. We want to return rows from INSERT.
*/
ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
- &tuple_deleted, false, false);
+ &tuple_deleted, false, false, true);
/*
* For some reason if DELETE didn't happen (e.g. trigger prevented
@@ -1295,6 +1302,11 @@ lreplace:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to another partition due to concurrent update")));
+
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
TupleTableSlot *epqslot;
@@ -1465,6 +1477,14 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ /*
+ * As long as we don't support the UPDATE of INSERT ON CONFLICT for
+ * a partitioned table, we shouldn't reach a case where the tuple to
+ * be locked has been moved to another partition due to a concurrent
+ * update of the partition key.
+ */
+ Assert(BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))));
+
/*
* Tell caller to try again from the very start.
*
@@ -2053,7 +2073,7 @@ ExecModifyTable(PlanState *pstate)
case CMD_DELETE:
slot = ExecDelete(node, tupleid, oldtuple, planSlot,
&node->mt_epqstate, estate,
- NULL, true, node->canSetTag);
+ NULL, true, node->canSetTag, false);
break;
default:
elog(ERROR, "unknown operation");
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4c0256b18a..44a211a740 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -156,7 +156,7 @@ extern void heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
CommandId cid, int options, BulkInsertState bistate);
extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd);
+ HeapUpdateFailureData *hufd, bool row_moved);
extern void heap_finish_speculative(Relation relation, HeapTuple tuple);
extern void heap_abort_speculative(Relation relation, HeapTuple tuple);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
diff --git a/src/include/storage/itemptr.h b/src/include/storage/itemptr.h
index 6c9ed3696b..79dceb414f 100644
--- a/src/include/storage/itemptr.h
+++ b/src/include/storage/itemptr.h
@@ -23,7 +23,9 @@
* This is a pointer to an item within a disk page of a known file
* (for example, a cross-link from an index to its parent table).
* blkid tells us which block, posid tells us which entry in the linp
- * (ItemIdData) array we want.
+ * (ItemIdData) array we want. blkid is marked InvalidBlockNumber when
+ * a tuple is moved to another partition relation due to an update of
+ * partition key.
*
* Note: because there is an item pointer in each tuple header and index
* tuple header on disk, it's very important not to waste space with
--
2.14.1
0002-isolation-tests-v3.patch (application/octet-stream)
From ccb21b225d37a608175ec6daf15ac0335cd4f9ab Mon Sep 17 00:00:00 2001
From: Amul Sul <sulamul@gmail.com>
Date: Fri, 12 Jan 2018 11:30:40 +0530
Subject: [PATCH 2/2] isolation tests v3
v3:
- Rebase on "UPDATE of partition key v35" patch[2] and
latest maste head[3].
v2:
- Error message changed.
- Can't add isolation test[1] for
RelationFindReplTupleByIndex & RelationFindReplTupleSeq
- In ExecOnConflictUpdate, the error report is converted to assert
check.
v1:
Added isolation tests to hit an error in the following functions:
1. ExecUpdate -> specs/partition-key-update-1
2. ExecDelete -> specs/partition-key-update-1
3. GetTupleForTrigger -> specs/partition-key-update-2
4. ExecLockRows -> specs/partition-key-update-3
------------
TODOs:
------------
Tests for the following function are yet to be added.
1. EvalPlanQualFetch
------------
References:
------------
1] https://postgr.es/m/CA+TgmoYsMRo2PHFTGUFifv4ZSCZ9LNJASbOyb=9it2=UA4j4vw@mail.gmail.com
2] https://postgr.es/m/CAJ3gD9dixkkMzNnnP1CaZ1H17-U17ok_sVbjZZo+wnB=rJH6yg@mail.gmail.com
3] Commit id bdb70c12b3a2e69eec6e51411df60d9f43ecc841
---
.../isolation/expected/partition-key-update-1.out | 35 +++++++++++++++++++
.../isolation/expected/partition-key-update-2.out | 18 ++++++++++
.../isolation/expected/partition-key-update-3.out | 8 +++++
src/test/isolation/isolation_schedule | 3 ++
.../isolation/specs/partition-key-update-1.spec | 37 ++++++++++++++++++++
.../isolation/specs/partition-key-update-2.spec | 39 ++++++++++++++++++++++
.../isolation/specs/partition-key-update-3.spec | 30 +++++++++++++++++
7 files changed, 170 insertions(+)
create mode 100644 src/test/isolation/expected/partition-key-update-1.out
create mode 100644 src/test/isolation/expected/partition-key-update-2.out
create mode 100644 src/test/isolation/expected/partition-key-update-3.out
create mode 100644 src/test/isolation/specs/partition-key-update-1.spec
create mode 100644 src/test/isolation/specs/partition-key-update-2.spec
create mode 100644 src/test/isolation/specs/partition-key-update-3.spec
diff --git a/src/test/isolation/expected/partition-key-update-1.out b/src/test/isolation/expected/partition-key-update-1.out
new file mode 100644
index 0000000000..56bf4450b0
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-1.out
@@ -0,0 +1,35 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u s1c s2u
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+step s2u: UPDATE foo SET b='EFG' WHERE a=1;
+
+starting permutation: s1u s2u s1c
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s2u: UPDATE foo SET b='EFG' WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2u: <... completed>
+error in steps s1c s2u: ERROR: tuple to be updated was already moved to another partition due to concurrent update
+
+starting permutation: s2u s1u s1c
+step s2u: UPDATE foo SET b='EFG' WHERE a=1;
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+
+starting permutation: s1u s1c s2d
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+step s2d: DELETE FROM foo WHERE a=1;
+
+starting permutation: s1u s2d s1c
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s2d: DELETE FROM foo WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2d: <... completed>
+error in steps s1c s2d: ERROR: tuple to be updated was already moved to another partition due to concurrent update
+
+starting permutation: s2d s1u s1c
+step s2d: DELETE FROM foo WHERE a=1;
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-2.out b/src/test/isolation/expected/partition-key-update-2.out
new file mode 100644
index 0000000000..195ec4cedf
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-2.out
@@ -0,0 +1,18 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u s1c s2u
+step s1u: UPDATE foo SET b='EFG' WHERE a=1;
+step s1c: COMMIT;
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1;
+
+starting permutation: s1u s2u s1c
+step s1u: UPDATE foo SET b='EFG' WHERE a=1;
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2u: <... completed>
+error in steps s1c s2u: ERROR: tuple to be locked was already moved to another partition due to concurrent update
+
+starting permutation: s2u s1u s1c
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1;
+step s1u: UPDATE foo SET b='EFG' WHERE a=1;
+step s1c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-3.out b/src/test/isolation/expected/partition-key-update-3.out
new file mode 100644
index 0000000000..1922bdce46
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-3.out
@@ -0,0 +1,8 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u3 s2i s1c
+step s1u3: UPDATE foo_r SET a=11 WHERE a=7 AND b = 'ABC';
+step s2i: INSERT INTO bar VALUES(7); <waiting ...>
+step s1c: COMMIT;
+step s2i: <... completed>
+error in steps s1c s2i: ERROR: tuple to be locked was already moved to another partition due to concurrent update
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 74d7d59546..9bda495de3 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -66,3 +66,6 @@ test: async-notify
test: vacuum-reltuples
test: timeouts
test: vacuum-concurrent-drop
+test: partition-key-update-1
+test: partition-key-update-2
+test: partition-key-update-3
diff --git a/src/test/isolation/specs/partition-key-update-1.spec b/src/test/isolation/specs/partition-key-update-1.spec
new file mode 100644
index 0000000000..db76c9a9b5
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-1.spec
@@ -0,0 +1,37 @@
+# Concurrency error from ExecUpdate and ExecDelete.
+
+# Throw an error to indicate that the targeted row has already been moved to
+# another partition, for the concurrent case where a session trying to
+# update/delete a row that is locked for a concurrent update by another
+# session causes tuple movement to another partition due to an update of the
+# partition key.
+
+setup
+{
+ CREATE TABLE foo (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'ABC');
+}
+
+teardown
+{
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN; }
+step "s1u" { UPDATE foo SET a=2 WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2u" { UPDATE foo SET b='EFG' WHERE a=1; }
+step "s2d" { DELETE FROM foo WHERE a=1; }
+
+permutation "s1u" "s1c" "s2u"
+permutation "s1u" "s2u" "s1c"
+permutation "s2u" "s1u" "s1c"
+
+permutation "s1u" "s1c" "s2d"
+permutation "s1u" "s2d" "s1c"
+permutation "s2d" "s1u" "s1c"
diff --git a/src/test/isolation/specs/partition-key-update-2.spec b/src/test/isolation/specs/partition-key-update-2.spec
new file mode 100644
index 0000000000..b09e76ce21
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-2.spec
@@ -0,0 +1,39 @@
+# Concurrency error from GetTupleForTrigger
+
+# Like partition-key-update-1.spec, throw an error when a session tries to
+# update a row that has been moved to another partition due to a concurrent
+# update by another session.
+
+setup
+{
+ CREATE TABLE foo (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'ABC');
+ CREATE FUNCTION func_foo_mod_a() RETURNS TRIGGER AS $$
+ BEGIN
+ NEW.a = 2; -- This is changing partition key column.
+ RETURN NEW;
+ END $$ LANGUAGE PLPGSQL;
+ CREATE TRIGGER foo_mod_a BEFORE UPDATE ON foo1
+ FOR EACH ROW EXECUTE PROCEDURE func_foo_mod_a();
+}
+
+teardown
+{
+ DROP TRIGGER foo_mod_a ON foo1;
+ DROP FUNCTION func_foo_mod_a();
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN; }
+step "s1u" { UPDATE foo SET b='EFG' WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2u" { UPDATE foo SET b='XYZ' WHERE a=1; }
+
+permutation "s1u" "s1c" "s2u"
+permutation "s1u" "s2u" "s1c"
+permutation "s2u" "s1u" "s1c"
diff --git a/src/test/isolation/specs/partition-key-update-3.spec b/src/test/isolation/specs/partition-key-update-3.spec
new file mode 100644
index 0000000000..c1f547d9ba
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-3.spec
@@ -0,0 +1,30 @@
+# Concurrency error from ExecLockRows
+
+# Like partition-key-update-1.spec, throw an error when a session tries to
+# lock a row that has been moved to another partition due to a concurrent
+# update by another session.
+
+setup
+{
+ CREATE TABLE foo_r (a int, b text) PARTITION BY RANGE(a);
+ CREATE TABLE foo_r1 PARTITION OF foo_r FOR VALUES FROM (1) TO (10);
+ CREATE TABLE foo_r2 PARTITION OF foo_r FOR VALUES FROM (10) TO (20);
+ INSERT INTO foo_r VALUES(7, 'ABC');
+ CREATE UNIQUE INDEX foo_r1_a_unique ON foo_r1 (a);
+ CREATE TABLE bar (a int REFERENCES foo_r1(a));
+}
+
+teardown
+{
+ DROP TABLE bar, foo_r;
+}
+
+session "s1"
+setup { BEGIN; }
+step "s1u3" { UPDATE foo_r SET a=11 WHERE a=7 AND b = 'ABC'; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2i" { INSERT INTO bar VALUES(7); }
+
+permutation "s1u3" "s2i" "s1c"
--
2.14.1
On Wed, Jan 24, 2018 at 12:44 PM, amul sul <sulamul@gmail.com> wrote:
On Tue, Jan 23, 2018 at 7:01 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Jan 12, 2018 at 11:43 AM, amul sul <sulamul@gmail.com> wrote:
[....]
I had asked to change the "tuple to be updated .." message only after the
heap_lock_tuple calls, not in nodeModifyTable.c, so please revert the
message in nodeModifyTable.c.
Understood, fixed in the attached version; sorry I'd missed it.
Have you verified the changes in execReplication.c in some way? It is
not clear to me how you ensure that the special value
(InvalidBlockNumber) gets set in the CTID for a delete operation performed
via logical replication. The logical replication worker uses the function
ExecSimpleRelationDelete to perform the delete, and there is no way for it
to pass the correct value of row_moved to heap_delete. Am I missing
something due to which we don't need to do this?
You are correct: from ExecSimpleRelationDelete, heap_delete will always
receive row_moved = false, so InvalidBlockNumber will never be set.
So, this means that in the case of logical replication, it won't generate
the error this patch is trying to introduce. I think if we want to
handle this, we need some changes in WAL and logical decoding as well.
Robert, others, what do you think? I am not very comfortable leaving
this unaddressed; if we don't want to do anything about it, at least
we should document it.
I couldn't find any test case that hits the changes in execReplication.c. I am
not sure what we are supposed to do here, nor how much worse having only a LOG
message really is.
I think you can manually (via debugger) hit this by using the
PUBLICATION/SUBSCRIPTION syntax for logical replication. I think what
you need to do is: on node-1, create a partitioned table and subscribe to
it on node-2. Now, perform an UPDATE on node-1, then stop the logical
replication worker before it calls heap_lock_tuple. Now, on node-2,
update the same row such that the row is moved. Now, continue the
logical replication worker. I think it should hit your new code; if
not, then we need to think of some other way.
What do you think: should we add an assert like in EvalPlanQualFetch() here, or
is the current LOG message fine?
I think first let's try to hit this case.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hi Amit,
Sorry for the delayed response.
On Fri, Jan 26, 2018 at 11:58 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Jan 24, 2018 at 12:44 PM, amul sul <sulamul@gmail.com> wrote:
On Tue, Jan 23, 2018 at 7:01 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Jan 12, 2018 at 11:43 AM, amul sul <sulamul@gmail.com> wrote:
[....]
I think you can manually (via debugger) hit this by using
PUBLICATION/SUBSCRIPTION syntax for logical replication. I think what
you need to do is in node-1, create a partitioned table and subscribe
it on node-2. Now, perform an Update on node-1, then stop the logical
replication worker before it calls heap_lock_tuple. Now, in node-2,
update the same row such that it moves the row. Now, continue the
logical replication worker. I think it should hit your new code, if
not then we need to think of some other way.
I am able to hit the newly added LOG using the above steps. Thanks a lot for the
step-by-step guide; I really needed that.
One strange behavior I found in logical replication, which is reproducible
without the attached patch as well: when I update on node2 while keeping a
breakpoint before the heap_lock_tuple call in the replication worker, I can see
that a duplicate row gets inserted on node2, see this:
== NODE 1 ==
postgres=# insert into foo values(1, 'initial insert');
INSERT 0 1
postgres=# select tableoid::regclass, * from foo;
tableoid | a | b
----------+---+----------------
foo1 | 1 | initial insert
(1 row)
=== NODE 2 ==
postgres=# select tableoid::regclass, * from foo;
tableoid | a | b
----------+---+----------------
foo1 | 1 | initial insert
(1 row)
== NODE 1 ==
postgres=# update foo set a=2, b='node1_update' where a=1;
UPDATE 1
<---- BREAK POINT BEFORE heap_lock_tuple IN replication worker --->
== NODE 2 ==
postgres=# update foo set a=2, b='node2_update' where a=1;
<---- RELEASE BREAK POINT --->
postgres=# 2018-02-02 12:35:45.050 IST [91786] LOG: tuple to be
locked was already moved to another partition due to concurrent
update, retrying
postgres=# select tableoid::regclass, * from foo;
tableoid | a | b
----------+---+--------------
foo2 | 2 | node2_update
foo2 | 2 | node1_update
(2 rows)
== NODE 1 ==
postgres=# select tableoid::regclass, * from foo;
tableoid | a | b
----------+---+--------------
foo2 | 2 | node1_update
(1 row)
I am thinking of reporting this in a separate thread, but I am not sure whether
this is already known behaviour or not.
== schema to reproduce above case ==
-- node1
create table foo (a int2, b text) partition by list (a);
create table foo1 partition of foo for values IN (1);
create table foo2 partition of foo for values IN (2);
insert into foo values(1, 'initial insert');
CREATE PUBLICATION update_row_mov_pub FOR ALL TABLES;
ALTER TABLE foo REPLICA IDENTITY FULL;
ALTER TABLE foo1 REPLICA IDENTITY FULL;
ALTER TABLE foo2 REPLICA IDENTITY FULL;
-- node2
create table foo (a int2, b text) partition by list (a);
create table foo1 partition of foo for values IN (1);
create table foo2 partition of foo for values IN (2);
CREATE SUBSCRIPTION update_row_mov_sub CONNECTION 'host=localhost
dbname=postgres' PUBLICATION update_row_mov_pub;
== END==
Updated patch attached -- corrects the changes in execReplication.c.
Regards,
Amul Sul
Attachments:
0001-Invalidate-ip_blkid-v5-wip2.patchapplication/octet-stream; name=0001-Invalidate-ip_blkid-v5-wip2.patchDownload
From 17167b769f7b59f238fcc5b7e58dd4997a4f08f4 Mon Sep 17 00:00:00 2001
From: Amul Sul <sulamul@gmail.com>
Date: Wed, 24 Jan 2018 10:28:15 +0530
Subject: [PATCH 1/2] Invalidate ip_blkid v5-wip2
v5-wip2:
- Minor changes in RelationFindReplTupleByIndex() and
RelationFindReplTupleSeq()
- TODO;
Same as the previous
v5-wip: Update w.r.t Amit Kapila's comments[6].
- Reverted error message in nodeModifyTable.c from 'tuple to be locked'
to 'tuple to be updated'.
- TODO:
1. Yet to make a decision on having LOG/ELOG/ASSERT in the
RelationFindReplTupleByIndex() and RelationFindReplTupleSeq().
v4: Rebased on "UPDATE of partition key v35" patch[5].
v3: Update w.r.t Amit Kapila's[3] & Alvaro Herrera[4] comments
- typo in all error message and comment : "to an another" -> "to another"
- error message change : "tuple to be updated" -> "tuple to be locked"
- In ExecOnConflictUpdate(), error report converted into assert &
comments added.
v2: Updated w.r.t Robert review comments[2]
- Updated couple of comment of heap_delete argument and ItemPointerData
- Added same concurrent update error logic in ExecOnConflictUpdate,
RelationFindReplTupleByIndex and RelationFindReplTupleSeq
v1: Initial version -- as per Amit Kapila's suggestions[1]
- When tuple is being moved to another partition then ip_blkid in the
tuple header mark to InvalidBlockNumber.
-------------
References:
-------------
1] https://postgr.es/m/CAA4eK1KEZQ%2BCyXbBzfn1jFHoEfa_OemDLhLyy7xfD1QUZLo1DQ%40mail.gmail.com
2] https://postgr.es/m/CA%2BTgmoYY98AEjh7RDtuzaLC--_0smCozXRu6bFmZTaX5Ne%3DB5Q%40mail.gmail.com
3] https://postgr.es/m/CAA4eK1LQS6TmsGaEwR9HgF-9TZTHxrdAELuX6wOZBDbbjOfDjQ@mail.gmail.com
4] https://postgr.es/m/20171124160756.eyljpmpfzwd6jmnr@alvherre.pgsql
5] https://postgr.es/m/CAJ3gD9dixkkMzNnnP1CaZ1H17-U17ok_sVbjZZo+wnB=rJH6yg@mail.gmail.com
6] https://postgr.es/m/CAA4eK1LHVnNWYF53F1gUGx6CTxuvznozvU-Lr-dfE=Qeu1gEcg@mail.gmail.com
---
src/backend/access/heap/heapam.c | 13 +++++++++++--
src/backend/commands/trigger.c | 5 +++++
src/backend/executor/execMain.c | 4 ++++
src/backend/executor/execReplication.c | 26 ++++++++++++++++++--------
src/backend/executor/nodeLockRows.c | 5 +++++
src/backend/executor/nodeModifyTable.c | 28 ++++++++++++++++++++++++----
src/include/access/heapam.h | 2 +-
src/include/storage/itemptr.h | 4 +++-
8 files changed, 71 insertions(+), 16 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index be263850cd..f93d450416 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3017,6 +3017,8 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
* crosscheck - if not InvalidSnapshot, also check tuple against this
* wait - true if should wait for any conflicting update to commit/abort
* hufd - output parameter, filled in failure cases (see below)
+ * row_moved - true iff the tuple is being moved to another partition
+ * table due to an update of partition key. Otherwise, false.
*
* Normal, successful return value is HeapTupleMayBeUpdated, which
* actually means we did delete it. Failure return codes are
@@ -3032,7 +3034,7 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
HTSU_Result
heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd)
+ HeapUpdateFailureData *hufd, bool row_moved)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
@@ -3300,6 +3302,13 @@ l1:
/* Make sure there is no forward chain link in t_ctid */
tp.t_data->t_ctid = tp.t_self;
+ /*
+ * Set the block identifier to InvalidBlockNumber to indicate that the
+ * update has moved this tuple to another partition.
+ */
+ if (row_moved)
+ BlockIdSet(&((tp.t_data->t_ctid).ip_blkid), InvalidBlockNumber);
+
MarkBufferDirty(buffer);
/*
@@ -3425,7 +3434,7 @@ simple_heap_delete(Relation relation, ItemPointer tid)
result = heap_delete(relation, tid,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &hufd);
+ &hufd, false);
switch (result)
{
case HeapTupleSelfUpdated:
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 160d941c00..a770531e14 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -3071,6 +3071,11 @@ ltrmark:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
if (!ItemPointerEquals(&hufd.ctid, &tuple.t_self))
{
/* it was updated, so look at the updated version */
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 410921cc40..98e198f0b7 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2709,6 +2709,10 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
/* Should not encounter speculative tuple on recheck */
Assert(!HeapTupleHeaderIsSpeculative(tuple.t_data));
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 32891abbdf..a072e09390 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -190,10 +190,15 @@ retry:
case HeapTupleMayBeUpdated:
break;
case HeapTupleUpdated:
- /* XXX: Improve handling here */
- ereport(LOG,
- (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
- errmsg("concurrent update, retrying")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(LOG,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update, retrying")));
+ else
+ /* XXX: Improve handling here */
+ ereport(LOG,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("concurrent update, retrying")));
goto retry;
case HeapTupleInvisible:
elog(ERROR, "attempted to lock invisible tuple");
@@ -348,10 +353,15 @@ retry:
case HeapTupleMayBeUpdated:
break;
case HeapTupleUpdated:
- /* XXX: Improve handling here */
- ereport(LOG,
- (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
- errmsg("concurrent update, retrying")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(LOG,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update, retrying")));
+ else
+ /* XXX: Improve handling here */
+ ereport(LOG,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("concurrent update, retrying")));
goto retry;
case HeapTupleInvisible:
elog(ERROR, "attempted to lock invisible tuple");
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 7961b4be6a..b07b7092de 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -218,6 +218,11 @@ lnext:
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
if (ItemPointerEquals(&hufd.ctid, &tuple.t_self))
{
/* Tuple was deleted, so don't return it */
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 2a8ecbd830..3eec422adc 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -711,7 +711,8 @@ ExecDelete(ModifyTableState *mtstate,
EState *estate,
bool *tupleDeleted,
bool processReturning,
- bool canSetTag)
+ bool canSetTag,
+ bool row_moved)
{
ResultRelInfo *resultRelInfo;
Relation resultRelationDesc;
@@ -802,7 +803,8 @@ ldelete:;
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
- &hufd);
+ &hufd,
+ row_moved);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -848,6 +850,11 @@ ldelete:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to another partition due to concurrent update")));
+
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
TupleTableSlot *epqslot;
@@ -1150,7 +1157,7 @@ lreplace:;
* processing. We want to return rows from INSERT.
*/
ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
- &tuple_deleted, false, false);
+ &tuple_deleted, false, false, true);
/*
* For some reason if DELETE didn't happen (e.g. trigger prevented
@@ -1295,6 +1302,11 @@ lreplace:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to another partition due to concurrent update")));
+
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
TupleTableSlot *epqslot;
@@ -1465,6 +1477,14 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ /*
+ * As long as we don't support the UPDATE action of INSERT ON CONFLICT
+ * for a partitioned table, we shouldn't reach a case where the tuple to
+ * be locked has been moved to another partition due to a concurrent
+ * update of the partition key.
+ */
+ Assert(BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))));
+
/*
* Tell caller to try again from the very start.
*
@@ -2054,7 +2074,7 @@ ExecModifyTable(PlanState *pstate)
case CMD_DELETE:
slot = ExecDelete(node, tupleid, oldtuple, planSlot,
&node->mt_epqstate, estate,
- NULL, true, node->canSetTag);
+ NULL, true, node->canSetTag, false);
break;
default:
elog(ERROR, "unknown operation");
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4c0256b18a..44a211a740 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -156,7 +156,7 @@ extern void heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
CommandId cid, int options, BulkInsertState bistate);
extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd);
+ HeapUpdateFailureData *hufd, bool row_moved);
extern void heap_finish_speculative(Relation relation, HeapTuple tuple);
extern void heap_abort_speculative(Relation relation, HeapTuple tuple);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
diff --git a/src/include/storage/itemptr.h b/src/include/storage/itemptr.h
index 6c9ed3696b..79dceb414f 100644
--- a/src/include/storage/itemptr.h
+++ b/src/include/storage/itemptr.h
@@ -23,7 +23,9 @@
* This is a pointer to an item within a disk page of a known file
* (for example, a cross-link from an index to its parent table).
* blkid tells us which block, posid tells us which entry in the linp
- * (ItemIdData) array we want.
+ * (ItemIdData) array we want. blkid is marked InvalidBlockNumber when
+ * a tuple is moved to another partition relation due to an update of
+ * partition key.
*
* Note: because there is an item pointer in each tuple header and index
* tuple header on disk, it's very important not to waste space with
--
2.14.1
0002-isolation-tests-v3.patchapplication/octet-stream; name=0002-isolation-tests-v3.patchDownload
From 0068c000c8acad9865ae1e0331c88939683d43ea Mon Sep 17 00:00:00 2001
From: Amul Sul <sulamul@gmail.com>
Date: Fri, 12 Jan 2018 11:30:40 +0530
Subject: [PATCH 2/2] isolation tests v3
v3:
- Rebase on "UPDATE of partition key v35" patch[2] and
latest master head[3].
v2:
- Error message changed.
- Can't add isolation test[1] for
RelationFindReplTupleByIndex & RelationFindReplTupleSeq
- In ExecOnConflictUpdate, the error report is converted to assert
check.
v1:
Added isolation tests to hit an error in the following functions:
1. ExecUpdate -> specs/partition-key-update-1
2. ExecDelete -> specs/partition-key-update-1
3. GetTupleForTrigger -> specs/partition-key-update-2
4. ExecLockRows -> specs/partition-key-update-3
------------
TODOs:
------------
Tests for the following function yet to add.
1. EvalPlanQualFetch
------------
References:
------------
1] https://postgr.es/m/CA+TgmoYsMRo2PHFTGUFifv4ZSCZ9LNJASbOyb=9it2=UA4j4vw@mail.gmail.com
2] https://postgr.es/m/CAJ3gD9dixkkMzNnnP1CaZ1H17-U17ok_sVbjZZo+wnB=rJH6yg@mail.gmail.com
3] Commit id bdb70c12b3a2e69eec6e51411df60d9f43ecc841
---
.../isolation/expected/partition-key-update-1.out | 35 +++++++++++++++++++
.../isolation/expected/partition-key-update-2.out | 18 ++++++++++
.../isolation/expected/partition-key-update-3.out | 8 +++++
src/test/isolation/isolation_schedule | 3 ++
.../isolation/specs/partition-key-update-1.spec | 37 ++++++++++++++++++++
.../isolation/specs/partition-key-update-2.spec | 39 ++++++++++++++++++++++
.../isolation/specs/partition-key-update-3.spec | 30 +++++++++++++++++
7 files changed, 170 insertions(+)
create mode 100644 src/test/isolation/expected/partition-key-update-1.out
create mode 100644 src/test/isolation/expected/partition-key-update-2.out
create mode 100644 src/test/isolation/expected/partition-key-update-3.out
create mode 100644 src/test/isolation/specs/partition-key-update-1.spec
create mode 100644 src/test/isolation/specs/partition-key-update-2.spec
create mode 100644 src/test/isolation/specs/partition-key-update-3.spec
diff --git a/src/test/isolation/expected/partition-key-update-1.out b/src/test/isolation/expected/partition-key-update-1.out
new file mode 100644
index 0000000000..56bf4450b0
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-1.out
@@ -0,0 +1,35 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u s1c s2u
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+step s2u: UPDATE foo SET b='EFG' WHERE a=1;
+
+starting permutation: s1u s2u s1c
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s2u: UPDATE foo SET b='EFG' WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2u: <... completed>
+error in steps s1c s2u: ERROR: tuple to be updated was already moved to another partition due to concurrent update
+
+starting permutation: s2u s1u s1c
+step s2u: UPDATE foo SET b='EFG' WHERE a=1;
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+
+starting permutation: s1u s1c s2d
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+step s2d: DELETE FROM foo WHERE a=1;
+
+starting permutation: s1u s2d s1c
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s2d: DELETE FROM foo WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2d: <... completed>
+error in steps s1c s2d: ERROR: tuple to be updated was already moved to another partition due to concurrent update
+
+starting permutation: s2d s1u s1c
+step s2d: DELETE FROM foo WHERE a=1;
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-2.out b/src/test/isolation/expected/partition-key-update-2.out
new file mode 100644
index 0000000000..195ec4cedf
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-2.out
@@ -0,0 +1,18 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u s1c s2u
+step s1u: UPDATE foo SET b='EFG' WHERE a=1;
+step s1c: COMMIT;
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1;
+
+starting permutation: s1u s2u s1c
+step s1u: UPDATE foo SET b='EFG' WHERE a=1;
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2u: <... completed>
+error in steps s1c s2u: ERROR: tuple to be locked was already moved to another partition due to concurrent update
+
+starting permutation: s2u s1u s1c
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1;
+step s1u: UPDATE foo SET b='EFG' WHERE a=1;
+step s1c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-3.out b/src/test/isolation/expected/partition-key-update-3.out
new file mode 100644
index 0000000000..1922bdce46
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-3.out
@@ -0,0 +1,8 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u3 s2i s1c
+step s1u3: UPDATE foo_r SET a=11 WHERE a=7 AND b = 'ABC';
+step s2i: INSERT INTO bar VALUES(7); <waiting ...>
+step s1c: COMMIT;
+step s2i: <... completed>
+error in steps s1c s2i: ERROR: tuple to be locked was already moved to another partition due to concurrent update
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 74d7d59546..9bda495de3 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -66,3 +66,6 @@ test: async-notify
test: vacuum-reltuples
test: timeouts
test: vacuum-concurrent-drop
+test: partition-key-update-1
+test: partition-key-update-2
+test: partition-key-update-3
diff --git a/src/test/isolation/specs/partition-key-update-1.spec b/src/test/isolation/specs/partition-key-update-1.spec
new file mode 100644
index 0000000000..db76c9a9b5
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-1.spec
@@ -0,0 +1,37 @@
+# Concurrency error from ExecUpdate and ExecDelete.
+
+# Throw an error to indicate that the targeted row has already been moved to
+# another partition, in the concurrent case where a session trying to
+# update/delete a row waits on a concurrent update by another session that
+# moves the tuple to another partition due to an update of the partition key.
+
+setup
+{
+ CREATE TABLE foo (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'ABC');
+}
+
+teardown
+{
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN; }
+step "s1u" { UPDATE foo SET a=2 WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2u" { UPDATE foo SET b='EFG' WHERE a=1; }
+step "s2d" { DELETE FROM foo WHERE a=1; }
+
+permutation "s1u" "s1c" "s2u"
+permutation "s1u" "s2u" "s1c"
+permutation "s2u" "s1u" "s1c"
+
+permutation "s1u" "s1c" "s2d"
+permutation "s1u" "s2d" "s1c"
+permutation "s2d" "s1u" "s1c"
diff --git a/src/test/isolation/specs/partition-key-update-2.spec b/src/test/isolation/specs/partition-key-update-2.spec
new file mode 100644
index 0000000000..b09e76ce21
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-2.spec
@@ -0,0 +1,39 @@
+# Concurrency error from GetTupleForTrigger
+
+# Like partition-key-update-1.spec, throw an error when a session tries to
+# update a row that has been moved to another partition due to a concurrent
+# update by another session.
+
+setup
+{
+ CREATE TABLE foo (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'ABC');
+ CREATE FUNCTION func_foo_mod_a() RETURNS TRIGGER AS $$
+ BEGIN
+ NEW.a = 2; -- This is changing partition key column.
+ RETURN NEW;
+ END $$ LANGUAGE PLPGSQL;
+ CREATE TRIGGER foo_mod_a BEFORE UPDATE ON foo1
+ FOR EACH ROW EXECUTE PROCEDURE func_foo_mod_a();
+}
+
+teardown
+{
+ DROP TRIGGER foo_mod_a ON foo1;
+ DROP FUNCTION func_foo_mod_a();
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN; }
+step "s1u" { UPDATE foo SET b='EFG' WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2u" { UPDATE foo SET b='XYZ' WHERE a=1; }
+
+permutation "s1u" "s1c" "s2u"
+permutation "s1u" "s2u" "s1c"
+permutation "s2u" "s1u" "s1c"
diff --git a/src/test/isolation/specs/partition-key-update-3.spec b/src/test/isolation/specs/partition-key-update-3.spec
new file mode 100644
index 0000000000..c1f547d9ba
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-3.spec
@@ -0,0 +1,30 @@
+# Concurrency error from ExecLockRows
+
+# Like partition-key-update-1.spec, throw an error when a session tries to
+# lock a row that has been moved to another partition due to a concurrent
+# update by another session.
+
+setup
+{
+ CREATE TABLE foo_r (a int, b text) PARTITION BY RANGE(a);
+ CREATE TABLE foo_r1 PARTITION OF foo_r FOR VALUES FROM (1) TO (10);
+ CREATE TABLE foo_r2 PARTITION OF foo_r FOR VALUES FROM (10) TO (20);
+ INSERT INTO foo_r VALUES(7, 'ABC');
+ CREATE UNIQUE INDEX foo_r1_a_unique ON foo_r1 (a);
+ CREATE TABLE bar (a int REFERENCES foo_r1(a));
+}
+
+teardown
+{
+ DROP TABLE bar, foo_r;
+}
+
+session "s1"
+setup { BEGIN; }
+step "s1u3" { UPDATE foo_r SET a=11 WHERE a=7 AND b = 'ABC'; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2i" { INSERT INTO bar VALUES(7); }
+
+permutation "s1u3" "s2i" "s1c"
--
2.14.1
On Fri, Jan 26, 2018 at 1:28 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
So, this means that in case of logical replication, it won't generate
the error this patch is trying to introduce. I think if we want to
handle this we need some changes in WAL and logical decoding as well.
Robert, others, what do you think? I am not very comfortable leaving
this unaddressed, if we don't want to do anything about it, at least
we should document it.
As I said on the other thread, I'm not sure how reasonable it really
is to try to do anything about this. For both the issue you raised
there, I think we'd need to introduce a new WAL record type that
represents a delete from one table and an insert to another that
should be considered as a single operation. I'm not keen on that idea,
but you can make an argument that it's the Right Thing To Do. I would
be more inclined, at least for v11, to just document that the
delete+insert will be replayed separately on replicas.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Feb 2, 2018 at 2:11 PM, amul sul <sulamul@gmail.com> wrote:
On Fri, Jan 26, 2018 at 11:58 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
[....]
I think you can manually (via debugger) hit this by using
PUBLICATION/SUBSCRIPTION syntax for logical replication. I think what
you need to do is in node-1, create a partitioned table and subscribe
it on node-2. Now, perform an Update on node-1, then stop the logical
replication worker before it calls heap_lock_tuple. Now, in node-2,
update the same row such that it moves the row. Now, continue the
logical replication worker. I think it should hit your new code, if
not then we need to think of some other way.
I am able to hit the change log using above steps. Thanks a lot for the
step by step guide, I really needed that.
One strange behavior I found in the logical replication which is reproducible
without attached patch as well -- when I have updated on node2 by keeping
breakpoint before the heap_lock_tuple call in replication worker, I can see
a duplicate row was inserted on the node2, see this:
..
I am thinking to report this in a separate thread, but not sure if
this is already known behaviour or not.
I think it is worth discussing this behavior in a separate thread.
However, if possible, try to reproduce it without partitioning and
then report it.
Updated patch attached -- correct changes in execReplication.c.
Your changes look correct to me.
I wonder what will be the behavior of this patch with
wal_consistency_checking [1]. I think it will generate a failure as
there is nothing in WAL to replay it. Can you once try it? If we see
a failure with wal consistency checker, then we need to think whether
(a) we want to deal with it by logging this information, or (b) do we
want to mask it or (c) something else?
[1]: https://www.postgresql.org/docs/devel/static/runtime-config-developer.html
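For a quick test, one way to enable it is on the primary (only a sketch; the
verification information has to be included when the WAL is generated, and
the check then happens wherever that WAL is replayed, e.g. on the standby):
ALTER SYSTEM SET wal_consistency_checking = 'heap';
SELECT pg_reload_conf();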
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sun, Feb 4, 2018 at 10:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Feb 2, 2018 at 2:11 PM, amul sul <sulamul@gmail.com> wrote:
On Fri, Jan 26, 2018 at 11:58 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
[....]
I think you can manually (via debugger) hit this by using
PUBLICATION/SUBSCRIPTION syntax for logical replication. I think what
you need to do is in node-1, create a partitioned table and subscribe
it on node-2. Now, perform an Update on node-1, then stop the logical
replication worker before it calls heap_lock_tuple. Now, in node-2,
update the same row such that it moves the row. Now, continue the
logical replication worker. I think it should hit your new code, if
not then we need to think of some other way.
I am able to hit the change log using above steps. Thanks a lot for the
step by step guide, I really needed that.
One strange behavior I found in the logical replication which is reproducible
without attached patch as well -- when I have updated on node2 by keeping
breakpoint before the heap_lock_tuple call in replication worker, I can see
a duplicate row was inserted on the node2, see this:..
I am thinking to report this in a separate thread, but not sure if
this is already known behaviour or not.
I think it is worth to discuss this behavior in a separate thread.
However, if possible, try to reproduce it without partitioning and
then report it.
Logical replication behavior for a normal table is as expected; this happens
only with a partitioned table. I will start a new thread for this on -hackers.
Updated patch attached -- correct changes in execReplication.c.
Your changes look correct to me.
I wonder what will be the behavior of this patch with
wal_consistency_checking [1]. I think it will generate a failure as
there is nothing in WAL to replay it. Can you once try it? If we see
a failure with wal consistency checker, then we need to think whether
(a) we want to deal with it by logging this information, or (b) do we
want to mask it or (c) something else?
[1] - https://www.postgresql.org/docs/devel/static/runtime-config-developer.html
Yes, you are correct, the standby stopped with the following error:
FATAL: inconsistent page found, rel 1663/13260/16390, forknum 0, blkno 0
CONTEXT: WAL redo at 0/3002510 for Heap/DELETE: off 6 KEYS_UPDATED
LOG: startup process (PID 22791) exited with exit code 1
LOG: terminating any other active server processes
LOG: database system is shut down
I have tested a warm standby replication setup using the attached script.
Without the wal_consistency_checking setting it works fine and data from the
master to the standby is replicated as expected. If this guarantee is enough,
then I think we could skip this error in the WAL consistency check for such
deleted tuples (I guess option b that you have suggested), thoughts?
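For example, with the foo schema used earlier in this thread, running the
same query on the master and on the standby after the partition-key update
shows the moved row in foo2 on both (just a sanity check, not part of the
patch):
select tableoid::regclass, * from foo;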
Attachments:
On Tue, Feb 6, 2018 at 7:05 PM, amul sul <sulamul@gmail.com> wrote:
On Sun, Feb 4, 2018 at 10:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Feb 2, 2018 at 2:11 PM, amul sul <sulamul@gmail.com> wrote:
On Fri, Jan 26, 2018 at 11:58 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
[....]
I think you can manually (via debugger) hit this by using
PUBLICATION/SUBSCRIPTION syntax for logical replication. I think what
you need to do is in node-1, create a partitioned table and subscribe
it on node-2. Now, perform an Update on node-1, then stop the logical
replication worker before it calls heap_lock_tuple. Now, in node-2,
update the same row such that it moves the row. Now, continue the
logical replication worker. I think it should hit your new code, if
not then we need to think of some other way.
I am able to hit the change log using above steps. Thanks a lot for the
step by step guide, I really needed that.
One strange behavior I found in the logical replication which is reproducible
without attached patch as well -- when I have updated on node2 by keeping
breakpoint before the heap_lock_tuple call in replication worker, I can see
a duplicate row was inserted on the node2, see this:..
I am thinking to report this in a separate thread, but not sure if
this is already known behaviour or not.
I think it is worth to discuss this behavior in a separate thread.
However, if possible, try to reproduce it without partitioning and
then report it.
Logical replication behavior for the normal table is as expected, this happens
only with partition table, will start a new thread for this on hacker.
Posted on hackers :
/messages/by-id/CAAJ_b94bYxLsX0erZXVH-anQPbWqcYUPWX4xVRa1YJY=Ph60ZQ@mail.gmail.com
Regards,
Amul Sul
On Sat, Feb 3, 2018 at 4:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jan 26, 2018 at 1:28 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
So, this means that in case of logical replication, it won't generate
the error this patch is trying to introduce. I think if we want to
handle this we need some changes in WAL and logical decoding as well.
Robert, others, what do you think? I am not very comfortable leaving
this unaddressed, if we don't want to do anything about it, at least
we should document it.
As I said on the other thread, I'm not sure how reasonable it really
is to try to do anything about this. For both the issue you raised
there, I think we'd need to introduce a new WAL record type that
represents a delete from one table and an insert to another that
should be considered as a single operation.
I think to solve the issue in this thread, a flag should be sufficient,
one that logical replication can use to recognize Deletes whose CTID block
number was set to InvalidBlockNumber.
I'm not keen on that idea,
but you can make an argument that it's the Right Thing To Do. I would
be more inclined, at least for v11, to just document that the
delete+insert will be replayed separately on replicas.
Even if we do what you are suggesting, we need something in WAL
(probably a flag to indicate this special type of Delete), otherwise,
wal consistency checker will fail. Another idea would be to mask the
ctid change so that wal consistency checker doesn't cry.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Feb 6, 2018 at 7:05 PM, amul sul <sulamul@gmail.com> wrote:
On Sun, Feb 4, 2018 at 10:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Feb 2, 2018 at 2:11 PM, amul sul <sulamul@gmail.com> wrote:
On Fri, Jan 26, 2018 at 11:58 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
[....]
I wonder what will be the behavior of this patch with
wal_consistency_checking [1]. I think it will generate a failure as
there is nothing in WAL to replay it. Can you once try it? If we see
a failure with wal consistency checker, then we need to think whether
(a) we want to deal with it by logging this information, or (b) do we
want to mask it or (c) something else?[1] - https://www.postgresql.org/docs/devel/static/runtime-config-developer.html
Yes, you are correct standby stopped with a following error:
FATAL: inconsistent page found, rel 1663/13260/16390, forknum 0, blkno 0
CONTEXT: WAL redo at 0/3002510 for Heap/DELETE: off 6 KEYS_UPDATED
LOG: startup process (PID 22791) exited with exit code 1
LOG: terminating any other active server processes
LOG: database system is shut down
I have tested warm standby replication setup using attached script. Without
wal_consistency_checking setting, it works fine & data from master to standby is
replicated as expected, if this guaranty is enough then I think could skip this
error from wal consistent check for such deleted tuple (I guess option
b that you have suggested), thoughts?
I tried to mask ctid.ip_blkid when it is set to InvalidBlockNumber, with the
following change in heap_mask:
------------- PATCH -------------
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 682f4f07a8..e7c011f9a5 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -9323,6 +9323,10 @@ heap_mask(char *pagedata, BlockNumber blkno)
*/
if (HeapTupleHeaderIsSpeculative(page_htup))
ItemPointerSet(&page_htup->t_ctid, blkno, off);
+
+ /* TODO : comments ? */
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber((&((page_htup->t_ctid).ip_blkid)))))
+ BlockIdSet(&((page_htup->t_ctid).ip_blkid), blkno);
}
/*
------------- END -------------
Test script[1] works as expected with this change but I don't have much
confidence in it due to my limited knowledge of the wal_consistency_checking
routine. Any suggestions/comments will be much appreciated, thanks!
[1]: /messages/by-id/CAAJ_b94_29wiUA83W8LQjtfjv9XNV=+PT8+ioWRPjnnFHe3eqw@mail.gmail.com
Regards,
Amul
On Wed, Feb 7, 2018 at 6:13 PM, amul sul <sulamul@gmail.com> wrote:
On Tue, Feb 6, 2018 at 7:05 PM, amul sul <sulamul@gmail.com> wrote:
On Sun, Feb 4, 2018 at 10:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Yes, you are correct standby stopped with a following error:
FATAL: inconsistent page found, rel 1663/13260/16390, forknum 0, blkno 0
CONTEXT: WAL redo at 0/3002510 for Heap/DELETE: off 6 KEYS_UPDATED
LOG: startup process (PID 22791) exited with exit code 1
LOG: terminating any other active server processes
LOG: database system is shut down
I have tested warm standby replication setup using attached script. Without
wal_consistency_checking setting, it works fine & data from master to standby is
replicated as expected, if this guaranty is enough then I think could skip this
error from wal consistent check for such deleted tuple (I guess option
b that you have suggested), thoughts?
I tried to mask ctid.ip_blkid if it is set to InvalidBlockId with
following change in heap_mask:
Your change appears fine to me. I think one can set both block number
and offset as we do for HeapTupleHeaderIsSpeculative, but the way you
have done it looks good to me. Kindly include it in the next version
of your patch by adding the missing comment.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Feb 13, 2018 at 11:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Feb 7, 2018 at 6:13 PM, amul sul <sulamul@gmail.com> wrote:
On Tue, Feb 6, 2018 at 7:05 PM, amul sul <sulamul@gmail.com> wrote:
On Sun, Feb 4, 2018 at 10:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Yes, you are correct standby stopped with a following error:
FATAL: inconsistent page found, rel 1663/13260/16390, forknum 0, blkno 0
CONTEXT: WAL redo at 0/3002510 for Heap/DELETE: off 6 KEYS_UPDATED
LOG: startup process (PID 22791) exited with exit code 1
LOG: terminating any other active server processes
LOG: database system is shut down
I have tested warm standby replication setup using attached script. Without
wal_consistency_checking setting, it works fine & data from master to standby is
replicated as expected, if this guaranty is enough then I think could skip this
error from wal consistent check for such deleted tuple (I guess option
b that you have suggested), thoughts?
I tried to mask ctid.ip_blkid if it is set to InvalidBlockId with
following change in heap_mask:
Your change appears fine to me. I think one can set both block number
and offset as we do for HeapTupleHeaderIsSpeculative, but the way you
have done it looks good to me. Kindly include it in the next version
of your patch by adding the missing comment.
Thanks for the confirmation, updated patch attached.
Regards,
Amul
Attachments:
0001-Invalidate-ip_blkid-v5.patchapplication/octet-stream; name=0001-Invalidate-ip_blkid-v5.patchDownload
From 08c8c7ece7d9411e70a780dbeed89d81419db6b6 Mon Sep 17 00:00:00 2001
From: Amul Sul <sulamul@gmail.com>
Date: Tue, 13 Feb 2018 12:37:52 +0530
Subject: [PATCH 1/2] Invalidate ip_blkid v5
v5:
- Added code in heap_mask to skip wal_consistency_checking[7]
- Fixed previous todos.
v5-wip2:
- Minor changes in RelationFindReplTupleByIndex() and
RelationFindReplTupleSeq()
- TODO;
Same as the previous
v5-wip: Update w.r.t Amit Kapila's comments[6].
- Reverted error message in nodeModifyTable.c from 'tuple to be locked'
to 'tuple to be updated'.
- TODO:
1. Yet to make a decision on having LOG/ELOG/ASSERT in the
RelationFindReplTupleByIndex() and RelationFindReplTupleSeq().
v4: Rebased on "UPDATE of partition key v35" patch[5].
v3: Update w.r.t Amit Kapila's[3] & Alvaro Herrera[4] comments
- typo in all error message and comment : "to an another" -> "to another"
- error message change : "tuple to be updated" -> "tuple to be locked"
- In ExecOnConflictUpdate(), error report converted into assert &
comments added.
v2: Updated w.r.t Robert review comments[2]
- Updated couple of comment of heap_delete argument and ItemPointerData
- Added same concurrent update error logic in ExecOnConflictUpdate,
RelationFindReplTupleByIndex and RelationFindReplTupleSeq
v1: Initial version -- as per Amit Kapila's suggestions[1]
- When tuple is being moved to another partition then ip_blkid in the
tuple header mark to InvalidBlockNumber.
-------------
References:
-------------
1] https://postgr.es/m/CAA4eK1KEZQ%2BCyXbBzfn1jFHoEfa_OemDLhLyy7xfD1QUZLo1DQ%40mail.gmail.com
2] https://postgr.es/m/CA%2BTgmoYY98AEjh7RDtuzaLC--_0smCozXRu6bFmZTaX5Ne%3DB5Q%40mail.gmail.com
3] https://postgr.es/m/CAA4eK1LQS6TmsGaEwR9HgF-9TZTHxrdAELuX6wOZBDbbjOfDjQ@mail.gmail.com
4] https://postgr.es/m/20171124160756.eyljpmpfzwd6jmnr@alvherre.pgsql
5] https://postgr.es/m/CAJ3gD9dixkkMzNnnP1CaZ1H17-U17ok_sVbjZZo+wnB=rJH6yg@mail.gmail.com
6] https://postgr.es/m/CAA4eK1LHVnNWYF53F1gUGx6CTxuvznozvU-Lr-dfE=Qeu1gEcg@mail.gmail.com
7] https://postgr.es/m/CAAJ_b94_29wiUA83W8LQjtfjv9XNV=+PT8+ioWRPjnnFHe3eqw@mail.gmail.com
---
src/backend/access/heap/heapam.c | 25 +++++++++++++++++++++++--
src/backend/commands/trigger.c | 5 +++++
src/backend/executor/execMain.c | 4 ++++
src/backend/executor/execReplication.c | 26 ++++++++++++++++++--------
src/backend/executor/nodeLockRows.c | 5 +++++
src/backend/executor/nodeModifyTable.c | 28 ++++++++++++++++++++++++----
src/include/access/heapam.h | 2 +-
src/include/storage/itemptr.h | 4 +++-
8 files changed, 83 insertions(+), 16 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8a846e7dba..f4560ee9cb 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3037,6 +3037,8 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
* crosscheck - if not InvalidSnapshot, also check tuple against this
* wait - true if should wait for any conflicting update to commit/abort
* hufd - output parameter, filled in failure cases (see below)
+ * row_moved - true iff the tuple is being moved to another partition
+ * table due to an update of partition key. Otherwise, false.
*
* Normal, successful return value is HeapTupleMayBeUpdated, which
* actually means we did delete it. Failure return codes are
@@ -3052,7 +3054,7 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
HTSU_Result
heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd)
+ HeapUpdateFailureData *hufd, bool row_moved)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
@@ -3320,6 +3322,13 @@ l1:
/* Make sure there is no forward chain link in t_ctid */
tp.t_data->t_ctid = tp.t_self;
+ /*
+ * Set the block identifier to InvalidBlockNumber to indicate that the
+ * update has moved this tuple to another partition.
+ */
+ if (row_moved)
+ BlockIdSet(&((tp.t_data->t_ctid).ip_blkid), InvalidBlockNumber);
+
MarkBufferDirty(buffer);
/*
@@ -3445,7 +3454,7 @@ simple_heap_delete(Relation relation, ItemPointer tid)
result = heap_delete(relation, tid,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &hufd);
+ &hufd, false);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -9314,6 +9323,18 @@ heap_mask(char *pagedata, BlockNumber blkno)
*/
if (HeapTupleHeaderIsSpeculative(page_htup))
ItemPointerSet(&page_htup->t_ctid, blkno, off);
+
+ /*
+ * For a deleted tuple, a block identifier is set to the
+ * InvalidBlockNumber to indicate that the tuple has been moved to
+ * another partition due to an update of partition key.
+ *
+ * As with a speculative tuple, set the block identifier to the current
+ * block number so that this is not reported as an inconsistency.
+ */
+ if (!BlockNumberIsValid(
+ BlockIdGetBlockNumber((&((page_htup->t_ctid).ip_blkid)))))
+ BlockIdSet(&((page_htup->t_ctid).ip_blkid), blkno);
}
/*
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 160d941c00..a770531e14 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -3071,6 +3071,11 @@ ltrmark:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
if (!ItemPointerEquals(&hufd.ctid, &tuple.t_self))
{
/* it was updated, so look at the updated version */
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 5d3e923cca..4c68b114d4 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2712,6 +2712,10 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
/* Should not encounter speculative tuple on recheck */
Assert(!HeapTupleHeaderIsSpeculative(tuple.t_data));
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 32891abbdf..a072e09390 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -190,10 +190,15 @@ retry:
case HeapTupleMayBeUpdated:
break;
case HeapTupleUpdated:
- /* XXX: Improve handling here */
- ereport(LOG,
- (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
- errmsg("concurrent update, retrying")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(LOG,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update, retrying")));
+ else
+ /* XXX: Improve handling here */
+ ereport(LOG,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("concurrent update, retrying")));
goto retry;
case HeapTupleInvisible:
elog(ERROR, "attempted to lock invisible tuple");
@@ -348,10 +353,15 @@ retry:
case HeapTupleMayBeUpdated:
break;
case HeapTupleUpdated:
- /* XXX: Improve handling here */
- ereport(LOG,
- (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
- errmsg("concurrent update, retrying")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(LOG,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update, retrying")));
+ else
+ /* XXX: Improve handling here */
+ ereport(LOG,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("concurrent update, retrying")));
goto retry;
case HeapTupleInvisible:
elog(ERROR, "attempted to lock invisible tuple");
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 7961b4be6a..b07b7092de 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -218,6 +218,11 @@ lnext:
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
if (ItemPointerEquals(&hufd.ctid, &tuple.t_self))
{
/* Tuple was deleted, so don't return it */
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 2a8ecbd830..3eec422adc 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -711,7 +711,8 @@ ExecDelete(ModifyTableState *mtstate,
EState *estate,
bool *tupleDeleted,
bool processReturning,
- bool canSetTag)
+ bool canSetTag,
+ bool row_moved)
{
ResultRelInfo *resultRelInfo;
Relation resultRelationDesc;
@@ -802,7 +803,8 @@ ldelete:;
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
- &hufd);
+ &hufd,
+ row_moved);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -848,6 +850,11 @@ ldelete:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to another partition due to concurrent update")));
+
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
TupleTableSlot *epqslot;
@@ -1150,7 +1157,7 @@ lreplace:;
* processing. We want to return rows from INSERT.
*/
ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
- &tuple_deleted, false, false);
+ &tuple_deleted, false, false, true);
/*
* For some reason if DELETE didn't happen (e.g. trigger prevented
@@ -1295,6 +1302,11 @@ lreplace:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to another partition due to concurrent update")));
+
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
TupleTableSlot *epqslot;
@@ -1465,6 +1477,14 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ /*
+ * As long as we don't support the UPDATE action of INSERT ON CONFLICT
+ * for a partitioned table, we shouldn't reach a case where the tuple to
+ * be locked has been moved to another partition due to a concurrent
+ * update of the partition key.
+ */
+ Assert(BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))));
+
/*
* Tell caller to try again from the very start.
*
@@ -2054,7 +2074,7 @@ ExecModifyTable(PlanState *pstate)
case CMD_DELETE:
slot = ExecDelete(node, tupleid, oldtuple, planSlot,
&node->mt_epqstate, estate,
- NULL, true, node->canSetTag);
+ NULL, true, node->canSetTag, false);
break;
default:
elog(ERROR, "unknown operation");
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4c0256b18a..44a211a740 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -156,7 +156,7 @@ extern void heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
CommandId cid, int options, BulkInsertState bistate);
extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd);
+ HeapUpdateFailureData *hufd, bool row_moved);
extern void heap_finish_speculative(Relation relation, HeapTuple tuple);
extern void heap_abort_speculative(Relation relation, HeapTuple tuple);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
diff --git a/src/include/storage/itemptr.h b/src/include/storage/itemptr.h
index 6c9ed3696b..79dceb414f 100644
--- a/src/include/storage/itemptr.h
+++ b/src/include/storage/itemptr.h
@@ -23,7 +23,9 @@
* This is a pointer to an item within a disk page of a known file
* (for example, a cross-link from an index to its parent table).
* blkid tells us which block, posid tells us which entry in the linp
- * (ItemIdData) array we want.
+ * (ItemIdData) array we want. blkid is marked InvalidBlockNumber when
+ * a tuple is moved to another partition relation due to an update of
+ * partition key.
*
* Note: because there is an item pointer in each tuple header and index
* tuple header on disk, it's very important not to waste space with
--
2.14.1
0002-isolation-tests-v4.patchapplication/octet-stream; name=0002-isolation-tests-v4.patchDownload
From 226ec72269cf4bebb6576f23e7b88dfabe6aea16 Mon Sep 17 00:00:00 2001
From: Amul Sul <sulamul@gmail.com>
Date: Tue, 13 Feb 2018 12:37:33 +0530
Subject: [PATCH 2/2] isolation tests v4
v4:
- Rebased on Invalidate ip_blkid v5.
v3:
- Rebase on "UPDATE of partition key v35" patch[2] and
latest master head[3].
v2:
- Error message changed.
- Can't add isolation test[1] for
RelationFindReplTupleByIndex & RelationFindReplTupleSeq
- In ExecOnConflictUpdate, the error report is converted to assert
check.
v1:
Added isolation tests to hit an error in the following functions:
1. ExecUpdate -> specs/partition-key-update-1
2. ExecDelete -> specs/partition-key-update-1
3. GetTupleForTrigger -> specs/partition-key-update-2
4. ExecLockRows -> specs/partition-key-update-3
------------
References:
------------
1] https://postgr.es/m/CA+TgmoYsMRo2PHFTGUFifv4ZSCZ9LNJASbOyb=9it2=UA4j4vw@mail.gmail.com
2] https://postgr.es/m/CAJ3gD9dixkkMzNnnP1CaZ1H17-U17ok_sVbjZZo+wnB=rJH6yg@mail.gmail.com
3] Commit id bdb70c12b3a2e69eec6e51411df60d9f43ecc841
---
.../isolation/expected/partition-key-update-1.out | 35 +++++++++++++++++++
.../isolation/expected/partition-key-update-2.out | 18 ++++++++++
.../isolation/expected/partition-key-update-3.out | 8 +++++
src/test/isolation/isolation_schedule | 3 ++
.../isolation/specs/partition-key-update-1.spec | 37 ++++++++++++++++++++
.../isolation/specs/partition-key-update-2.spec | 39 ++++++++++++++++++++++
.../isolation/specs/partition-key-update-3.spec | 30 +++++++++++++++++
7 files changed, 170 insertions(+)
create mode 100644 src/test/isolation/expected/partition-key-update-1.out
create mode 100644 src/test/isolation/expected/partition-key-update-2.out
create mode 100644 src/test/isolation/expected/partition-key-update-3.out
create mode 100644 src/test/isolation/specs/partition-key-update-1.spec
create mode 100644 src/test/isolation/specs/partition-key-update-2.spec
create mode 100644 src/test/isolation/specs/partition-key-update-3.spec
diff --git a/src/test/isolation/expected/partition-key-update-1.out b/src/test/isolation/expected/partition-key-update-1.out
new file mode 100644
index 0000000000..56bf4450b0
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-1.out
@@ -0,0 +1,35 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u s1c s2u
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+step s2u: UPDATE foo SET b='EFG' WHERE a=1;
+
+starting permutation: s1u s2u s1c
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s2u: UPDATE foo SET b='EFG' WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2u: <... completed>
+error in steps s1c s2u: ERROR: tuple to be updated was already moved to another partition due to concurrent update
+
+starting permutation: s2u s1u s1c
+step s2u: UPDATE foo SET b='EFG' WHERE a=1;
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+
+starting permutation: s1u s1c s2d
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+step s2d: DELETE FROM foo WHERE a=1;
+
+starting permutation: s1u s2d s1c
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s2d: DELETE FROM foo WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2d: <... completed>
+error in steps s1c s2d: ERROR: tuple to be updated was already moved to another partition due to concurrent update
+
+starting permutation: s2d s1u s1c
+step s2d: DELETE FROM foo WHERE a=1;
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-2.out b/src/test/isolation/expected/partition-key-update-2.out
new file mode 100644
index 0000000000..195ec4cedf
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-2.out
@@ -0,0 +1,18 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u s1c s2u
+step s1u: UPDATE foo SET b='EFG' WHERE a=1;
+step s1c: COMMIT;
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1;
+
+starting permutation: s1u s2u s1c
+step s1u: UPDATE foo SET b='EFG' WHERE a=1;
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2u: <... completed>
+error in steps s1c s2u: ERROR: tuple to be locked was already moved to another partition due to concurrent update
+
+starting permutation: s2u s1u s1c
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1;
+step s1u: UPDATE foo SET b='EFG' WHERE a=1;
+step s1c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-3.out b/src/test/isolation/expected/partition-key-update-3.out
new file mode 100644
index 0000000000..1922bdce46
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-3.out
@@ -0,0 +1,8 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u3 s2i s1c
+step s1u3: UPDATE foo_r SET a=11 WHERE a=7 AND b = 'ABC';
+step s2i: INSERT INTO bar VALUES(7); <waiting ...>
+step s1c: COMMIT;
+step s2i: <... completed>
+error in steps s1c s2i: ERROR: tuple to be locked was already moved to another partition due to concurrent update
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 74d7d59546..9bda495de3 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -66,3 +66,6 @@ test: async-notify
test: vacuum-reltuples
test: timeouts
test: vacuum-concurrent-drop
+test: partition-key-update-1
+test: partition-key-update-2
+test: partition-key-update-3
diff --git a/src/test/isolation/specs/partition-key-update-1.spec b/src/test/isolation/specs/partition-key-update-1.spec
new file mode 100644
index 0000000000..db76c9a9b5
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-1.spec
@@ -0,0 +1,37 @@
+# Concurrency error from ExecUpdate and ExecDelete.
+
+# Throw an error to indicate that the targeted row has already been moved to
+# another partition, in the concurrent case where a session trying to
+# update/delete a row waits on a concurrent update by another session that
+# moves the tuple to another partition due to an update of the partition key.
+
+setup
+{
+ CREATE TABLE foo (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'ABC');
+}
+
+teardown
+{
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN; }
+step "s1u" { UPDATE foo SET a=2 WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2u" { UPDATE foo SET b='EFG' WHERE a=1; }
+step "s2d" { DELETE FROM foo WHERE a=1; }
+
+permutation "s1u" "s1c" "s2u"
+permutation "s1u" "s2u" "s1c"
+permutation "s2u" "s1u" "s1c"
+
+permutation "s1u" "s1c" "s2d"
+permutation "s1u" "s2d" "s1c"
+permutation "s2d" "s1u" "s1c"
diff --git a/src/test/isolation/specs/partition-key-update-2.spec b/src/test/isolation/specs/partition-key-update-2.spec
new file mode 100644
index 0000000000..b09e76ce21
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-2.spec
@@ -0,0 +1,39 @@
+# Concurrency error from GetTupleForTrigger
+
+# Like partition-key-update-1.spec, throw an error when a session tries to
+# update a row that has been moved to another partition due to a concurrent
+# update by another session.
+
+setup
+{
+ CREATE TABLE foo (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'ABC');
+ CREATE FUNCTION func_foo_mod_a() RETURNS TRIGGER AS $$
+ BEGIN
+ NEW.a = 2; -- This is changing partition key column.
+ RETURN NEW;
+ END $$ LANGUAGE PLPGSQL;
+ CREATE TRIGGER foo_mod_a BEFORE UPDATE ON foo1
+ FOR EACH ROW EXECUTE PROCEDURE func_foo_mod_a();
+}
+
+teardown
+{
+ DROP TRIGGER foo_mod_a ON foo1;
+ DROP FUNCTION func_foo_mod_a();
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN; }
+step "s1u" { UPDATE foo SET b='EFG' WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2u" { UPDATE foo SET b='XYZ' WHERE a=1; }
+
+permutation "s1u" "s1c" "s2u"
+permutation "s1u" "s2u" "s1c"
+permutation "s2u" "s1u" "s1c"
diff --git a/src/test/isolation/specs/partition-key-update-3.spec b/src/test/isolation/specs/partition-key-update-3.spec
new file mode 100644
index 0000000000..c1f547d9ba
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-3.spec
@@ -0,0 +1,30 @@
+# Concurrency error from ExecLockRows
+
+# Like partition-key-update-1.spec, throw an error when a session tries to
+# lock a row that has been moved to another partition due to a concurrent
+# update by another session.
+
+setup
+{
+ CREATE TABLE foo_r (a int, b text) PARTITION BY RANGE(a);
+ CREATE TABLE foo_r1 PARTITION OF foo_r FOR VALUES FROM (1) TO (10);
+ CREATE TABLE foo_r2 PARTITION OF foo_r FOR VALUES FROM (10) TO (20);
+ INSERT INTO foo_r VALUES(7, 'ABC');
+ CREATE UNIQUE INDEX foo_r1_a_unique ON foo_r1 (a);
+ CREATE TABLE bar (a int REFERENCES foo_r1(a));
+}
+
+teardown
+{
+ DROP TABLE bar, foo_r;
+}
+
+session "s1"
+setup { BEGIN; }
+step "s1u3" { UPDATE foo_r SET a=11 WHERE a=7 AND b = 'ABC'; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2i" { INSERT INTO bar VALUES(7); }
+
+permutation "s1u3" "s2i" "s1c"
--
2.14.1
On Tue, Feb 13, 2018 at 12:41 PM, amul sul <sulamul@gmail.com> wrote:
On Tue, Feb 13, 2018 at 11:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Your change appears fine to me. I think one can set both block number
and offset as we do for HeapTupleHeaderIsSpeculative, but the way you
have done it looks good to me. Kindly include it in the next version
of your patch by adding the missing comment.
Thanks for the confirmation, updated patch attached.
+# Concurrency error from GetTupleForTrigger
+# Concurrency error from ExecLockRows
I think you don't need to mention above sentences in spec files.
Apart from that, your patch looks good to me. I have marked it as
Ready For Committer.
Notes for Committer -
1. We might need some changes in update-tuple-routing mechanism if we
decide to do anything for the bug [1] discussed in the nearby thread,
but as that is not directly related to this patch, we can move ahead.
2. I think it is better to document that for update tuple routing the
delete+insert will be replayed separately on replicas. I leave this
to the discretion of the committer.
[1]: /messages/by-id/CAAJ_b94bYxLsX0erZXVH-anQPbWqcYUPWX4xVRa1YJY=Ph60ZQ@mail.gmail.com
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Feb 14, 2018 at 5:44 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
+# Concurrency error from GetTupleForTrigger
+# Concurrency error from ExecLockRows
I think you don't need to mention above sentences in spec files.
Apart from that, your patch looks good to me. I have marked it as
Ready For Committer.
I too have tested this feature with the isolation framework and it looks good
to me.
Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation
Hi,
On 2018-02-13 12:41:26 +0530, amul sul wrote:
From 08c8c7ece7d9411e70a780dbeed89d81419db6b6 Mon Sep 17 00:00:00 2001
From: Amul Sul <sulamul@gmail.com>
Date: Tue, 13 Feb 2018 12:37:52 +0530
Subject: [PATCH 1/2] Invalidate ip_blkid v5

v5:
- Added code in heap_mask to skip wal_consistency_checking[7]
- Fixed previous todos.

v5-wip2:
- Minor changes in RelationFindReplTupleByIndex() and
  RelationFindReplTupleSeq()
- TODO: Same as the previous

v5-wip: Update w.r.t Amit Kapila's comments[6].
- Reverted error message in nodeModifyTable.c from 'tuple to be locked'
  to 'tuple to be updated'.
- TODO:
  1. Yet to make a decision on having LOG/ELOG/ASSERT in the
     RelationFindReplTupleByIndex() and RelationFindReplTupleSeq().

v4: Rebased on "UPDATE of partition key v35" patch[5].

v3: Update w.r.t Amit Kapila's[3] & Alvaro Herrera's[4] comments
- typos in all error messages and comments: "to an another" -> "to another"
- error message change: "tuple to be updated" -> "tuple to be locked"
- In ExecOnConflictUpdate(), error report converted into an assert &
  comments added.

v2: Updated w.r.t Robert's review comments[2]
- Updated a couple of comments on the heap_delete argument and ItemPointerData
- Added the same concurrent-update error logic in ExecOnConflictUpdate,
  RelationFindReplTupleByIndex and RelationFindReplTupleSeq

v1: Initial version -- as per Amit Kapila's suggestions[1]
- When a tuple is being moved to another partition, ip_blkid in the
  tuple header is marked InvalidBlockNumber.
Very nice and instructive to keep this in a submission's commit message.
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8a846e7dba..f4560ee9cb 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3037,6 +3037,8 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
  * crosscheck - if not InvalidSnapshot, also check tuple against this
  * wait - true if should wait for any conflicting update to commit/abort
  * hufd - output parameter, filled in failure cases (see below)
+ * row_moved - true iff the tuple is being moved to another partition
+ *             table due to an update of partition key. Otherwise, false.
  *
I don't think 'row_moved' is a good variable name for this. Moving a row
in our heap format can mean a lot of things. Maybe 'to_other_part' or
'changing_part'?
+	/*
+	 * Sets a block identifier to the InvalidBlockNumber to indicate such an
+	 * update being moved tuple to another partition.
+	 */
+	if (row_moved)
+		BlockIdSet(&((tp.t_data->t_ctid).ip_blkid), InvalidBlockNumber);
The parens here are set in a bit weird way. I assume that's from copying
it out of ItemPointerSet()? Why aren't you just using ItemPointerSetBlockNumber()?
I think it'd be better if we followed the example of speculative inserts
and created an equivalent of HeapTupleHeaderSetSpeculativeToken. That'd
be a heck of a lot easier to grep for...
@@ -9314,6 +9323,18 @@ heap_mask(char *pagedata, BlockNumber blkno)
 		 */
 		if (HeapTupleHeaderIsSpeculative(page_htup))
 			ItemPointerSet(&page_htup->t_ctid, blkno, off);
+
+		/*
+		 * For a deleted tuple, a block identifier is set to the
I think this 'the' is superfluous.
+		 * InvalidBlockNumber to indicate that the tuple has been moved to
+		 * another partition due to an update of partition key.
But I think it should be 'the partition key'.
+		 * Like speculative tuple, to ignore any inconsistency set block
+		 * identifier to current block number.
This doesn't quite parse.
+		 */
+		if (!BlockNumberIsValid(
+				BlockIdGetBlockNumber((&((page_htup->t_ctid).ip_blkid)))))
+			BlockIdSet(&((page_htup->t_ctid).ip_blkid), blkno);
 	}
That formatting looks wrong. I think it should be replaced by a macro
like mentioned above.
 /*
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 160d941c00..a770531e14 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -3071,6 +3071,11 @@ ltrmark:;
 				ereport(ERROR,
 						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 						 errmsg("could not serialize access due to concurrent update")));
+			if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
Yes, given that we repeat this in multiple places, I *definitely* want
to see this wrapped in a macro with a descriptive name.
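For illustration only (the macro name below is made up, nothing like it is
in the posted patch), such a wrapper could be built from the very same calls
the patch already repeats:

/* hypothetical name: was this t_ctid link invalidated by a cross-partition
 * update? */
#define ItemPointerIndicatesMovedPartitions(ctid) \
	(!BlockNumberIsValid(BlockIdGetBlockNumber(&((ctid)->ip_blkid))))

so that the call sites above would read:

	if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("tuple to be locked was already moved to another partition due to concurrent update")));

which is considerably easier to grep for.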
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 7961b4be6a..b07b7092de 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -218,6 +218,11 @@ lnext:
 				ereport(ERROR,
 						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 						 errmsg("could not serialize access due to concurrent update")));
+			if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
Why are we using ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE rather than
ERRCODE_T_R_SERIALIZATION_FAILURE? A lot of frameworks have builtin
logic to retry serialization failures, and this kind of thing is going
to be resolved by retrying, no?
diff --git a/src/test/isolation/expected/partition-key-update-1.out b/src/test/isolation/expected/partition-key-update-1.out
new file mode 100644
I'd like to see tests that show various interactions with ON CONFLICT.
Greetings,
Andres Freund
On Tue, Mar 6, 2018 at 4:53 AM, Andres Freund <andres@anarazel.de> wrote:
Hi,
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 7961b4be6a..b07b7092de 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -218,6 +218,11 @@ lnext:
 				ereport(ERROR,
 						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 						 errmsg("could not serialize access due to concurrent update")));
+			if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
Why are we using ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE rather than
ERRCODE_T_R_SERIALIZATION_FAILURE? A lot of frameworks have builtin
logic to retry serialization failures, and this kind of thing is going
to resolved by retrying, no?
I think it depends, in some cases retry can help in deleting the
required tuple, but in other cases like when the user tries to perform
delete on a particular partition table, it won't be successful as the
tuple would have been moved.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Feb 28, 2018 at 12:38 PM, Rajkumar Raghuwanshi <
rajkumar.raghuwanshi@enterprisedb.com> wrote:
On Wed, Feb 14, 2018 at 5:44 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
+# Concurrency error from GetTupleForTrigger
+# Concurrency error from ExecLockRows
I think you don't need to mention above sentences in spec files.
Apart from that, your patch looks good to me. I have marked it as
Ready For Committer.
I too have tested this feature with the isolation framework and it looks good
to me.
It looks to me that we are trying to fix only one issue here with
concurrent updates. What happens if a non-partition key is first updated
and then a second session updates the partition key?
For example, with your patches applied:
CREATE TABLE pa_target (key integer, val text)
PARTITION BY LIST (key);
CREATE TABLE part1 PARTITION OF pa_target FOR VALUES IN (1);
CREATE TABLE part2 PARTITION OF pa_target FOR VALUES IN (2);
INSERT INTO pa_target VALUES (1, 'initial1');
session1:
BEGIN;
UPDATE pa_target SET val = val || ' updated by update1' WHERE key = 1;
UPDATE 1
postgres=# SELECT * FROM pa_target ;
key | val
-----+-----------------------------
1 | initial1 *updated by update1*
(1 row)
session2:
UPDATE pa_target SET val = val || ' updated by update2', key = key + 1
WHERE key = 1
<blocks>
session1:
postgres=# COMMIT;
COMMIT
<session1 unblocks and completes its UPDATE>
postgres=# SELECT * FROM pa_target ;
key | val
-----+-----------------------------
2 | initial1 updated by update2
(1 row)
Ouch. The committed updates by session1 are overwritten by session2. This
clearly violates the rules that rest of the system obeys and is not
acceptable IMHO.
Clearly, ExecUpdate() while moving rows between partitions is missing out
on re-constructing the to-be-updated tuple, based on the latest tuple in
the update chain. Instead, it's simply deleting the latest tuple and
inserting a new tuple in the new partition based on the old tuple. That's
simply wrong.
I haven't really thought carefully to see if this should be a separate
patch, but it warrants attention. We should at least think through all
different concurrency aspects of partition key updates and think about a
holistic solution, instead of fixing one problem at a time. This probably
also shows that isolation tests for partition key updates are either
missing (I haven't checked) or they need more work.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Feb 13, 2018 at 12:41 PM, amul sul <sulamul@gmail.com> wrote:
Thanks for the confirmation, updated patch attached.
I am actually very surprised that 0001-Invalidate-ip_blkid-v5.patch does
not do anything to deal with the fact that t_ctid may no longer point to
itself to mark end of the chain. I just can't see how that would work. But
if it does, it needs good amount of comments explaining why and most likely
updating comments at other places where chain following is done. For
example, how's this code in heap_get_latest_tid() is still valid? Aren't we
setting "ctid" to some invalid value here?
2302 /*
2303 * If there's a valid t_ctid link, follow it, else we're done.
2304 */
2305 if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
2306 HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
2307 ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
2308 {
2309 UnlockReleaseBuffer(buffer);
2310 break;
2311 }
2312
2313 ctid = tp.t_data->t_ctid;
This is just one example. I am almost certain there are many such cases
that will require careful attention.
What happens if a partition key update deletes a row, but the operation is
aborted? Do we need any special handling for that case?
I am actually worried that we're tinkering with ip_blkid to handle one
corner case of detecting partition key update. This is going to change
on-disk format and probably need more careful attention. Are we certain
that we would never require update-chain following when partition keys are
updated? If so, can we think about some other mechanism which actually even
leaves behind <new_partition, new_ctid>? I am not saying we should do that,
but it warrants a thought. May be it was discussed somewhere else and ruled
out. I happened to notice this patch because of the bug I encountered.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 8 March 2018 at 09:15, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
For example, with your patches applied:
CREATE TABLE pa_target (key integer, val text)
PARTITION BY LIST (key);
CREATE TABLE part1 PARTITION OF pa_target FOR VALUES IN (1);
CREATE TABLE part2 PARTITION OF pa_target FOR VALUES IN (2);
INSERT INTO pa_target VALUES (1, 'initial1');
session1:
BEGIN;
UPDATE pa_target SET val = val || ' updated by update1' WHERE key = 1;
UPDATE 1
postgres=# SELECT * FROM pa_target ;
key | val
-----+-----------------------------
1 | initial1 updated by update1
(1 row)
session2:
UPDATE pa_target SET val = val || ' updated by update2', key = key + 1 WHERE
key = 1
<blocks>
session1:
postgres=# COMMIT;
COMMIT
<session1 unblocks and completes its UPDATE>
postgres=# SELECT * FROM pa_target ;
key | val
-----+-----------------------------
2 | initial1 updated by update2
(1 row)
Ouch. The committed updates by session1 are overwritten by session2. This
clearly violates the rules that rest of the system obeys and is not
acceptable IMHO.
Clearly, ExecUpdate() while moving rows between partitions is missing out on
re-constructing the to-be-updated tuple, based on the latest tuple in the
update chain. Instead, it's simply deleting the latest tuple and inserting a
new tuple in the new partition based on the old tuple. That's simply wrong.
You are right. This needs to be fixed. This is a different issue than
the particular one that is being worked upon in this thread, and both
these issues have different fixes.
Like you said, the tuple needs to be reconstructed when ExecDelete()
finds that the row has been updated by another transaction. We should
send back this information from ExecDelete() (I think tupleid
parameter gets updated in this case), and then in ExecUpdate() we
should goto lreplace, so that the row is fetched back similar to
how it happens when heap_update() knows that the tuple was updated.
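Very roughly, and only as a sketch (the out-parameter name epq_tid is made
up here, it is not from any posted patch), the spot in ExecUpdate() after
the delete half of a cross-partition update could then look like:

	/*
	 * Sketch: ExecDelete() would report, via a hypothetical epq_tid
	 * out-parameter, the tid of the latest row version when it finds
	 * that the old one was concurrently updated.
	 */
	if (!ItemPointerEquals(tupleid, &epq_tid))
	{
		/*
		 * Somebody updated the row while we waited; re-fetch the new
		 * version, recompute the new tuple from it, and retry.
		 */
		*tupleid = epq_tid;
		goto lreplace;
	}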
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On Thu, Mar 8, 2018 at 11:04 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
On Tue, Feb 13, 2018 at 12:41 PM, amul sul <sulamul@gmail.com> wrote:
Thanks for the confirmation, updated patch attached.
I am actually very surprised that 0001-Invalidate-ip_blkid-v5.patch does not
do anything to deal with the fact that t_ctid may no longer point to itself
to mark end of the chain. I just can't see how that would work.
I think it is not that patch doesn't care about the end of the chain.
For example, ctid pointing to itself is used to ensure that for
deleted rows, nothing more needs to be done like below check in the
ExecUpdate/ExecDelete code path.
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
..
}
..
It will deal with such cases by checking invalidblockid before these
checks. So, we should be fine in such cases.
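Roughly speaking, as a sketch only (reusing the checks from the posted
hunks; the "tuple to be updated" wording is the one the changelog mentions
for nodeModifyTable.c):

	/* first, did a concurrent cross-partition update invalidate the link? */
	if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("tuple to be updated was already moved to another partition due to concurrent update")));

	/* only then follow the update chain, as the existing code does */
	if (!ItemPointerEquals(tupleid, &hufd.ctid))
	{
		/* re-fetch and retry with the latest row version, as today */
	}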
But if it
does, it needs good amount of comments explaining why and most likely
updating comments at other places where chain following is done. For
example, how's this code in heap_get_latest_tid() is still valid? Aren't we
setting "ctid" to some invalid value here?2302 /*
2303 * If there's a valid t_ctid link, follow it, else we're done.
2304 */
2305 if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
2306 HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
2307 ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
2308 {
2309 UnlockReleaseBuffer(buffer);
2310 break;
2311 }
2312
2313 ctid = tp.t_data->t_ctid;
I have not tested, but it seems this could be problematic, but I feel
we could deal with such cases by checking invalid block id in the
above if check. Another one such case is in EvalPlanQualFetch
This is just one example. I am almost certain there are many such cases that
will require careful attention.
Right, I think we should be able to detect and fix such cases.
What happens if a partition key update deletes a row, but the operation is
aborted? Do we need any special handling for that case?
If the transaction is aborted then future updates would update the
ctid to a new row, do you see any problem with it?
I am actually worried that we're tinkering with ip_blkid to handle one
corner case of detecting partition key update. This is going to change
on-disk format and probably need more careful attention. Are we certain that
we would never require update-chain following when partition keys are
updated?
I think we should never need update-chain following when the row is
moved from one partition to another partition, otherwise, we don't
change anything on the tuple.
If so, can we think about some other mechanism which actually even
leaves behind <new_partition, new_ctid>? I am not saying we should do that,
but it warrants a thought.
Oh, this would be a much bigger disk-format change and need much more
thoughts, where will we store new partition information.
May be it was discussed somewhere else and ruled
out.
There were a couple of other options discussed in the original thread
"UPDATE of partition key". One of them was to have an additional bit
on the tuple, but we found reusing ctid a better approach.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Mar 8, 2018 at 11:57 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On 8 March 2018 at 09:15, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
For example, with your patches applied:
CREATE TABLE pa_target (key integer, val text)
PARTITION BY LIST (key);
CREATE TABLE part1 PARTITION OF pa_target FOR VALUES IN (1);
CREATE TABLE part2 PARTITION OF pa_target FOR VALUES IN (2);
INSERT INTO pa_target VALUES (1, 'initial1');
session1:
BEGIN;
UPDATE pa_target SET val = val || ' updated by update1' WHERE key = 1;
UPDATE 1
postgres=# SELECT * FROM pa_target ;
key | val
-----+-----------------------------
1 | initial1 updated by update1
(1 row)
session2:
UPDATE pa_target SET val = val || ' updated by update2', key = key + 1 WHERE
key = 1
<blocks>
session1:
postgres=# COMMIT;
COMMIT
<session1 unblocks and completes its UPDATE>
postgres=# SELECT * FROM pa_target ;
key | val
-----+-----------------------------
2 | initial1 updated by update2
(1 row)
Ouch. The committed updates by session1 are overwritten by session2. This
clearly violates the rules that rest of the system obeys and is not
acceptable IMHO.
Clearly, ExecUpdate() while moving rows between partitions is missing out on
re-constructing the to-be-updated tuple, based on the latest tuple in the
update chain. Instead, it's simply deleting the latest tuple and inserting a
new tuple in the new partition based on the old tuple. That's simply wrong.
You are right. This needs to be fixed. This is a different issue than
the particular one that is being worked upon in this thread, and both
these issues have different fixes.
I also think that this is a bug in the original patch and won't be
directly related to the patch being discussed.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 8 March 2018 at 12:34, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Mar 8, 2018 at 11:57 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On 8 March 2018 at 09:15, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
For example, with your patches applied:
CREATE TABLE pa_target (key integer, val text)
PARTITION BY LIST (key);
CREATE TABLE part1 PARTITION OF pa_target FOR VALUES IN (1);
CREATE TABLE part2 PARTITION OF pa_target FOR VALUES IN (2);
INSERT INTO pa_target VALUES (1, 'initial1');
session1:
BEGIN;
UPDATE pa_target SET val = val || ' updated by update1' WHERE key = 1;
UPDATE 1
postgres=# SELECT * FROM pa_target ;
key | val
-----+-----------------------------
1 | initial1 updated by update1
(1 row)
session2:
UPDATE pa_target SET val = val || ' updated by update2', key = key + 1 WHERE
key = 1
<blocks>
session1:
postgres=# COMMIT;
COMMIT
<session1 unblocks and completes its UPDATE>
postgres=# SELECT * FROM pa_target ;
key | val
-----+-----------------------------
2 | initial1 updated by update2
(1 row)
Ouch. The committed updates by session1 are overwritten by session2. This
clearly violates the rules that rest of the system obeys and is not
acceptable IMHO.
Clearly, ExecUpdate() while moving rows between partitions is missing out on
re-constructing the to-be-updated tuple, based on the latest tuple in the
update chain. Instead, it's simply deleting the latest tuple and inserting a
new tuple in the new partition based on the old tuple. That's simply wrong.
You are right. This needs to be fixed. This is a different issue than
the particular one that is being worked upon in this thread, and both
these issues have different fixes.
I also think that this is a bug in the original patch and won't be
directly related to the patch being discussed.
Yes. Will submit a patch for this in a separate thread.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On Thu, Mar 8, 2018 at 12:31 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
On Thu, Mar 8, 2018 at 11:04 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
On Tue, Feb 13, 2018 at 12:41 PM, amul sul <sulamul@gmail.com> wrote:
Thanks for the confirmation, updated patch attached.
I am actually very surprised that 0001-Invalidate-ip_blkid-v5.patch
does not
do anything to deal with the fact that t_ctid may no longer point to
itself
to mark end of the chain. I just can't see how that would work.
I think it is not that patch doesn't care about the end of the chain.
For example, ctid pointing to itself is used to ensure that for
deleted rows, nothing more needs to be done like below check in the
ExecUpdate/ExecDelete code path.
Yeah, but it only looks for places where it needs to detect deleted tuples
and thus wants to throw an error. I am worried about other places where it
is assumed that the ctid points to a valid looking tid, self or otherwise.
I see no such places being either updated or commented.
Now may be there is no danger because of other protections in place, but it
looks hazardous.
I have not tested, but it seems this could be problematic, but I feel
we could deal with such cases by checking invalid block id in the
above if check. Another one such case is in EvalPlanQualFetch
Right.
What happens if a partition key update deletes a row, but the operation
is
aborted? Do we need any special handling for that case?
If the transaction is aborted than future updates would update the
ctid to a new row, do you see any problem with it?
I don't know. Maybe there is none. But it needs to be explained why it's not a
a problem.
I am actually worried that we're tinkering with ip_blkid to handle one
corner case of detecting partition key update. This is going to change
on-disk format and probably need more careful attention. Are we certain that
we would never require update-chain following when partition keys are
updated?
I think we should never need update-chain following when the row is
moved from one partition to another partition, otherwise, we don't
change anything on the tuple.
I am not sure I follow. I understand that it's probably a tough problem to
follow update chain from one partition to another. But why do you think we
would never need that? What if someone wants to improve on the restriction
this patch is imposing and actually implement partition key UPDATEs the way
we do for regular tables i.e. instead of throwing error, we actually
update/delete the row in the new partition?
If so, can we think about some other mechanism which actually even
leaves behind <new_partition, new_ctid>? I am not saying we should do that,
but it warrants a thought.
Oh, this would much bigger disk-format change and need much more
thoughts, where will we store new partition information.
Yeah, but the disk format will probably change just once. Or may be
this can be done local to a partition table without requiring any disk
format changes? Like adding a nullable hidden column in each partition to
store the forward pointer?
Thanks,
Pavan
On Thu, Mar 8, 2018 at 12:52 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
On Thu, Mar 8, 2018 at 12:31 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
On Thu, Mar 8, 2018 at 11:04 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
On Tue, Feb 13, 2018 at 12:41 PM, amul sul <sulamul@gmail.com> wrote:
Thanks for the confirmation, updated patch attached.
I am actually very surprised that 0001-Invalidate-ip_blkid-v5.patch does
not
do anything to deal with the fact that t_ctid may no longer point to
itself
to mark end of the chain. I just can't see how that would work.
I think it is not that patch doesn't care about the end of the chain.
For example, ctid pointing to itself is used to ensure that for
deleted rows, nothing more needs to be done like below check in the
ExecUpdate/ExecDelete code path.
Yeah, but it only looks for places where it needs to detect deleted tuples
and thus wants to throw an error. I am worried about other places where it
is assumed that the ctid points to a valid looking tid, self or otherwise. I
see no such places being either updated or commented.
Now may be there is no danger because of other protections in place, but it
looks hazardous.
Right, I feel we need some tests to prove it, I think as per code I
can see we need checks in few more places (like the ones mentioned in
the previous email) apart from where this patch has added.
I have not tested, but it seems this could be problematic, but I feel
we could deal with such cases by checking invalid block id in the
above if check. Another one such case is in EvalPlanQualFetch
Right.
Amul, can you please look into the scenario being discussed and see if
you can write a test to see the behavior.
What happens if a partition key update deletes a row, but the operation
is
aborted? Do we need any special handling for that case?
If the transaction is aborted then future updates would update the
ctid to a new row, do you see any problem with it?
I don't know. Maybe there is none. But it needs to be explained why it's not a
problem.
Sure, I guess in that case, we need to update in comments why it would
be okay after abort.
I am actually worried that we're tinkering with ip_blkid to handle one
corner case of detecting partition key update. This is going to change
on-disk format and probably need more careful attention. Are we certain
that
we would never require update-chain following when partition keys are
updated?
I think we should never need update-chain following when the row is
moved from one partition to another partition, otherwise, we don't
change anything on the tuple.
I am not sure I follow. I understand that it's probably a tough problem to
follow update chain from one partition to another. But why do you think we
would never need that? What if someone wants to improve on the restriction
this patch is imposing and actually implement partition key UPDATEs the way
we do for regular tables i.e. instead of throwing error, we actually
update/delete the row in the new partition?
I think even if we want to uplift this restriction, storing ctid link
of another partition appears to be a major change somebody would like
to do for this feature. We had some discussion on this matter earlier
where Robert, Greg seems to have said something like that as well.
See [1][2]. I think one way could be if updates/deletes encounter
InvalidBlkID, they can use metadata of partition table to refind the
row. We already had a discussion on this point in the original thread
"UPDATE of partition key" and agreed to throw an error as the better
way to deal with it.
[1]: /messages/by-id/CAM-w4HPis7rbnwi+oXjnouqMSRAC5DeVcMdxEXTMfDos1kaYPQ@mail.gmail.com
[2]: /messages/by-id/CA+TgmoY1W-jaS0vH8f=5xKQB3EWj5L0XcBf6P7WB7JqbKB3tSQ@mail.gmail.com
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Mar 8, 2018 at 3:01 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Mar 8, 2018 at 12:52 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
On Thu, Mar 8, 2018 at 12:31 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
On Thu, Mar 8, 2018 at 11:04 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
On Tue, Feb 13, 2018 at 12:41 PM, amul sul <sulamul@gmail.com> wrote:
Thanks for the confirmation, updated patch attached.
I am actually very surprised that 0001-Invalidate-ip_blkid-v5.patch does
not
do anything to deal with the fact that t_ctid may no longer point to
itself
to mark end of the chain. I just can't see how that would work.
I think it is not that patch doesn't care about the end of the chain.
For example, ctid pointing to itself is used to ensure that for
deleted rows, nothing more needs to be done like below check in the
ExecUpdate/ExecDelete code path.
Yeah, but it only looks for places where it needs to detect deleted tuples
and thus wants to throw an error. I am worried about other places where it
is assumed that the ctid points to a valid looking tid, self or otherwise. I
see no such places being either updated or commented.
Now may be there is no danger because of other protections in place, but it
looks hazardous.
Right, I feel we need some tests to prove it, I think as per code I
can see we need checks in few more places (like the ones mentioned in
the previous email) apart from where this patch has added.
I have not tested, but it seems this could be problematic, but I feel
we could deal with such cases by checking invalid block id in the
above if check. Another one such case is in EvalPlanQualFetch
Right.
Amul, can you please look into the scenario being discussed and see if
you can write a test to see the behavior.
Sure, I'll try.
Regards,
Amul
On Thu, Mar 8, 2018 at 12:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Mar 8, 2018 at 11:04 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
[.....]
But if it
does, it needs good amount of comments explaining why and most likely
updating comments at other places where chain following is done. For
example, how's this code in heap_get_latest_tid() is still valid? Aren't we
setting "ctid" to some invalid value here?2302 /*
2303 * If there's a valid t_ctid link, follow it, else we're done.
2304 */
2305 if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
2306 HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
2307 ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
2308 {
2309 UnlockReleaseBuffer(buffer);
2310 break;
2311 }
2312
2313 ctid = tp.t_data->t_ctid;
I have not tested, but it seems this could be problematic, but I feel
we could deal with such cases by checking invalid block id in the
above if check. Another one such case is in EvalPlanQualFetch
I tried the following test to hit this code and found that the situation is not
that much unpleasant.
heap_get_latest_tid() will follow the chain and return latest tid iff the
current tuple satisfies visibility check (via HeapTupleSatisfiesVisibility), in
our case it doesn't and we are safe here, but I agree with Amit -- it is better
to add invalid block id check.
In EvalPlanQualFetch() invalid block id check already there before
ItemPointerEquals call.
=== TEST ==
create table foo (a int2, b text) partition by list (a);
create table foo1 partition of foo for values IN (1);
create table foo2 partition of foo for values IN (2);
insert into foo values(1, 'Initial record');
update foo set b= b || ' -> update1' where a=1;
update foo set b= b || ' -> update2' where a=1;
postgres=# select tableoid::regclass, ctid, * from foo;
tableoid | ctid | a | b
----------+-------+---+--------------------------------------
foo1 | (0,3) | 1 | Initial record -> update1 -> update2
(1 row)
postgres=# select currtid2('foo1','(0,1)');
currtid2
----------
(0,3)
(1 row)
postgres=# select tableoid::regclass, ctid, * from foo;
tableoid | ctid | a | b
----------+-------+---+----------------------------------------------
foo2 | (0,1) | 2 | Initial record -> update1 -> update2-> moved
(1 row)
postgres=# select currtid2('foo1','(0,1)');
currtid2
----------
(0,1)
(1 row)
=== END ===
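For illustration only (a sketch, not part of the posted patch), the extra
check in the heap_get_latest_tid() loop quoted above could look something
like:

	/*
	 * If there's a valid t_ctid link, follow it, else we're done.  Also
	 * stop if the link was invalidated because the row was moved to
	 * another partition by a concurrent update.
	 */
	if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
		HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
		!BlockNumberIsValid(BlockIdGetBlockNumber(&((tp.t_data->t_ctid).ip_blkid))) ||
		ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
	{
		UnlockReleaseBuffer(buffer);
		break;
	}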
This is just one example. I am almost certain there are many such cases that
will require careful attention.
Right, I think we should be able to detect and fix such cases.
Will look into the places carefully where ItemPointerEquals() call
made for heap tuple.
Regards,
Amul
On Thu, Mar 8, 2018 at 12:34 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
I am actually very surprised that 0001-Invalidate-ip_blkid-v5.patch does not
do anything to deal with the fact that t_ctid may no longer point to itself
to mark end of the chain. I just can't see how that would work. But if it
does, it needs good amount of comments explaining why and most likely
updating comments at other places where chain following is done. For
example, how's this code in heap_get_latest_tid() is still valid? Aren't we
setting "ctid" to some invalid value here?
So the general idea of the patch is that this new kind of marking
marks the CTID chain as "broken" and that code which cares about
following CTID chains forward can see that it's reached a point where
the chain is broken and throw an error saying "hey, I can't do the
stuff we normally do in concurrency scenarions, because the CTID chain
got broken by a cross-partition update".
I don't think it's practical to actually make CTID links across
partitions work. Certainly not in time for v11. If somebody wants to
try that at some point in the future, cool. But that's moving the
goalposts an awfully long way. When this was discussed about a year
ago, my understanding is that there was a consensus that doing nothing
was not acceptable, but that throwing an error in the cases where
anomalies would have happened was good enough. I don't think anyone
argued that we had to be able to perfectly mimic the usual EPQ
semantics as a condition of having update tuple routing. That's
setting the bar at a level that we're not going to be able to reach in
the next couple of weeks. I suppose we could still decide that if we
can't have that, we don't want update tuple routing at all, but I
think that's an overreaction.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Pavan Deolasee <pavan.deolasee@gmail.com> writes:
I am actually very surprised that 0001-Invalidate-ip_blkid-v5.patch does
not do anything to deal with the fact that t_ctid may no longer point to
itself to mark end of the chain. I just can't see how that would work.
...
I am actually worried that we're tinkering with ip_blkid to handle one
corner case of detecting partition key update. This is going to change
on-disk format and probably need more careful attention.
You know, either one of those alone would be scary as hell. Both in
one patch seem to me to be sufficient reason to reject it outright.
Not only will it be an unending source of bugs, but it's chewing up
far too much of what few remaining degrees-of-freedom we have in the
on-disk format ... for a single purpose that hasn't even been sold as
something we have to have.
Find another way.
regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
... I suppose we could still decide that if we
can't have that, we don't want update tuple routing at all, but I
think that's an overreaction.
Between this thread and
<CAJ3gD9fRbEzDqdeDq1jxqZUb47kJn+tQ7=Bcgjc8quqKsDViKQ@mail.gmail.com>
I am getting the distinct impression that that feature wasn't ready
to be committed. I think that reverting it for v11 is definitely
an option that needs to be kept on the table.
regards, tom lane
On Thu, Mar 8, 2018 at 10:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Pavan Deolasee <pavan.deolasee@gmail.com> writes:
I am actually very surprised that 0001-Invalidate-ip_blkid-v5.patch does
not do anything to deal with the fact that t_ctid may no longer point to
itself to mark end of the chain. I just can't see how that would work.
...
I am actually worried that we're tinkering with ip_blkid to handle one
corner case of detecting partition key update. This is going to change
on-disk format and probably need more careful attention.
You know, either one of those alone would be scary as hell. Both in
one patch seem to me to be sufficient reason to reject it outright.
Not only will it be an unending source of bugs, but it's chewing up
far too much of what few remaining degrees-of-freedom we have in the
on-disk format ... for a single purpose that hasn't even been sold as
something we have to have.
I agree that it isn't clear that it's worth making a change to the
on-disk format for this feature. I made the argument when it was
first proposed that we should just document that there would be
anomalies with cross-partition updates that didn't occur otherwise.
However, multiple people thought that it was worth burning one of our
precious few remaining infomask bits in order to throw an error in
that case rather than just silently having an anomaly, and that's why
this patch got written. It's not too late to decide that we'd rather
not do that after all.
However, there's no such thing as a free lunch. We can't use the CTID
field to point to a CTID in another table because there's no room to
include the identity of the other table in the field. We can't widen
it to make room because that would break on-disk compatibility and
bloat our already-too-big tuple headers. So, we cannot make it work
like it does when the updates are confined to a single partition.
Therefore, the only options are (1) ignore the problem, and let a
cross-partition update look entirely like a delete+insert, (2) try to
throw some error in the case where this introduces user-visible
anomalies that wouldn't be visible otherwise, or (3) revert update
tuple routing entirely. I voted for (1), but the consensus was (2).
I think that (3) will make a lot of people sad; it's a very good
feature. If we want to have (2), then we've got to have some way to
mark a tuple that was deleted as part of a cross-partition update, and
that requires a change to the on-disk format.
In short, the two things that you are claiming are prohibitively scary
if done in the same patch look to me like they're actually just one
thing, and that one thing is something which absolutely has to be done
in order to implement the design most community members favored in the
original discussion.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
Therefore, the only options are (1) ignore the problem, and let a
cross-partition update look entirely like a delete+insert, (2) try to
throw some error in the case where this introduces user-visible
anomalies that wouldn't be visible otherwise, or (3) revert update
tuple routing entirely. I voted for (1), but the consensus was (2).
FWIW, I would also vote for (1), especially if the only way to do (2)
is stuff as outright scary as this. I would far rather have (3) than
this, because IMO, what we are looking at right now is going to make
the fallout from multixacts look like a pleasant day at the beach.
regards, tom lane
On Thu, Mar 8, 2018 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Therefore, the only options are (1) ignore the problem, and let a
cross-partition update look entirely like a delete+insert, (2) try to
throw some error in the case where this introduces user-visible
anomalies that wouldn't be visible otherwise, or (3) revert update
tuple routing entirely. I voted for (1), but the consensus was (2).
FWIW, I would also vote for (1), especially if the only way to do (2)
is stuff as outright scary as this. I would far rather have (3) than
this, because IMO, what we are looking at right now is going to make
the fallout from multixacts look like a pleasant day at the beach.
Whoa. Well, that would clearly be bad, but I don't understand why you
find this so scary. Can you explain further?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Mar 8, 2018 at 10:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
However, there's no such thing as a free lunch. We can't use the CTID
field to point to a CTID in another table because there's no room to
include the identify of the other table in the field. We can't widen
it to make room because that would break on-disk compatibility and
bloat our already-too-big tuple headers. So, we cannot make it work
like it does when the updates are confined to a single partition.
Therefore, the only options are (1) ignore the problem, and let a
cross-partition update look entirely like a delete+insert, (2) try to
throw some error in the case where this introduces user-visible
anomalies that wouldn't be visible otherwise, or (3) revert update
tuple routing entirely. I voted for (1), but the consensus was (2).
I think that (3) will make a lot of people sad; it's a very good
feature.
I am definitely not suggesting to do #3, though I agree with Tom that the
option is on table. May be two back-to-back bugs in the area makes me
worried and raises questions about the amount of testing the feature has
got. In addition, making such a significant on-disk change for one corner
case, for which even #1 might be acceptable, seems a lot. If we at all want
to go in that direction, I would suggest considering a patch that I wrote
last year to free-up additional bits from the ctid field (as part of the
WARM). I know Tom did not like that either, but at the very least, it
provides us a lot more room for future work, with the same amount of risk.
If we want to have (2), then we've got to have some way to
mark a tuple that was deleted as part of a cross-partition update, and
that requires a change to the on-disk format.
I think the question is: isn't there an alternate way to achieve the same
result? One alternate way would be to do what I suggested above i.e. free
up more bits and use one of those. Another way would be to add a hidden
column to the partition table, when it is created or when it is attached as
a partition. This only penalises the partition tables, but keeps rest of
the system out of it. Obviously, if this column is added when the table is
attached as a partition, as against at table creation time, then the old
tuple may not have room to store this additional field. May be we can
handle that by double updating the tuple? That seems bad, but then it only
impacts the case when a partition key is updated. And we can clearly
document performance implications of that operation. I am not sure how
common this case is going to be anyways. With this hidden column, we can
even store a pointer to another partition and do something with that, if at
all needed.
That's just one idea. Of course, I haven't thought about it for more than
10mins, so most likely I may have missed out on details and it's probably a
stupid idea afterall. But there could be other ideas too. And even if we
can't find one, my vote would be to settle for #1 instead of trying to do
#2.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Mar 8, 2018 at 12:25 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
I think the question is: isn't there an alternate way to achieve the same
result? One alternate way would be to do what I suggested above i.e. free up
more bits and use one of those.
That's certainly possible, but TBH the CTID field seems like a pretty
good choice for this particular feature. I mean, we're essentially
trying to indicate that the CTID link is not valid, so using an
invalid value in the CTID field seems like a pretty natural choice.
We could use, say, an infomask bit to indicate that the CTID link is
not valid, but an infomask bit is more precious. Any two-valued
property can be represented by an infomask bit, but using the CTID
field is only possible for properties that can't be true at the same
time that the CTID field needs to be valid. So it makes sense that
this property, which can't be true at the same time the CTID field
needs to be valid, should try to use an otherwise-unused bit pattern
for the CTID field itself.
Another way would be to add a hidden column
to the partition table, when it is created or when it is attached as a
partition. This only penalises the partition tables, but keeps rest of the
system out of it. Obviously, if this column is added when the table is
attached as a partition, as against at table creation time, then the old
tuple may not have room to store this additional field. May be we can handle
that by double updating the tuple? That seems bad, but then it only impacts
the case when a partition key is updated. And we can clearly document
performance implications of that operation. I am not sure how common this
case is going to be anyways. With this hidden column, we can even store a
pointer to another partition and do something with that, if at all needed.
Sure, but that would mean that partitioned tables would get bigger as
compared with unpartitioned tables, it would break backward
compatibility with v10, and it would require a major redesign of the
system -- the list of "system" columns is deeply embedded in the
system design and previous proposals to add to it have not been met
with wild applause.
That's just one idea. Of course, I haven't thought about it for more than
10mins, so most likely I may have missed out on details and it's probably a
stupid idea afterall. But there could be other ideas too. And even if we
can't find one, my vote would be to settle for #1 instead of trying to do
#2.
Fair enough. I don't really see a reason why we can't make #2 work.
Obviously, the patch touches the on-disk format and is therefore scary
-- that's why I thought it should be broken out of the main update
tuple routing patch -- but it's far less of a structural change than
Alvaro's multixact work or the WARM stuff, at least according to my
current understanding. Tom said he thinks it's riskier than the
multixact stuff but I don't see why that should be the case. That had
widespread impacts on vacuuming and checkpointing that are not at
issue here. Still, there's no question that it's a scary patch and if
the consensus is now that we don't need it -- so be it.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Mar 8, 2018 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
FWIW, I would also vote for (1), especially if the only way to do (2)
is stuff as outright scary as this. I would far rather have (3) than
this, because IMO, what we are looking at right now is going to make
the fallout from multixacts look like a pleasant day at the beach.
Whoa. Well, that would clearly be bad, but I don't understand why you
find this so scary. Can you explain further?
Possibly I'm crying wolf; it's hard to be sure. But I recall that nobody
was particularly afraid of multixacts when that went in, and look at all
the trouble we've had with that. Breaking fundamental invariants like
"ctid points to this tuple or its update successor" is going to cause
trouble. There's a lot of code that knows that; more than knows the
details of what's in xmax, I believe.
I would've been happier about expending an infomask bit towards this
purpose. Just eyeing what we've got, I can't help noticing that
HEAP_MOVED_OFF/HEAP_MOVED_IN couldn't possibly be set in any tuple
in a partitioned table. Perhaps making these tests depend on
partitioned-ness would be unworkably messy, but it's worth thinking
about.
regards, tom lane
On March 8, 2018 10:46:53 AM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Mar 8, 2018 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
FWIW, I would also vote for (1), especially if the only way to do
(2)
is stuff as outright scary as this. I would far rather have (3)
than
this, because IMO, what we are looking at right now is going to make
the fallout from multixacts look like a pleasant day at the beach.
Whoa. Well, that would clearly be bad, but I don't understand why
you
find this so scary. Can you explain further?
Possibly I'm crying wolf; it's hard to be sure. But I recall that
nobody
was particularly afraid of multixacts when that went in, and look at
all
the trouble we've had with that. Breaking fundamental invariants like
"ctid points to this tuple or its update successor" is going to cause
trouble. There's a lot of code that knows that; more than knows the
details of what's in xmax, I believe.I would've been happier about expending an infomask bit towards this
purpose. Just eyeing what we've got, I can't help noticing that
HEAP_MOVED_OFF/HEAP_MOVED_IN couldn't possibly be set in any tuple
in a partitioned table. Perhaps making these tests depend on
partitioned-ness would be unworkably messy, but it's worth thinking
about.
We're pretty much doing so for speculative lock IDs/upsert already. Which doesn't seem to have caused a lot of problems.
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On March 8, 2018 10:46:53 AM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Mar 8, 2018 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
FWIW, I would also vote for (1), especially if the only way to do
(2)
is stuff as outright scary as this. I would far rather have (3)
than
this, because IMO, what we are looking at right now is going to make
the fallout from multixacts look like a pleasant day at the beach.
Whoa. Well, that would clearly be bad, but I don't understand why
you
find this so scary. Can you explain further?
Possibly I'm crying wolf; it's hard to be sure. But I recall that
nobody
was particularly afraid of multixacts when that went in, and look at
all
the trouble we've had with that. Breaking fundamental invariants like
"ctid points to this tuple or its update successor" is going to cause
trouble. There's a lot of code that knows that; more than knows the
details of what's in xmax, I believe.
I don't think this is that big a problem. All code already needs to handle the case where ctid points to an aborted update tuple. Which might long have been replaced by an independent row. That's why we have all these updated.xmax == new.xmin checks. Which will, without any changes, catch the proposed scheme, no?
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Andres Freund <andres@anarazel.de> writes:
On March 8, 2018 10:46:53 AM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Breaking fundamental invariants like
"ctid points to this tuple or its update successor" is going to cause
trouble. There's a lot of code that knows that; more than knows the
details of what's in xmax, I believe.
I don't think this is that big a problem. All code already needs to handle the case where ctid points to an aborted update tuple. Which might long have been replaced by an independent row. That's why we have all these updated.xmax == new.xmin checks. Which will, without any changes, catch the proposed scheme, no?
No. In those situations, the conclusion is that the current tuple is
live, which is exactly the wrong conclusion for a cross-partition update.
Or at least it might be the wrong conclusion ... I wonder how this patch
works if the updating transaction aborted.
regards, tom lane
On 2018-03-08 14:25:59 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On March 8, 2018 10:46:53 AM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Breaking fundamental invariants like
"ctid points to this tuple or its update successor" is going to cause
trouble. There's a lot of code that knows that; more than knows the
details of what's in xmax, I believe.I don't think this is that big a problem. All code already needs to handle the case where ctid points to an aborted update tuple. Which might long have been replaced by as an independent role. That's why we have all this updated.xmax == new.xmin checks. Which will, without any changes, catch the proposed scheme, no?
No. In those situations, the conclusion is that the current tuple is
live, which is exactly the wrong conclusion for a cross-partition
update.
I don't see the problem you're seeing here. Visibility decisions and
ctid chaining aren't really done in the same way. And in the cases we do
want different behaviour for updates that cross partition boundaries,
the patch adds the error messages. What I was trying to say is not that
we don't need to touch any of those paths, but that there's code to
handle bogus ctid values already. That really wasn't the case for
multixacts (in fact, they broke this check in multiple places).
Or at least it might be the wrong conclusion ... I wonder how this patch
works if the updating transaction aborted.
If the updated transaction aborted, HTSU will return
HeapTupleMayBeUpdated and we can just go ahead and allow an update?
Greetings,
Andres Freund
Hi Andres,
Thanks for your time and the review comments/suggestions.
On Tue, Mar 6, 2018 at 4:53 AM, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2018-02-13 12:41:26 +0530, amul sul wrote:
From 08c8c7ece7d9411e70a780dbeed89d81419db6b6 Mon Sep 17 00:00:00 2001
From: Amul Sul <sulamul@gmail.com>
Date: Tue, 13 Feb 2018 12:37:52 +0530
Subject: [PATCH 1/2] Invalidate ip_blkid v5
[....]
Very nice and instructive to keep this in a submission's commit message.
Thank you.
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8a846e7dba..f4560ee9cb 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3037,6 +3037,8 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
  * crosscheck - if not InvalidSnapshot, also check tuple against this
  * wait - true if should wait for any conflicting update to commit/abort
  * hufd - output parameter, filled in failure cases (see below)
+ * row_moved - true iff the tuple is being moved to another partition
+ *             table due to an update of partition key. Otherwise, false.
  *
I don't think 'row_moved' is a good variable name for this. Moving a row
in our heap format can mean a lot of things. Maybe 'to_other_part' or
'changing_part'?
Okay, renamed to 'changing_part' in the attached version.
+	/*
+	 * Sets a block identifier to the InvalidBlockNumber to indicate such an
+	 * update being moved tuple to another partition.
+	 */
+	if (row_moved)
+		BlockIdSet(&((tp.t_data->t_ctid).ip_blkid), InvalidBlockNumber);
The parens here are set in a bit weird way. I assume that's from copying
it out of ItemPointerSet()? Why aren't you just using ItemPointerSetBlockNumber()?
I think it'd be better if we followed the example of speculative inserts
and created an equivalent of HeapTupleHeaderSetSpeculativeToken. That'd
be a heck of a lot easier to grep for...
Added HeapTupleHeaderValidBlockNumber, HeapTupleHeaderSetBlockNumber and
ItemPointerValidBlockNumber macros, but they are not exactly the same as
HeapTupleHeaderSetSpeculativeToken. Do let me know your thoughts/suggestions.
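For reference, the new macros as they appear in the attached patch:

/* true iff the ctid's block number is valid, i.e. the tuple was not moved away */
#define ItemPointerValidBlockNumber(pointer) \
	((bool) (BlockNumberIsValid(ItemPointerGetBlockNumberNoCheck(pointer))))

#define HeapTupleHeaderSetBlockNumber(tup, blkno) \
	ItemPointerSetBlockNumber(&(tup)->t_ctid, blkno)

#define HeapTupleHeaderValidBlockNumber(tup) \
	ItemPointerValidBlockNumber(&(tup)->t_ctid)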
@@ -9314,6 +9323,18 @@ heap_mask(char *pagedata, BlockNumber blkno)
 		 */
 		if (HeapTupleHeaderIsSpeculative(page_htup))
 			ItemPointerSet(&page_htup->t_ctid, blkno, off);
+
+		/*
+		 * For a deleted tuple, a block identifier is set to the
I think this 'the' is superfluous.
Fixed in the attached version.
+		 * InvalidBlockNumber to indicate that the tuple has been moved to
+		 * another partition due to an update of partition key.
But I think it should be 'the partition key'.
Fixed in the attached version.
+		 * Like speculative tuple, to ignore any inconsistency set block
+		 * identifier to current block number.
This doesn't quite parse.
I tried to explain it a little more in the comment; any help or suggestions to
improve it further would be appreciated.
+		 */
+		if (!BlockNumberIsValid(
+			BlockIdGetBlockNumber((&((page_htup->t_ctid).ip_blkid)))))
+			BlockIdSet(&((page_htup->t_ctid).ip_blkid), blkno);
 	}
That formatting looks wrong. I think it should be replaced by a macro
like mentioned above.
Used HeapTupleHeaderValidBlockNumber & HeapTupleHeaderSetBlockNumber
macro in the attached version.
 /*
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 160d941c00..a770531e14 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -3071,6 +3071,11 @@ ltrmark:;
 				ereport(ERROR,
 						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 						 errmsg("could not serialize access due to concurrent update")));
+			if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
Yes, given that we repeat this in multiple places, I *definitely* want
to see this wrapped in a macro with a descriptive name.
Used the ItemPointerValidBlockNumber macro in all such places.
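The wrapped check now reads the same at each of those call sites; for example, in GetTupleForTrigger (from the attached patch):

	if (!ItemPointerValidBlockNumber(&hufd.ctid))
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("tuple to be locked was already moved to another partition due to concurrent update")));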
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 7961b4be6a..b07b7092de 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -218,6 +218,11 @@ lnext:
 			ereport(ERROR,
 					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 					 errmsg("could not serialize access due to concurrent update")));
+			if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
Why are we using ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE rather than
ERRCODE_T_R_SERIALIZATION_FAILURE? A lot of frameworks have builtin
logic to retry serialization failures, and this kind of thing is going
to be resolved by retrying, no?
No change for now; any comments on Amit's response[1]?
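(For illustration of the retry concern only, not part of the patch: a hypothetical libpq-based client loop of the kind Andres describes, which retries serialization failures, SQLSTATE 40001, but would simply surface the 55000 error raised here.)

/* hypothetical retry helper, illustration only */
#include <stdbool.h>
#include <string.h>
#include <libpq-fe.h>

static int
exec_with_retry(PGconn *conn, const char *sql, int max_attempts)
{
	for (int attempt = 0; attempt < max_attempts; attempt++)
	{
		PGresult   *res = PQexec(conn, sql);
		ExecStatusType status = PQresultStatus(res);

		if (status == PGRES_COMMAND_OK || status == PGRES_TUPLES_OK)
		{
			PQclear(res);
			return 0;			/* success */
		}

		/* retry only serialization failures (SQLSTATE 40001) */
		const char *sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
		bool		retryable = (sqlstate != NULL &&
								 strcmp(sqlstate, "40001") == 0);

		PQclear(res);
		if (!retryable)
			return -1;			/* e.g. 55000: reported to the caller, not retried */
	}
	return -1;					/* gave up after max_attempts */
}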
diff --git a/src/test/isolation/expected/partition-key-update-1.out b/src/test/isolation/expected/partition-key-update-1.out
new file mode 100644
I'd like to see tests that show various interactions with ON CONFLICT.
I've added an isolation test for the ON CONFLICT DO NOTHING case only; ON CONFLICT DO
UPDATE is not yet supported for a partitioned table[2]. But we can still exercise
ON CONFLICT DO UPDATE together with update row movement by running it against the
leaf tables, as in the attached TRIAL-on-conflict-do-update-wip.patch, thoughts?
In addition, I have added the invalid block number check at a few places, as
discussed in [3]. I also added checks in heap_lock_updated_tuple,
rewrite_heap_tuple & EvalPlanQualFetch, where ItemPointerEquals() is used
to conclude that a tuple has been updated/deleted, but I have yet to figure out
how to hit these changes manually, so I am marking the patch as WIP.
Regards,
Amul
1] /messages/by-id/CAA4eK1KFfm4PBbshNSikdru3Qpt8hUoKQWtBYjdVE2R7U9f6iA@mail.gmail.com
2] /messages/by-id/20180228004602.cwdyralmg5ejdqkq@alvherre.pgsql
3] /messages/by-id/CAAJ_b97BBkRWFowGRs9VNzFykoK0ikGB1yYEsWfYK8xR5enSrw@mail.gmail.com
Attachments:
0001-Invalidate-ip_blkid-v6-wip.patch (application/octet-stream)
From d1ce6eb1342590340bb7eab99dd7ac91697a684f Mon Sep 17 00:00:00 2001
From: Amul Sul <sulamul@gmail.com>
Date: Tue, 13 Feb 2018 12:37:52 +0530
Subject: [PATCH 1/2] Invalidate ip_blkid v6-wip
v6-wip: Update w.r.t Andres Freund review comments[8]
- Added HeapTupleHeaderValidBlockNumber, HeapTupleHeaderSetBlockNumber
and ItemPointerValidBlockNumber macro.
- Fixed comments as per Andres' suggestions.
- Also added invalid block number check in heap_get_latest_tid as
discussed in the thread[9], similar change made in heap_lock_updated_tuple_rec.
- In heap_lock_updated_tuple, rewrite_heap_tuple & EvalPlanQualFetch,
I've added check for invalid block number where ItemPointerEquals used
to conclude tuple has been updated/deleted.
TODO:
1. Yet to test changes made in the heap_lock_updated_tuple,
rewrite_heap_tuple & EvalPlanQualFetch function are valid or not.
Also, address TODO tags in the same function.
2. Also check there are other places where similar changes like
above needed (wrt Pavan Deolasee's[10] concern)
v5:
- Added code in heap_mask to skip wal_consistency_checking[7]
- Fixed previous todos.
v5-wip2:
- Minor changes in RelationFindReplTupleByIndex() and
RelationFindReplTupleSeq()
- TODO;
Same as the previous
v5-wip: Update w.r.t Amit Kapila's comments[6].
- Reverted error message in nodeModifyTable.c from 'tuple to be locked'
to 'tuple to be updated'.
- TODO:
1. Yet to make a decision on having LOG/ELOG/ASSERT in the
RelationFindReplTupleByIndex() and RelationFindReplTupleSeq().
v4: Rebased on "UPDATE of partition key v35" patch[5].
v3: Update w.r.t Amit Kapila's[3] & Alvaro Herrera[4] comments
- typo in all error message and comment : "to an another" -> "to another"
- error message change : "tuple to be updated" -> "tuple to be locked"
- In ExecOnConflictUpdate(), error report converted into assert &
comments added.
v2: Updated w.r.t Robert review comments[2]
- Updated couple of comment of heap_delete argument and ItemPointerData
- Added same concurrent update error logic in ExecOnConflictUpdate,
RelationFindReplTupleByIndex and RelationFindReplTupleSeq
v1: Initial version -- as per Amit Kapila's suggestions[1]
- When tuple is being moved to another partition then ip_blkid in the
tuple header mark to InvalidBlockNumber.
-------------
References:
-------------
1] https://postgr.es/m/CAA4eK1KEZQ%2BCyXbBzfn1jFHoEfa_OemDLhLyy7xfD1QUZLo1DQ%40mail.gmail.com
2] https://postgr.es/m/CA%2BTgmoYY98AEjh7RDtuzaLC--_0smCozXRu6bFmZTaX5Ne%3DB5Q%40mail.gmail.com
3] https://postgr.es/m/CAA4eK1LQS6TmsGaEwR9HgF-9TZTHxrdAELuX6wOZBDbbjOfDjQ@mail.gmail.com
4] https://postgr.es/m/20171124160756.eyljpmpfzwd6jmnr@alvherre.pgsql
5] https://postgr.es/m/CAJ3gD9dixkkMzNnnP1CaZ1H17-U17ok_sVbjZZo+wnB=rJH6yg@mail.gmail.com
6] https://postgr.es/m/CAA4eK1LHVnNWYF53F1gUGx6CTxuvznozvU-Lr-dfE=Qeu1gEcg@mail.gmail.com
7] https://postgr.es/m/CAAJ_b94_29wiUA83W8LQjtfjv9XNV=+PT8+ioWRPjnnFHe3eqw@mail.gmail.com
8] https://postgr.es/m/20180305232353.gpue7jldnm4bjf4i@alap3.anarazel.de
9] https://postgr.es/m/CAAJ_b97BBkRWFowGRs9VNzFykoK0ikGB1yYEsWfYK8xR5enSrw@mail.gmail.com
10] https://postgr.es/m/CABOikdPXwqkLGgTZZm2qYwTn4L69V36rCh55fFma1fAYbon7Vg@mail.gmail.com
---
src/backend/access/heap/heapam.c | 40 ++++++++++++++++++++++++++++++----
src/backend/access/heap/rewriteheap.c | 5 ++++-
src/backend/commands/trigger.c | 5 +++++
src/backend/executor/execMain.c | 9 +++++++-
src/backend/executor/execReplication.c | 26 +++++++++++++++-------
src/backend/executor/nodeLockRows.c | 5 +++++
src/backend/executor/nodeModifyTable.c | 28 ++++++++++++++++++++----
src/include/access/heapam.h | 2 +-
src/include/access/htup_details.h | 6 +++++
src/include/storage/itemptr.h | 11 +++++++++-
10 files changed, 117 insertions(+), 20 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c08ab14c02..e6b02ce984 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2304,7 +2304,8 @@ heap_get_latest_tid(Relation relation,
*/
if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
- ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
+ ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid) ||
+ !HeapTupleHeaderValidBlockNumber(tp.t_data))
{
UnlockReleaseBuffer(buffer);
break;
@@ -2314,6 +2315,9 @@ heap_get_latest_tid(Relation relation,
priorXmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
UnlockReleaseBuffer(buffer);
} /* end of loop */
+
+ /* Make sure that return value has a valid block number */
+ Assert(ItemPointerValidBlockNumber(tid));
}
@@ -3037,6 +3041,8 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
* crosscheck - if not InvalidSnapshot, also check tuple against this
* wait - true if should wait for any conflicting update to commit/abort
* hufd - output parameter, filled in failure cases (see below)
+ * changing_part - true iff the tuple is being moved to another partition
+ * table due to an update of partition key. Otherwise, false.
*
* Normal, successful return value is HeapTupleMayBeUpdated, which
* actually means we did delete it. Failure return codes are
@@ -3052,7 +3058,7 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
HTSU_Result
heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd)
+ HeapUpdateFailureData *hufd, bool changing_part)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
@@ -3320,6 +3326,13 @@ l1:
/* Make sure there is no forward chain link in t_ctid */
tp.t_data->t_ctid = tp.t_self;
+ /*
+ * Sets a block identifier to the InvalidBlockNumber to indicate such an
+ * update being moved tuple to another partition.
+ */
+ if (changing_part)
+ HeapTupleHeaderSetBlockNumber(tp.t_data, InvalidBlockNumber);
+
MarkBufferDirty(buffer);
/*
@@ -3445,7 +3458,7 @@ simple_heap_delete(Relation relation, ItemPointer tid)
result = heap_delete(relation, tid,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &hufd);
+ &hufd, false);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -5957,6 +5970,7 @@ next:
/* if we find the end of update chain, we're done. */
if (mytup.t_data->t_infomask & HEAP_XMAX_INVALID ||
ItemPointerEquals(&mytup.t_self, &mytup.t_data->t_ctid) ||
+ !HeapTupleHeaderValidBlockNumber(mytup.t_data) ||
HeapTupleHeaderIsOnlyLocked(mytup.t_data))
{
result = HeapTupleMayBeUpdated;
@@ -6007,7 +6021,10 @@ static HTSU_Result
heap_lock_updated_tuple(Relation rel, HeapTuple tuple, ItemPointer ctid,
TransactionId xid, LockTupleMode mode)
{
- if (!ItemPointerEquals(&tuple->t_self, ctid))
+ if (!(ItemPointerEquals(&tuple->t_self, ctid) ||
+ (!ItemPointerValidBlockNumber(ctid) &&
+ (ItemPointerGetOffsetNumber(&tuple->t_self) == /* TODO: Condn. should be macro? */
+ ItemPointerGetOffsetNumber(ctid)))))
{
/*
* If this is the first possibly-multixact-able operation in the
@@ -9322,6 +9339,21 @@ heap_mask(char *pagedata, BlockNumber blkno)
*/
if (HeapTupleHeaderIsSpeculative(page_htup))
ItemPointerSet(&page_htup->t_ctid, blkno, off);
+
+ /*
+ * For a deleted tuple, a block identifier is set to
+ * InvalidBlockNumber to indicate that the tuple has been moved to
+ * another partition due to an update of the partition key.
+ *
+ * During redo, heap_xlog_delete sets t_ctid to current block
+ * number and self offset number. It doesn't verify the tuple is
+ * deleted by usual delete/update or deleted by the update of the
+ * partition key on the master. Hence, like speculative tuple, to
+ * ignore any inconsistency set block identifier to current block
+ * number.
+ */
+ if (!HeapTupleHeaderValidBlockNumber(page_htup))
+ HeapTupleHeaderSetBlockNumber(page_htup, blkno);
}
/*
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 7d466c2588..96c07d9de9 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -425,7 +425,10 @@ rewrite_heap_tuple(RewriteState state,
if (!((old_tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(old_tuple->t_data)) &&
!(ItemPointerEquals(&(old_tuple->t_self),
- &(old_tuple->t_data->t_ctid))))
+ &(old_tuple->t_data->t_ctid)) ||
+ (!HeapTupleHeaderValidBlockNumber(old_tuple->t_data) &&
+ ItemPointerGetOffsetNumber(&(old_tuple->t_self)) == /* TODO: Condn. should be macro? */
+ ItemPointerGetOffsetNumber(&(old_tuple->t_data->t_ctid)))))
{
OldToNewMapping mapping;
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index fbd176b5d0..93c1f2a51f 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -3071,6 +3071,11 @@ ltrmark:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!ItemPointerValidBlockNumber(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
if (!ItemPointerEquals(&hufd.ctid, &tuple.t_self))
{
/* it was updated, so look at the updated version */
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 91ba939bdc..49b2b96fbd 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2712,6 +2712,10 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!ItemPointerValidBlockNumber(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
/* Should not encounter speculative tuple on recheck */
Assert(!HeapTupleHeaderIsSpeculative(tuple.t_data));
@@ -2780,7 +2784,10 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
* As above, it should be safe to examine xmax and t_ctid without the
* buffer content lock, because they can't be changing.
*/
- if (ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
+ if (ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid) ||
+ (!HeapTupleHeaderValidBlockNumber(tuple.t_data) &&
+ ItemPointerGetOffsetNumber(&tuple.t_self) == /* TODO: Condn. should be macro? */
+ ItemPointerGetOffsetNumber(&tuple.t_data->t_ctid)))
{
/* deleted, so forget about it */
ReleaseBuffer(buffer);
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 32891abbdf..8430420de7 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -190,10 +190,15 @@ retry:
case HeapTupleMayBeUpdated:
break;
case HeapTupleUpdated:
- /* XXX: Improve handling here */
- ereport(LOG,
- (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
- errmsg("concurrent update, retrying")));
+ if (!ItemPointerValidBlockNumber(&hufd.ctid))
+ ereport(LOG,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update, retrying")));
+ else
+ /* XXX: Improve handling here */
+ ereport(LOG,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("concurrent update, retrying")));
goto retry;
case HeapTupleInvisible:
elog(ERROR, "attempted to lock invisible tuple");
@@ -348,10 +353,15 @@ retry:
case HeapTupleMayBeUpdated:
break;
case HeapTupleUpdated:
- /* XXX: Improve handling here */
- ereport(LOG,
- (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
- errmsg("concurrent update, retrying")));
+ if (!ItemPointerValidBlockNumber(&hufd.ctid))
+ ereport(LOG,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update, retrying")));
+ else
+ /* XXX: Improve handling here */
+ ereport(LOG,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("concurrent update, retrying")));
goto retry;
case HeapTupleInvisible:
elog(ERROR, "attempted to lock invisible tuple");
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index b39ccf7dc1..dd4d5f25ca 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -218,6 +218,11 @@ lnext:
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!ItemPointerValidBlockNumber(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
if (ItemPointerEquals(&hufd.ctid, &tuple.t_self))
{
/* Tuple was deleted, so don't return it */
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index c32928d9bd..a6b9133eba 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -719,7 +719,8 @@ ExecDelete(ModifyTableState *mtstate,
EState *estate,
bool *tupleDeleted,
bool processReturning,
- bool canSetTag)
+ bool canSetTag,
+ bool changing_part)
{
ResultRelInfo *resultRelInfo;
Relation resultRelationDesc;
@@ -810,7 +811,8 @@ ldelete:;
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
- &hufd);
+ &hufd,
+ changing_part);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -856,6 +858,11 @@ ldelete:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!ItemPointerValidBlockNumber(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to another partition due to concurrent update")));
+
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
TupleTableSlot *epqslot;
@@ -1158,7 +1165,7 @@ lreplace:;
* processing. We want to return rows from INSERT.
*/
ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
- &tuple_deleted, false, false);
+ &tuple_deleted, false, false, true);
/*
* For some reason if DELETE didn't happen (e.g. trigger prevented
@@ -1303,6 +1310,11 @@ lreplace:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!ItemPointerValidBlockNumber(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to another partition due to concurrent update")));
+
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
TupleTableSlot *epqslot;
@@ -1473,6 +1485,14 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ /*
+ * As long as we don't support an UPDATE of INSERT ON CONFLICT for
+ * a partitioned table we shouldn't reach to a case where tuple to
+ * be lock is moved to another partition due to concurrent update
+ * of partition key.
+ */
+ Assert(ItemPointerValidBlockNumber(&hufd.ctid));
+
/*
* Tell caller to try again from the very start.
*
@@ -2062,7 +2082,7 @@ ExecModifyTable(PlanState *pstate)
case CMD_DELETE:
slot = ExecDelete(node, tupleid, oldtuple, planSlot,
&node->mt_epqstate, estate,
- NULL, true, node->canSetTag);
+ NULL, true, node->canSetTag, false);
break;
default:
elog(ERROR, "unknown operation");
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4c0256b18a..e8da83c303 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -156,7 +156,7 @@ extern void heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
CommandId cid, int options, BulkInsertState bistate);
extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd);
+ HeapUpdateFailureData *hufd, bool changing_part);
extern void heap_finish_speculative(Relation relation, HeapTuple tuple);
extern void heap_abort_speculative(Relation relation, HeapTuple tuple);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 2ab1815390..12df36c70b 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -441,6 +441,12 @@ do { \
ItemPointerSet(&(tup)->t_ctid, token, SpecTokenOffsetNumber) \
)
+#define HeapTupleHeaderSetBlockNumber(tup, blkno) \
+ ItemPointerSetBlockNumber(&(tup)->t_ctid, blkno)
+
+#define HeapTupleHeaderValidBlockNumber(tup) \
+ ItemPointerValidBlockNumber(&(tup)->t_ctid)
+
#define HeapTupleHeaderGetDatumLength(tup) \
VARSIZE(tup)
diff --git a/src/include/storage/itemptr.h b/src/include/storage/itemptr.h
index 6c9ed3696b..49a4509561 100644
--- a/src/include/storage/itemptr.h
+++ b/src/include/storage/itemptr.h
@@ -23,7 +23,9 @@
* This is a pointer to an item within a disk page of a known file
* (for example, a cross-link from an index to its parent table).
* blkid tells us which block, posid tells us which entry in the linp
- * (ItemIdData) array we want.
+ * (ItemIdData) array we want. blkid is marked InvalidBlockNumber when
+ * a tuple is moved to another partition relation due to an update of
+ * partition key.
*
* Note: because there is an item pointer in each tuple header and index
* tuple header on disk, it's very important not to waste space with
@@ -60,6 +62,13 @@ typedef ItemPointerData *ItemPointer;
#define ItemPointerIsValid(pointer) \
((bool) (PointerIsValid(pointer) && ((pointer)->ip_posid != 0)))
+/*
+ * ItemPointerIsValid
+ * True iff the block number of the item pointer is valid.
+ */
+#define ItemPointerValidBlockNumber(pointer) \
+ ((bool) (BlockNumberIsValid(ItemPointerGetBlockNumberNoCheck(pointer))))
+
/*
* ItemPointerGetBlockNumberNoCheck
* Returns the block number of a disk item pointer.
--
2.14.1
0002-isolation-tests-v5.patch (application/octet-stream)
From 0e11b2c2e72df0009ca7d3bd8c6f7e1a05e1773d Mon Sep 17 00:00:00 2001
From: Amul Sul <sulamul@gmail.com>
Date: Tue, 13 Feb 2018 12:37:33 +0530
Subject: [PATCH 2/2] isolation tests v5
v5:
- As per Andres Freund suggestion[4], added test for ON CONFLICT DO
NOTHING
- TODO:
1. Cannot add ON CONFLICT DO UPDATE test since it's not supported
for partitioned table, may be after proposed patch[5]
v4:
- Rebased on Invalidate ip_blkid v5.
v3:
- Rebase on "UPDATE of partition key v35" patch[2] and
latest maste head[3].
v2:
- Error message changed.
- Can't add isolation test[1] for
RelationFindReplTupleByIndex & RelationFindReplTupleSeq
- In ExecOnConflictUpdate, the error report is converted to assert
check.
v1:
Added isolation tests to hit an error in the following functions:
1. ExecUpdate -> specs/partition-key-update-1
2. ExecDelete -> specs/partition-key-update-1
3. GetTupleForTrigger -> specs/partition-key-update-2
4. ExecLockRows -> specs/partition-key-update-3
------------
References:
------------
1] https://postgr.es/m/CA+TgmoYsMRo2PHFTGUFifv4ZSCZ9LNJASbOyb=9it2=UA4j4vw@mail.gmail.com
2] https://postgr.es/m/CAJ3gD9dixkkMzNnnP1CaZ1H17-U17ok_sVbjZZo+wnB=rJH6yg@mail.gmail.com
3] Commit id bdb70c12b3a2e69eec6e51411df60d9f43ecc841
4] https://postgr.es/m/20180305232353.gpue7jldnm4bjf4i@alap3.anarazel.de
5] https://postgr.es/m/20180228004602.cwdyralmg5ejdqkq@alvherre.pgsql
---
.../isolation/expected/partition-key-update-1.out | 35 ++++++
.../isolation/expected/partition-key-update-2.out | 18 +++
.../isolation/expected/partition-key-update-3.out | 8 ++
.../isolation/expected/partition-key-update-4.out | 29 +++++
.../isolation/expected/partition-key-update-5.out | 139 +++++++++++++++++++++
src/test/isolation/isolation_schedule | 5 +
.../isolation/specs/partition-key-update-1.spec | 37 ++++++
.../isolation/specs/partition-key-update-2.spec | 39 ++++++
.../isolation/specs/partition-key-update-3.spec | 30 +++++
.../isolation/specs/partition-key-update-4.spec | 45 +++++++
.../isolation/specs/partition-key-update-5.spec | 44 +++++++
11 files changed, 429 insertions(+)
create mode 100644 src/test/isolation/expected/partition-key-update-1.out
create mode 100644 src/test/isolation/expected/partition-key-update-2.out
create mode 100644 src/test/isolation/expected/partition-key-update-3.out
create mode 100644 src/test/isolation/expected/partition-key-update-4.out
create mode 100644 src/test/isolation/expected/partition-key-update-5.out
create mode 100644 src/test/isolation/specs/partition-key-update-1.spec
create mode 100644 src/test/isolation/specs/partition-key-update-2.spec
create mode 100644 src/test/isolation/specs/partition-key-update-3.spec
create mode 100644 src/test/isolation/specs/partition-key-update-4.spec
create mode 100644 src/test/isolation/specs/partition-key-update-5.spec
diff --git a/src/test/isolation/expected/partition-key-update-1.out b/src/test/isolation/expected/partition-key-update-1.out
new file mode 100644
index 0000000000..56bf4450b0
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-1.out
@@ -0,0 +1,35 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u s1c s2u
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+step s2u: UPDATE foo SET b='EFG' WHERE a=1;
+
+starting permutation: s1u s2u s1c
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s2u: UPDATE foo SET b='EFG' WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2u: <... completed>
+error in steps s1c s2u: ERROR: tuple to be updated was already moved to another partition due to concurrent update
+
+starting permutation: s2u s1u s1c
+step s2u: UPDATE foo SET b='EFG' WHERE a=1;
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+
+starting permutation: s1u s1c s2d
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+step s2d: DELETE FROM foo WHERE a=1;
+
+starting permutation: s1u s2d s1c
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s2d: DELETE FROM foo WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2d: <... completed>
+error in steps s1c s2d: ERROR: tuple to be updated was already moved to another partition due to concurrent update
+
+starting permutation: s2d s1u s1c
+step s2d: DELETE FROM foo WHERE a=1;
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-2.out b/src/test/isolation/expected/partition-key-update-2.out
new file mode 100644
index 0000000000..195ec4cedf
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-2.out
@@ -0,0 +1,18 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u s1c s2u
+step s1u: UPDATE foo SET b='EFG' WHERE a=1;
+step s1c: COMMIT;
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1;
+
+starting permutation: s1u s2u s1c
+step s1u: UPDATE foo SET b='EFG' WHERE a=1;
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2u: <... completed>
+error in steps s1c s2u: ERROR: tuple to be locked was already moved to another partition due to concurrent update
+
+starting permutation: s2u s1u s1c
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1;
+step s1u: UPDATE foo SET b='EFG' WHERE a=1;
+step s1c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-3.out b/src/test/isolation/expected/partition-key-update-3.out
new file mode 100644
index 0000000000..1922bdce46
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-3.out
@@ -0,0 +1,8 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u3 s2i s1c
+step s1u3: UPDATE foo_r SET a=11 WHERE a=7 AND b = 'ABC';
+step s2i: INSERT INTO bar VALUES(7); <waiting ...>
+step s1c: COMMIT;
+step s2i: <... completed>
+error in steps s1c s2i: ERROR: tuple to be locked was already moved to another partition due to concurrent update
diff --git a/src/test/isolation/expected/partition-key-update-4.out b/src/test/isolation/expected/partition-key-update-4.out
new file mode 100644
index 0000000000..363de0d69c
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-4.out
@@ -0,0 +1,29 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s1u s2donothing s3donothing s1c s2c s3select s3c
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s3donothing: <... completed>
+step s2c: COMMIT;
+step s3select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+step s3c: COMMIT;
+
+starting permutation: s2donothing s1u s3donothing s1c s2c s3select s3c
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+step s2c: COMMIT;
+step s3select: SELECT * FROM foo ORDER BY a;
+a b
+
+2 initial tuple -> moved by session-1
+step s3c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-5.out b/src/test/isolation/expected/partition-key-update-5.out
new file mode 100644
index 0000000000..42dfe64ad3
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-5.out
@@ -0,0 +1,139 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s2beginrr s3beginrr s1u s2donothing s1c s2c s3donothing s3c s2select
+step s2beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s3beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s2c: COMMIT;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING;
+step s3c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2beginrr s3beginrr s1u s3donothing s1c s3c s2donothing s2c s2select
+step s2beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s3beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+error in steps s1c s3donothing: ERROR: could not serialize access due to concurrent update
+step s3c: COMMIT;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING;
+step s2c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2beginrr s3beginrr s1u s2donothing s3donothing s1c s2c s3c s2select
+step s2beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s3beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s3donothing: <... completed>
+error in steps s1c s2donothing s3donothing: ERROR: could not serialize access due to concurrent update
+step s2c: COMMIT;
+step s3c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2beginrr s3beginrr s1u s3donothing s2donothing s1c s3c s2c s2select
+step s2beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s3beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+step s2donothing: <... completed>
+error in steps s1c s3donothing s2donothing: ERROR: could not serialize access due to concurrent update
+step s3c: COMMIT;
+step s2c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2begins s3begins s1u s2donothing s1c s2c s3donothing s3c s2select
+step s2begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s3begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s2c: COMMIT;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING;
+step s3c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2begins s3begins s1u s3donothing s1c s3c s2donothing s2c s2select
+step s2begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s3begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+error in steps s1c s3donothing: ERROR: could not serialize access due to concurrent update
+step s3c: COMMIT;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING;
+step s2c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2begins s3begins s1u s2donothing s3donothing s1c s2c s3c s2select
+step s2begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s3begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s3donothing: <... completed>
+error in steps s1c s2donothing s3donothing: ERROR: could not serialize access due to concurrent update
+step s2c: COMMIT;
+step s3c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2begins s3begins s1u s3donothing s2donothing s1c s3c s2c s2select
+step s2begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s3begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+step s2donothing: <... completed>
+error in steps s1c s3donothing s2donothing: ERROR: could not serialize access due to concurrent update
+step s3c: COMMIT;
+step s2c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 74d7d59546..26f88c50b6 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -66,3 +66,8 @@ test: async-notify
test: vacuum-reltuples
test: timeouts
test: vacuum-concurrent-drop
+test: partition-key-update-1
+test: partition-key-update-2
+test: partition-key-update-3
+test: partition-key-update-4
+test: partition-key-update-5
diff --git a/src/test/isolation/specs/partition-key-update-1.spec b/src/test/isolation/specs/partition-key-update-1.spec
new file mode 100644
index 0000000000..db76c9a9b5
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-1.spec
@@ -0,0 +1,37 @@
+# Concurrency error from ExecUpdate and ExecDelete.
+
+# Throw an error to indicate that the targeted row has been already moved to
+# another partition in the case of concurrency where a session trying to
+# update/delete a row that's locked for a concurrent update by the another
+# session cause tuple movement to the another partition due update of partition
+# key.
+
+setup
+{
+ CREATE TABLE foo (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'ABC');
+}
+
+teardown
+{
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN; }
+step "s1u" { UPDATE foo SET a=2 WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2u" { UPDATE foo SET b='EFG' WHERE a=1; }
+step "s2d" { DELETE FROM foo WHERE a=1; }
+
+permutation "s1u" "s1c" "s2u"
+permutation "s1u" "s2u" "s1c"
+permutation "s2u" "s1u" "s1c"
+
+permutation "s1u" "s1c" "s2d"
+permutation "s1u" "s2d" "s1c"
+permutation "s2d" "s1u" "s1c"
diff --git a/src/test/isolation/specs/partition-key-update-2.spec b/src/test/isolation/specs/partition-key-update-2.spec
new file mode 100644
index 0000000000..b09e76ce21
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-2.spec
@@ -0,0 +1,39 @@
+# Concurrency error from GetTupleForTrigger
+
+# Like partition-key-update-1.spec, throw an error where a session trying to
+# update a row that has been moved to another partition due to a concurrent
+# update by other seesion.
+
+setup
+{
+ CREATE TABLE foo (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'ABC');
+ CREATE FUNCTION func_foo_mod_a() RETURNS TRIGGER AS $$
+ BEGIN
+ NEW.a = 2; -- This is changing partition key column.
+ RETURN NEW;
+ END $$ LANGUAGE PLPGSQL;
+ CREATE TRIGGER foo_mod_a BEFORE UPDATE ON foo1
+ FOR EACH ROW EXECUTE PROCEDURE func_foo_mod_a();
+}
+
+teardown
+{
+ DROP TRIGGER foo_mod_a ON foo1;
+ DROP FUNCTION func_foo_mod_a();
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN; }
+step "s1u" { UPDATE foo SET b='EFG' WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2u" { UPDATE foo SET b='XYZ' WHERE a=1; }
+
+permutation "s1u" "s1c" "s2u"
+permutation "s1u" "s2u" "s1c"
+permutation "s2u" "s1u" "s1c"
diff --git a/src/test/isolation/specs/partition-key-update-3.spec b/src/test/isolation/specs/partition-key-update-3.spec
new file mode 100644
index 0000000000..c1f547d9ba
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-3.spec
@@ -0,0 +1,30 @@
+# Concurrency error from ExecLockRows
+
+# Like partition-key-update-1.spec, throw an error where a session trying to
+# lock a row that has been moved to another partition due to a concurrent
+# update by other seesion.
+
+setup
+{
+ CREATE TABLE foo_r (a int, b text) PARTITION BY RANGE(a);
+ CREATE TABLE foo_r1 PARTITION OF foo_r FOR VALUES FROM (1) TO (10);
+ CREATE TABLE foo_r2 PARTITION OF foo_r FOR VALUES FROM (10) TO (20);
+ INSERT INTO foo_r VALUES(7, 'ABC');
+ CREATE UNIQUE INDEX foo_r1_a_unique ON foo_r1 (a);
+ CREATE TABLE bar (a int REFERENCES foo_r1(a));
+}
+
+teardown
+{
+ DROP TABLE bar, foo_r;
+}
+
+session "s1"
+setup { BEGIN; }
+step "s1u3" { UPDATE foo_r SET a=11 WHERE a=7 AND b = 'ABC'; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2i" { INSERT INTO bar VALUES(7); }
+
+permutation "s1u3" "s2i" "s1c"
diff --git a/src/test/isolation/specs/partition-key-update-4.spec b/src/test/isolation/specs/partition-key-update-4.spec
new file mode 100644
index 0000000000..48ebe6c7d5
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-4.spec
@@ -0,0 +1,45 @@
+# Concurrent update of a partition key and INSERT...ON CONFLICT DO NOTHING test
+#
+# This test tries to expose problems with the interaction between concurrent
+# sessions during an update of the partition key and INSERT...ON CONFLICT DO
+# NOTHING on a partitioned table.
+#
+# The convention here is that session 1 moves row from one partition to
+# another due update of the partition key and session 2 always ends up
+# inserting, and session 3 always ends up doing nothing.
+#
+# Note: This test is slightly resemble to insert-conflict-do-nothing test.
+
+setup
+{
+ CREATE TABLE foo (a int primary key, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'initial tuple');
+}
+
+teardown
+{
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN; }
+step "s1u" { UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s2donothing" { INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; }
+step "s2c" { COMMIT; }
+
+session "s3"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s3donothing" { INSERT INTO foo VALUES(2, 'session-3 donothing') ON CONFLICT DO NOTHING; }
+step "s3select" { SELECT * FROM foo ORDER BY a; }
+step "s3c" { COMMIT; }
+
+# Regular case where one session block-waits on another to determine if it
+# should proceed with an insert or do nothing.
+permutation "s1u" "s2donothing" "s3donothing" "s1c" "s2c" "s3select" "s3c"
+permutation "s2donothing" "s1u" "s3donothing" "s1c" "s2c" "s3select" "s3c"
diff --git a/src/test/isolation/specs/partition-key-update-5.spec b/src/test/isolation/specs/partition-key-update-5.spec
new file mode 100644
index 0000000000..7f9a06913a
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-5.spec
@@ -0,0 +1,44 @@
+# Concurrent update of a partition key and INSERT...ON CONFLICT DO NOTHING
+# test on partitioned table with multiple rows in higher isolation levels.
+#
+# Note: This test is resemble to insert-conflict-do-nothing-2 test
+
+setup
+{
+ CREATE TABLE foo (a int primary key, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'initial tuple');
+}
+
+teardown
+{
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN; }
+step "s1u" { UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2beginrr" { BEGIN ISOLATION LEVEL REPEATABLE READ; }
+step "s2begins" { BEGIN ISOLATION LEVEL SERIALIZABLE; }
+step "s2donothing" { INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; }
+step "s2c" { COMMIT; }
+step "s2select" { SELECT * FROM foo ORDER BY a; }
+
+session "s3"
+step "s3beginrr" { BEGIN ISOLATION LEVEL REPEATABLE READ; }
+step "s3begins" { BEGIN ISOLATION LEVEL SERIALIZABLE; }
+step "s3donothing" { INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; }
+step "s3c" { COMMIT; }
+
+permutation "s2beginrr" "s3beginrr" "s1u" "s2donothing" "s1c" "s2c" "s3donothing" "s3c" "s2select"
+permutation "s2beginrr" "s3beginrr" "s1u" "s3donothing" "s1c" "s3c" "s2donothing" "s2c" "s2select"
+permutation "s2beginrr" "s3beginrr" "s1u" "s2donothing" "s3donothing" "s1c" "s2c" "s3c" "s2select"
+permutation "s2beginrr" "s3beginrr" "s1u" "s3donothing" "s2donothing" "s1c" "s3c" "s2c" "s2select"
+permutation "s2begins" "s3begins" "s1u" "s2donothing" "s1c" "s2c" "s3donothing" "s3c" "s2select"
+permutation "s2begins" "s3begins" "s1u" "s3donothing" "s1c" "s3c" "s2donothing" "s2c" "s2select"
+permutation "s2begins" "s3begins" "s1u" "s2donothing" "s3donothing" "s1c" "s2c" "s3c" "s2select"
+permutation "s2begins" "s3begins" "s1u" "s3donothing" "s2donothing" "s1c" "s3c" "s2c" "s2select"
--
2.14.1
TRIAL-on-conflict-do-update-wip.patch (application/octet-stream)
From 36c4018ccae3be007bb1b5754d5e9eb59b2fe1bb Mon Sep 17 00:00:00 2001
From: Amul Sul <sulamul@gmail.com>
Date: Wed, 7 Mar 2018 15:31:41 +0530
Subject: [PATCH] Isolation trial on confict do update wip
---
.../isolation/expected/partition-key-update-6.out | 60 ++++++++++++++++++++++
.../isolation/specs/partition-key-update-6.spec | 44 ++++++++++++++++
2 files changed, 104 insertions(+)
create mode 100644 src/test/isolation/expected/partition-key-update-6.out
create mode 100644 src/test/isolation/specs/partition-key-update-6.spec
diff --git a/src/test/isolation/expected/partition-key-update-6.out b/src/test/isolation/expected/partition-key-update-6.out
new file mode 100644
index 0000000000..7d4c7282c1
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-6.out
@@ -0,0 +1,60 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s1u s2insert s3insert s1c s2c s3c s3select
+step s1u: UPDATE foo SET a=2, b=b ||' -> moved by session-1' WHERE a=1;
+step s2insert: INSERT INTO foo1 VALUES(1, 'session-2 insert') ON CONFLICT (a) DO UPDATE set b = foo1.b || ' -> updated by session-2 insert'; <waiting ...>
+step s3insert: INSERT INTO foo2 VALUES(2, 'session-3 insert') ON CONFLICT (a) DO UPDATE set b = foo2.b || ' -> updated by session-3 insert'; <waiting ...>
+step s1c: COMMIT;
+step s2insert: <... completed>
+step s3insert: <... completed>
+step s2c: COMMIT;
+step s3c: COMMIT;
+step s3select: SELECT * FROM foo;
+a b
+
+1 session-2 insert
+2 initial tuple -> moved by session-1 -> updated by session-3 insert
+
+starting permutation: s1u s3insert s2insert s1c s3c s2c s3select
+step s1u: UPDATE foo SET a=2, b=b ||' -> moved by session-1' WHERE a=1;
+step s3insert: INSERT INTO foo2 VALUES(2, 'session-3 insert') ON CONFLICT (a) DO UPDATE set b = foo2.b || ' -> updated by session-3 insert'; <waiting ...>
+step s2insert: INSERT INTO foo1 VALUES(1, 'session-2 insert') ON CONFLICT (a) DO UPDATE set b = foo1.b || ' -> updated by session-2 insert'; <waiting ...>
+step s1c: COMMIT;
+step s3insert: <... completed>
+step s2insert: <... completed>
+step s3c: COMMIT;
+step s2c: COMMIT;
+step s3select: SELECT * FROM foo;
+a b
+
+1 session-2 insert
+2 initial tuple -> moved by session-1 -> updated by session-3 insert
+
+starting permutation: s2insert s1u s2c s3insert s1c s3c s3select
+step s2insert: INSERT INTO foo1 VALUES(1, 'session-2 insert') ON CONFLICT (a) DO UPDATE set b = foo1.b || ' -> updated by session-2 insert';
+step s1u: UPDATE foo SET a=2, b=b ||' -> moved by session-1' WHERE a=1; <waiting ...>
+step s2c: COMMIT;
+step s1u: <... completed>
+step s3insert: INSERT INTO foo2 VALUES(2, 'session-3 insert') ON CONFLICT (a) DO UPDATE set b = foo2.b || ' -> updated by session-3 insert'; <waiting ...>
+step s1c: COMMIT;
+step s3insert: <... completed>
+step s3c: COMMIT;
+step s3select: SELECT * FROM foo;
+a b
+
+2 initial tuple -> moved by session-1 -> updated by session-3 insert
+
+starting permutation: s3insert s1u s3c s2insert s1c s2c s3select
+step s3insert: INSERT INTO foo2 VALUES(2, 'session-3 insert') ON CONFLICT (a) DO UPDATE set b = foo2.b || ' -> updated by session-3 insert';
+step s1u: UPDATE foo SET a=2, b=b ||' -> moved by session-1' WHERE a=1; <waiting ...>
+step s3c: COMMIT;
+step s1u: <... completed>
+error in steps s3c s1u: ERROR: duplicate key value violates unique constraint "foo2_pkey"
+step s2insert: INSERT INTO foo1 VALUES(1, 'session-2 insert') ON CONFLICT (a) DO UPDATE set b = foo1.b || ' -> updated by session-2 insert';
+step s1c: COMMIT;
+step s2c: COMMIT;
+step s3select: SELECT * FROM foo;
+a b
+
+1 initial tuple -> updated by session-2 insert
+2 session-3 insert
diff --git a/src/test/isolation/specs/partition-key-update-6.spec b/src/test/isolation/specs/partition-key-update-6.spec
new file mode 100644
index 0000000000..39ff344cfd
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-6.spec
@@ -0,0 +1,44 @@
+# INSERT...ON CONFLICT DO UPDATE test
+#
+# This test tries to expose problems with the interaction between concurrent
+# sessions.
+#
+# FIXME: Since ON CONFLICT DO UPDATE not supported for the partitioned table,
+# INSERT...ON CONFLICT DO UPDATE query executed for child relations.
+setup
+{
+ CREATE TABLE foo (a int primary key, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'initial tuple');
+}
+
+teardown
+{
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN; }
+step "s1u" { UPDATE foo SET a=2, b=b ||' -> moved by session-1' WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s2insert" { INSERT INTO foo1 VALUES(1, 'session-2 insert') ON CONFLICT (a) DO UPDATE set b = foo1.b || ' -> updated by session-2 insert'; }
+step "s2c" { COMMIT; }
+
+session "s3"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s3insert" { INSERT INTO foo2 VALUES(2, 'session-3 insert') ON CONFLICT (a) DO UPDATE set b = foo2.b || ' -> updated by session-3 insert'; }
+step "s3select" { SELECT * FROM foo; }
+step "s3c" { COMMIT; }
+
+# One session (session 2) block-waits on another (session 1) to determine if it
+# should proceed with an insert or update. Notably, this entails updating a
+# tuple while there is no version of that tuple visible to the updating
+# session's snapshot. This is permitted only in READ COMMITTED mode.
+permutation "s1u" "s2insert" "s3insert" "s1c" "s2c" "s3c" "s3select"
+permutation "s1u" "s3insert" "s2insert" "s1c" "s3c" "s2c" "s3select"
+permutation "s2insert" "s1u" "s2c" "s3insert" "s1c" "s3c" "s3select"
+permutation "s3insert" "s1u" "s3c" "s2insert" "s1c" "s2c" "s3select"
--
2.14.1
On Thu, Mar 8, 2018 at 12:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Mar 8, 2018 at 11:04 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
On Tue, Feb 13, 2018 at 12:41 PM, amul sul <sulamul@gmail.com> wrote:
Thanks for the confirmation, updated patch attached.
I am actually very surprised that 0001-Invalidate-ip_blkid-v5.patch does not
do anything to deal with the fact that t_ctid may no longer point to itself
to mark the end of the chain. I just can't see how that would work.
I think it is not that the patch doesn't care about the end of the chain.
For example, ctid pointing to itself is used to ensure that for
deleted rows, nothing more needs to be done like below check in the
ExecUpdate/ExecDelete code path.
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
..
}
..
It will deal with such cases by checking the invalid block id before these
checks. So, we should be fine in such cases.
But if it does, it needs a good amount of comments explaining why and most likely
updating comments at other places where chain following is done. For
example, how is this code in heap_get_latest_tid() still valid? Aren't we
setting "ctid" to some invalid value here?
2302 /*
2303 * If there's a valid t_ctid link, follow it, else we're done.
2304 */
2305 if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
2306 HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
2307 ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
2308 {
2309 UnlockReleaseBuffer(buffer);
2310 break;
2311 }
2312
2313 ctid = tp.t_data->t_ctid;
I have not tested, but it seems this could be problematic, but I feel
we could deal with such cases by checking the invalid block id in the
above if check. Another such case is in EvalPlanQualFetch.
This is just one example. I am almost certain there are many such cases that
will require careful attention.
Right, I think we should be able to detect and fix such cases.
I found a couple of places (in heap_lock_updated_tuple, rewrite_heap_tuple,
EvalPlanQualFetch & heap_lock_updated_tuple_rec) where ItemPointerEquals is
used to check whether a tuple has been updated/deleted. With the proposed patch
ItemPointerEquals() will no longer work as before; we require an additional check
for updated/deleted tuples, and I have proposed the same in the latest patch[1].
Do let me know your thoughts/suggestions on this, thanks.
Regards,
Amul
1] /messages/by-id/CAAJ_b96mw5xn5oSQgxpgn5dWFRs1j7OebpHRmXkdSNY+70yYEw@mail.gmail.com
On Fri, Mar 9, 2018 at 3:18 PM, amul sul <sulamul@gmail.com> wrote:
On Thu, Mar 8, 2018 at 12:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Mar 8, 2018 at 11:04 AM, Pavan Deolasee
This is just one example. I am almost certain there are many such cases that
will require careful attention.
Right, I think we should be able to detect and fix such cases.
I found a couple of places (in heap_lock_updated_tuple, rewrite_heap_tuple,
EvalPlanQualFetch & heap_lock_updated_tuple_rec) where ItemPointerEquals is
used to check whether a tuple has been updated/deleted. With the proposed patch
ItemPointerEquals() will no longer work as before; we require an additional check
for updated/deleted tuples, and I have proposed the same in the latest patch[1]. Do let me know
your thoughts/suggestions on this, thanks.
I think you have identified the places correctly. I have a few
suggestions though.
1.
- if (!ItemPointerEquals(&tuple->t_self, ctid))
+ if (!(ItemPointerEquals(&tuple->t_self, ctid) ||
+ (!ItemPointerValidBlockNumber(ctid) &&
+ (ItemPointerGetOffsetNumber(&tuple->t_self) == /* TODO: Condn.
should be macro? */
+ ItemPointerGetOffsetNumber(ctid)))))
Can't we write this and similar tests as:
ItemPointerValidBlockNumber(ctid) &&
!ItemPointerEquals(&tuple->t_self, ctid)? It will be a bit simpler to
understand and serve the purpose (see the sketch below).
2.
if (mytup.t_data->t_infomask & HEAP_XMAX_INVALID ||
ItemPointerEquals(&mytup.t_self, &mytup.t_data->t_ctid) ||
+ !HeapTupleHeaderValidBlockNumber(mytup.t_data) ||
HeapTupleHeaderIsOnlyLocked(mytup.t_data))
I think it is better to keep the check for
HeapTupleHeaderValidBlockNumber earlier than ItemPointerEquals as it
will first validate if block number is valid and then only compare the
complete CTID.
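A rough sketch of how the conditions might read with both suggestions applied (hypothetical, using the names from the v6-wip patch):

/* heap_lock_updated_tuple: follow the update chain only when the tuple
 * still has a valid block number and an update successor */
if (ItemPointerValidBlockNumber(ctid) &&
	!ItemPointerEquals(&tuple->t_self, ctid))
{
	/* ... lock the updated version of the tuple ... */
}

/* heap_lock_updated_tuple_rec: end-of-chain test, with the block number
 * validated before comparing the complete CTID */
if (mytup.t_data->t_infomask & HEAP_XMAX_INVALID ||
	!HeapTupleHeaderValidBlockNumber(mytup.t_data) ||
	ItemPointerEquals(&mytup.t_self, &mytup.t_data->t_ctid) ||
	HeapTupleHeaderIsOnlyLocked(mytup.t_data))
{
	result = HeapTupleMayBeUpdated;
	/* ... done following the chain ... */
}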
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Mar 10, 2018 at 5:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Mar 9, 2018 at 3:18 PM, amul sul <sulamul@gmail.com> wrote:
On Thu, Mar 8, 2018 at 12:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Mar 8, 2018 at 11:04 AM, Pavan Deolasee
This is just one example. I am almost certain there are many such cases that
will require careful attention.
Right, I think we should be able to detect and fix such cases.
I found a couple of places (in heap_lock_updated_tuple, rewrite_heap_tuple,
EvalPlanQualFetch & heap_lock_updated_tuple_rec) where ItemPointerEquals is
used to check whether a tuple has been updated/deleted. With the proposed patch
ItemPointerEquals() will no longer work as before; we require an additional check
for updated/deleted tuples, and I have proposed the same in the latest patch[1]. Do let me know
your thoughts/suggestions on this, thanks.
I think you have identified the places correctly. I have a few
suggestions though.
1.
- if (!ItemPointerEquals(&tuple->t_self, ctid))
+ if (!(ItemPointerEquals(&tuple->t_self, ctid) ||
+ (!ItemPointerValidBlockNumber(ctid) &&
+ (ItemPointerGetOffsetNumber(&tuple->t_self) == /* TODO: Condn. should be macro? */
+ ItemPointerGetOffsetNumber(ctid)))))
Can't we write this and similar tests as:
ItemPointerValidBlockNumber(ctid) &&
!ItemPointerEquals(&tuple->t_self, ctid)? It will be a bit simpler to
understand and serve the purpose.
Yes, you are correct, we need not worry about offset matching -- the invalid block
number check plus ItemPointerEquals is more than enough to conclude whether the tuple
has been deleted. Will change the condition accordingly in the next version.
2.
if (mytup.t_data->t_infomask & HEAP_XMAX_INVALID ||
ItemPointerEquals(&mytup.t_self, &mytup.t_data->t_ctid) ||
+ !HeapTupleHeaderValidBlockNumber(mytup.t_data) ||
HeapTupleHeaderIsOnlyLocked(mytup.t_data))
I think it is better to keep the check for
HeapTupleHeaderValidBlockNumber earlier than ItemPointerEquals as it
will first validate if block number is valid and then only compare the
complete CTID.
Sure, will do that.
Thanks for the confirmation and suggestions.
Regards,
Amul
On Mon, Mar 12, 2018 at 11:45 AM, amul sul <sulamul@gmail.com> wrote:
On Sat, Mar 10, 2018 at 5:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Mar 9, 2018 at 3:18 PM, amul sul <sulamul@gmail.com> wrote:
On Thu, Mar 8, 2018 at 12:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Mar 8, 2018 at 11:04 AM, Pavan Deolasee
This is just one example. I am almost certain there are many such cases that
will require careful attention.
Right, I think we should be able to detect and fix such cases.
I found a couple of places (in heap_lock_updated_tuple, rewrite_heap_tuple,
EvalPlanQualFetch & heap_lock_updated_tuple_rec) where ItemPointerEquals is
used to check whether a tuple has been updated/deleted. With the proposed patch,
ItemPointerEquals() will no longer work as before; we require an additional check
for an updated/deleted tuple, proposed in the latest patch[1]. Do let me know
your thoughts/suggestions on this, thanks.
I think you have identified the places correctly. I have a few
suggestions though.
1.
- if (!ItemPointerEquals(&tuple->t_self, ctid))
+ if (!(ItemPointerEquals(&tuple->t_self, ctid) ||
+ (!ItemPointerValidBlockNumber(ctid) &&
+ (ItemPointerGetOffsetNumber(&tuple->t_self) == /* TODO: Condn. should be macro? */
+ ItemPointerGetOffsetNumber(ctid)))))
Can't we write this and similar tests as:
ItemPointerValidBlockNumber(ctid) &&
!ItemPointerEquals(&tuple->t_self, ctid)? It will be a bit simpler to
understand and serve the purpose.
Yes, you are correct, we need not worry about offset matching -- invalid block
number check and ItemPointerEquals is more than enough to conclude whether the tuple has
been deleted or not. Will change the condition accordingly in the next version.
2.
if (mytup.t_data->t_infomask & HEAP_XMAX_INVALID ||
ItemPointerEquals(&mytup.t_self, &mytup.t_data->t_ctid) ||
+ !HeapTupleHeaderValidBlockNumber(mytup.t_data) ||
HeapTupleHeaderIsOnlyLocked(mytup.t_data))
I think it is better to keep the check for
HeapTupleHeaderValidBlockNumber earlier than ItemPointerEquals as it
will first validate if block number is valid and then only compare the
complete CTID.
Sure, will do that.
I did the aforementioned changes in the attached patch, thanks.
Regards,
Amul
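For reviewers skimming the diff below, the executor-side pattern the patch repeats in ExecUpdate, ExecDelete, GetTupleForTrigger, ExecLockRows and EvalPlanQualFetch is roughly the following sketch (identifiers as in the attached patch; the surrounding switch on the heap_update/heap_delete result is unchanged):

    case HeapTupleUpdated:
        if (IsolationUsesXactSnapshot())
            ereport(ERROR,
                    (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
                     errmsg("could not serialize access due to concurrent update")));

        /*
         * New with this patch: a ctid carrying an invalid block number means
         * the row was moved to another partition, so there is no version we
         * can follow -- report it instead of silently doing nothing.
         */
        if (!ItemPointerValidBlockNumber(&hufd.ctid))
            ereport(ERROR,
                    (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                     errmsg("tuple to be updated was already moved to another partition due to concurrent update")));

        if (!ItemPointerEquals(tupleid, &hufd.ctid))
        {
            /* ordinary in-partition update: recheck the new version via EvalPlanQual */
        }
        break;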
Attachments:
0001-Invalidate-ip_blkid-v6.patch (application/octet-stream)
From 889d108c71bb390543826f982d6163d122db6979 Mon Sep 17 00:00:00 2001
From: Amul Sul <sulamul@gmail.com>
Date: Tue, 13 Feb 2018 12:37:52 +0530
Subject: [PATCH 1/2] Invalidate ip_blkid v6
v6:
- Changes w.r.t Amit's suggestion[11].
- AFAICS, other than the previous places (v6-wip), I didn't find any
other places where we need the check for an invalid block number.
- Minor change in the comments : 'partition key' -> 'the partition key'
- ran pgindent.
v6-wip: Update w.r.t Andres Freund review comments[8]
- Added HeapTupleHeaderValidBlockNumber, HeapTupleHeaderSetBlockNumber
and ItemPointerValidBlockNumber macro.
- Fixed comments as per Andres' suggestions.
- Also added invalid block number check in heap_get_latest_tid as
discussed in the thread[9], similar change made in heap_lock_updated_tuple_rec.
- In heap_lock_updated_tuple, rewrite_heap_tuple & EvalPlanQualFetch,
I've added check for invalid block number where ItemPointerEquals used
to conclude tuple has been updated/deleted.
TODO:
1. Yet to test changes made in the heap_lock_updated_tuple,
rewrite_heap_tuple & EvalPlanQualFetch function are valid or not.
Also, address TODO tags in the same function.
2. Also check there are other places where similar changes like
above needed(wrt Pavan Deolasee[10] concern)
v5:
- Added code in heap_mask to skip wal_consistency_checking[7]
- Fixed previous todos.
v5-wip2:
- Minor changes in RelationFindReplTupleByIndex() and
RelationFindReplTupleSeq()
- TODO;
Same as the previous
v5-wip: Update w.r.t Amit Kapila's comments[6].
- Reverted error message in nodeModifyTable.c from 'tuple to be locked'
to 'tuple to be updated'.
- TODO:
1. Yet to made a decision of having LOG/ELOG/ASSERT in the
RelationFindReplTupleByIndex() and RelationFindReplTupleSeq().
v4: Rebased on "UPDATE of partition key v35" patch[5].
v3: Update w.r.t Amit Kapila's[3] & Alvaro Herrera[4] comments
- typo in all error message and comment : "to an another" -> "to another"
- error message change : "tuple to be updated" -> "tuple to be locked"
- In ExecOnConflictUpdate(), error report converted into assert &
comments added.
v2: Updated w.r.t Robert review comments[2]
- Updated couple of comment of heap_delete argument and ItemPointerData
- Added same concurrent update error logic in ExecOnConflictUpdate,
RelationFindReplTupleByIndex and RelationFindReplTupleSeq
v1: Initial version -- as per Amit Kapila's suggestions[1]
- When a tuple is being moved to another partition, ip_blkid in the
tuple header is marked InvalidBlockNumber.
-------------
References:
-------------
1] https://postgr.es/m/CAA4eK1KEZQ%2BCyXbBzfn1jFHoEfa_OemDLhLyy7xfD1QUZLo1DQ%40mail.gmail.com
2] https://postgr.es/m/CA%2BTgmoYY98AEjh7RDtuzaLC--_0smCozXRu6bFmZTaX5Ne%3DB5Q%40mail.gmail.com
3] https://postgr.es/m/CAA4eK1LQS6TmsGaEwR9HgF-9TZTHxrdAELuX6wOZBDbbjOfDjQ@mail.gmail.com
4] https://postgr.es/m/20171124160756.eyljpmpfzwd6jmnr@alvherre.pgsql
5] https://postgr.es/m/CAJ3gD9dixkkMzNnnP1CaZ1H17-U17ok_sVbjZZo+wnB=rJH6yg@mail.gmail.com
6] https://postgr.es/m/CAA4eK1LHVnNWYF53F1gUGx6CTxuvznozvU-Lr-dfE=Qeu1gEcg@mail.gmail.com
7] https://postgr.es/m/CAAJ_b94_29wiUA83W8LQjtfjv9XNV=+PT8+ioWRPjnnFHe3eqw@mail.gmail.com
8] https://postgr.es/m/20180305232353.gpue7jldnm4bjf4i@alap3.anarazel.de
9] https://postgr.es/m/CAAJ_b97BBkRWFowGRs9VNzFykoK0ikGB1yYEsWfYK8xR5enSrw@mail.gmail.com
10] https://postgr.es/m/CABOikdPXwqkLGgTZZm2qYwTn4L69V36rCh55fFma1fAYbon7Vg@mail.gmail.com
11] https://postgr.es/m/CAAJ_b97ohg=+WfiFT3g2x14rvXsXOFXvjH43GkYbgcLZvF7k+w@mail.gmail.com
---
src/backend/access/heap/heapam.c | 37 +++++++++++++++++++++++++++++++---
src/backend/access/heap/rewriteheap.c | 1 +
src/backend/commands/trigger.c | 5 +++++
src/backend/executor/execMain.c | 7 ++++++-
src/backend/executor/execReplication.c | 26 ++++++++++++++++--------
src/backend/executor/nodeLockRows.c | 5 +++++
src/backend/executor/nodeModifyTable.c | 28 +++++++++++++++++++++----
src/include/access/heapam.h | 2 +-
src/include/access/htup_details.h | 6 ++++++
src/include/storage/itemptr.h | 11 +++++++++-
10 files changed, 110 insertions(+), 18 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c08ab14c02..54e6c6bde0 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2304,6 +2304,7 @@ heap_get_latest_tid(Relation relation,
*/
if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
+ !HeapTupleHeaderValidBlockNumber(tp.t_data) ||
ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
{
UnlockReleaseBuffer(buffer);
@@ -2314,6 +2315,9 @@ heap_get_latest_tid(Relation relation,
priorXmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
UnlockReleaseBuffer(buffer);
} /* end of loop */
+
+ /* Make sure that the return value has a valid block number */
+ Assert(ItemPointerValidBlockNumber(tid));
}
@@ -3037,6 +3041,9 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
* crosscheck - if not InvalidSnapshot, also check tuple against this
* wait - true if should wait for any conflicting update to commit/abort
* hufd - output parameter, filled in failure cases (see below)
+ * changing_part - true iff the tuple is being moved to another partition
+ * table due to an update of the partition key. Otherwise,
+ * false.
*
* Normal, successful return value is HeapTupleMayBeUpdated, which
* actually means we did delete it. Failure return codes are
@@ -3052,7 +3059,7 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
HTSU_Result
heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd)
+ HeapUpdateFailureData *hufd, bool changing_part)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
@@ -3320,6 +3327,13 @@ l1:
/* Make sure there is no forward chain link in t_ctid */
tp.t_data->t_ctid = tp.t_self;
+ /*
+ * Set the block identifier to InvalidBlockNumber to indicate that this
+ * update has moved the tuple to another partition.
+ */
+ if (changing_part)
+ HeapTupleHeaderSetBlockNumber(tp.t_data, InvalidBlockNumber);
+
MarkBufferDirty(buffer);
/*
@@ -3445,7 +3459,7 @@ simple_heap_delete(Relation relation, ItemPointer tid)
result = heap_delete(relation, tid,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &hufd);
+ &hufd, false);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -5956,6 +5970,7 @@ l4:
next:
/* if we find the end of update chain, we're done. */
if (mytup.t_data->t_infomask & HEAP_XMAX_INVALID ||
+ !HeapTupleHeaderValidBlockNumber(mytup.t_data) ||
ItemPointerEquals(&mytup.t_self, &mytup.t_data->t_ctid) ||
HeapTupleHeaderIsOnlyLocked(mytup.t_data))
{
@@ -6007,7 +6022,8 @@ static HTSU_Result
heap_lock_updated_tuple(Relation rel, HeapTuple tuple, ItemPointer ctid,
TransactionId xid, LockTupleMode mode)
{
- if (!ItemPointerEquals(&tuple->t_self, ctid))
+ if (ItemPointerValidBlockNumber(ctid) &&
+ !ItemPointerEquals(&tuple->t_self, ctid))
{
/*
* If this is the first possibly-multixact-able operation in the
@@ -9322,6 +9338,21 @@ heap_mask(char *pagedata, BlockNumber blkno)
*/
if (HeapTupleHeaderIsSpeculative(page_htup))
ItemPointerSet(&page_htup->t_ctid, blkno, off);
+
+ /*
+ * For a deleted tuple, a block identifier is set to
+ * InvalidBlockNumber to indicate that the tuple has been moved to
+ * another partition due to an update of the partition key.
+ *
+ * During redo, heap_xlog_delete sets t_ctid to current block
+ * number and self offset number. It doesn't verify the tuple is
+ * deleted by usual delete/update or deleted by the update of the
+ * partition key on the master. Hence, like speculative tuple, to
+ * ignore any inconsistency set block identifier to current block
+ * number.
+ */
+ if (!HeapTupleHeaderValidBlockNumber(page_htup))
+ HeapTupleHeaderSetBlockNumber(page_htup, blkno);
}
/*
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 7d466c2588..0943d95ea1 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -424,6 +424,7 @@ rewrite_heap_tuple(RewriteState state,
*/
if (!((old_tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(old_tuple->t_data)) &&
+ HeapTupleHeaderValidBlockNumber(old_tuple->t_data) &&
!(ItemPointerEquals(&(old_tuple->t_self),
&(old_tuple->t_data->t_ctid))))
{
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index fbd176b5d0..93c1f2a51f 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -3071,6 +3071,11 @@ ltrmark:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!ItemPointerValidBlockNumber(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
if (!ItemPointerEquals(&hufd.ctid, &tuple.t_self))
{
/* it was updated, so look at the updated version */
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 91ba939bdc..884c012e5b 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2712,6 +2712,10 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!ItemPointerValidBlockNumber(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
/* Should not encounter speculative tuple on recheck */
Assert(!HeapTupleHeaderIsSpeculative(tuple.t_data));
@@ -2780,7 +2784,8 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
* As above, it should be safe to examine xmax and t_ctid without the
* buffer content lock, because they can't be changing.
*/
- if (ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
+ if (!HeapTupleHeaderValidBlockNumber(tuple.t_data) ||
+ ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
{
/* deleted, so forget about it */
ReleaseBuffer(buffer);
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 32891abbdf..8430420de7 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -190,10 +190,15 @@ retry:
case HeapTupleMayBeUpdated:
break;
case HeapTupleUpdated:
- /* XXX: Improve handling here */
- ereport(LOG,
- (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
- errmsg("concurrent update, retrying")));
+ if (!ItemPointerValidBlockNumber(&hufd.ctid))
+ ereport(LOG,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update, retrying")));
+ else
+ /* XXX: Improve handling here */
+ ereport(LOG,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("concurrent update, retrying")));
goto retry;
case HeapTupleInvisible:
elog(ERROR, "attempted to lock invisible tuple");
@@ -348,10 +353,15 @@ retry:
case HeapTupleMayBeUpdated:
break;
case HeapTupleUpdated:
- /* XXX: Improve handling here */
- ereport(LOG,
- (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
- errmsg("concurrent update, retrying")));
+ if (!ItemPointerValidBlockNumber(&hufd.ctid))
+ ereport(LOG,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update, retrying")));
+ else
+ /* XXX: Improve handling here */
+ ereport(LOG,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("concurrent update, retrying")));
goto retry;
case HeapTupleInvisible:
elog(ERROR, "attempted to lock invisible tuple");
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index b39ccf7dc1..dd4d5f25ca 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -218,6 +218,11 @@ lnext:
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!ItemPointerValidBlockNumber(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
if (ItemPointerEquals(&hufd.ctid, &tuple.t_self))
{
/* Tuple was deleted, so don't return it */
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index c32928d9bd..eae2b4c732 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -719,7 +719,8 @@ ExecDelete(ModifyTableState *mtstate,
EState *estate,
bool *tupleDeleted,
bool processReturning,
- bool canSetTag)
+ bool canSetTag,
+ bool changing_part)
{
ResultRelInfo *resultRelInfo;
Relation resultRelationDesc;
@@ -810,7 +811,8 @@ ldelete:;
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
- &hufd);
+ &hufd,
+ changing_part);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -856,6 +858,11 @@ ldelete:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!ItemPointerValidBlockNumber(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to another partition due to concurrent update")));
+
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
TupleTableSlot *epqslot;
@@ -1158,7 +1165,7 @@ lreplace:;
* processing. We want to return rows from INSERT.
*/
ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
- &tuple_deleted, false, false);
+ &tuple_deleted, false, false, true);
/*
* For some reason if DELETE didn't happen (e.g. trigger prevented
@@ -1303,6 +1310,11 @@ lreplace:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!ItemPointerValidBlockNumber(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to another partition due to concurrent update")));
+
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
TupleTableSlot *epqslot;
@@ -1473,6 +1485,14 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ /*
+ * As long as we don't support an UPDATE of INSERT ON CONFLICT for
+ * a partitioned table, we shouldn't reach a case where the tuple to
+ * be locked has been moved to another partition due to a concurrent
+ * update of the partition key.
+ */
+ Assert(ItemPointerValidBlockNumber(&hufd.ctid));
+
/*
* Tell caller to try again from the very start.
*
@@ -2062,7 +2082,7 @@ ExecModifyTable(PlanState *pstate)
case CMD_DELETE:
slot = ExecDelete(node, tupleid, oldtuple, planSlot,
&node->mt_epqstate, estate,
- NULL, true, node->canSetTag);
+ NULL, true, node->canSetTag, false);
break;
default:
elog(ERROR, "unknown operation");
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4c0256b18a..e8da83c303 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -156,7 +156,7 @@ extern void heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
CommandId cid, int options, BulkInsertState bistate);
extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd);
+ HeapUpdateFailureData *hufd, bool changing_part);
extern void heap_finish_speculative(Relation relation, HeapTuple tuple);
extern void heap_abort_speculative(Relation relation, HeapTuple tuple);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 2ab1815390..12df36c70b 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -441,6 +441,12 @@ do { \
ItemPointerSet(&(tup)->t_ctid, token, SpecTokenOffsetNumber) \
)
+#define HeapTupleHeaderSetBlockNumber(tup, blkno) \
+ ItemPointerSetBlockNumber(&(tup)->t_ctid, blkno)
+
+#define HeapTupleHeaderValidBlockNumber(tup) \
+ ItemPointerValidBlockNumber(&(tup)->t_ctid)
+
#define HeapTupleHeaderGetDatumLength(tup) \
VARSIZE(tup)
diff --git a/src/include/storage/itemptr.h b/src/include/storage/itemptr.h
index 6c9ed3696b..131ec518bf 100644
--- a/src/include/storage/itemptr.h
+++ b/src/include/storage/itemptr.h
@@ -23,7 +23,9 @@
* This is a pointer to an item within a disk page of a known file
* (for example, a cross-link from an index to its parent table).
* blkid tells us which block, posid tells us which entry in the linp
- * (ItemIdData) array we want.
+ * (ItemIdData) array we want. blkid is marked InvalidBlockNumber when
+ * a tuple is moved to another partition relation due to an update of
+ * the partition key.
*
* Note: because there is an item pointer in each tuple header and index
* tuple header on disk, it's very important not to waste space with
@@ -60,6 +62,13 @@ typedef ItemPointerData *ItemPointer;
#define ItemPointerIsValid(pointer) \
((bool) (PointerIsValid(pointer) && ((pointer)->ip_posid != 0)))
+/*
+ * ItemPointerValidBlockNumber
+ * True iff the block number of the item pointer is valid.
+ */
+#define ItemPointerValidBlockNumber(pointer) \
+ ((bool) (BlockNumberIsValid(ItemPointerGetBlockNumberNoCheck(pointer))))
+
/*
* ItemPointerGetBlockNumberNoCheck
* Returns the block number of a disk item pointer.
--
2.14.1
0002-isolation-tests-v6.patch (application/octet-stream)
From d5b5086cea1e1928a624b41c02ac78418aa062a8 Mon Sep 17 00:00:00 2001
From: Amul Sul <sulamul@gmail.com>
Date: Tue, 13 Feb 2018 12:37:33 +0530
Subject: [PATCH 2/2] isolation tests v6
v6: Minor changes in the regressions test.
- replace 'BEGIN' setup step with 'BEGIN ISOLATION LEVEL READ COMMITTED'
v5:
- As per Andres Freund suggestion[4], added test for ON CONFLICT DO
NOTHING
- TODO:
1. Cannot add ON CONFLICT DO UPDATE test since it's not supported
for partitioned table, may be after proposed patch[5]
v4:
- Rebased on Invalidate ip_blkid v5.
v3:
- Rebase on "UPDATE of partition key v35" patch[2] and
latest master head[3].
v2:
- Error message changed.
- Can't add isolation test[1] for
RelationFindReplTupleByIndex & RelationFindReplTupleSeq
- In ExecOnConflictUpdate, the error report is converted to assert
check.
v1:
Added isolation tests to hit an error in the following functions:
1. ExecUpdate -> specs/partition-key-update-1
2. ExecDelete -> specs/partition-key-update-1
3. GetTupleForTrigger -> specs/partition-key-update-2
4. ExecLockRows -> specs/partition-key-update-3
------------
References:
------------
1] https://postgr.es/m/CA+TgmoYsMRo2PHFTGUFifv4ZSCZ9LNJASbOyb=9it2=UA4j4vw@mail.gmail.com
2] https://postgr.es/m/CAJ3gD9dixkkMzNnnP1CaZ1H17-U17ok_sVbjZZo+wnB=rJH6yg@mail.gmail.com
3] Commit id bdb70c12b3a2e69eec6e51411df60d9f43ecc841
4] https://postgr.es/m/20180305232353.gpue7jldnm4bjf4i@alap3.anarazel.de
5] https://postgr.es/m/20180228004602.cwdyralmg5ejdqkq@alvherre.pgsql
fixup! isolation tests v5
---
.../isolation/expected/partition-key-update-1.out | 43 +++++++
.../isolation/expected/partition-key-update-2.out | 23 ++++
.../isolation/expected/partition-key-update-3.out | 9 ++
.../isolation/expected/partition-key-update-4.out | 29 +++++
.../isolation/expected/partition-key-update-5.out | 139 +++++++++++++++++++++
src/test/isolation/isolation_schedule | 5 +
.../isolation/specs/partition-key-update-1.spec | 39 ++++++
.../isolation/specs/partition-key-update-2.spec | 41 ++++++
.../isolation/specs/partition-key-update-3.spec | 32 +++++
.../isolation/specs/partition-key-update-4.spec | 45 +++++++
.../isolation/specs/partition-key-update-5.spec | 44 +++++++
11 files changed, 449 insertions(+)
create mode 100644 src/test/isolation/expected/partition-key-update-1.out
create mode 100644 src/test/isolation/expected/partition-key-update-2.out
create mode 100644 src/test/isolation/expected/partition-key-update-3.out
create mode 100644 src/test/isolation/expected/partition-key-update-4.out
create mode 100644 src/test/isolation/expected/partition-key-update-5.out
create mode 100644 src/test/isolation/specs/partition-key-update-1.spec
create mode 100644 src/test/isolation/specs/partition-key-update-2.spec
create mode 100644 src/test/isolation/specs/partition-key-update-3.spec
create mode 100644 src/test/isolation/specs/partition-key-update-4.spec
create mode 100644 src/test/isolation/specs/partition-key-update-5.spec
diff --git a/src/test/isolation/expected/partition-key-update-1.out b/src/test/isolation/expected/partition-key-update-1.out
new file mode 100644
index 0000000000..bfbeccc852
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-1.out
@@ -0,0 +1,43 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u s1c s2u s2c
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+step s2u: UPDATE foo SET b='EFG' WHERE a=1;
+step s2c: COMMIT;
+
+starting permutation: s1u s2u s1c s2c
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s2u: UPDATE foo SET b='EFG' WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2u: <... completed>
+error in steps s1c s2u: ERROR: tuple to be updated was already moved to another partition due to concurrent update
+step s2c: COMMIT;
+
+starting permutation: s2u s1u s2c s1c
+step s2u: UPDATE foo SET b='EFG' WHERE a=1;
+step s1u: UPDATE foo SET a=2 WHERE a=1; <waiting ...>
+step s2c: COMMIT;
+step s1u: <... completed>
+step s1c: COMMIT;
+
+starting permutation: s1u s1c s2d s2c
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+step s2d: DELETE FROM foo WHERE a=1;
+step s2c: COMMIT;
+
+starting permutation: s1u s2d s1c s2c
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s2d: DELETE FROM foo WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2d: <... completed>
+error in steps s1c s2d: ERROR: tuple to be updated was already moved to another partition due to concurrent update
+step s2c: COMMIT;
+
+starting permutation: s2d s1u s2c s1c
+step s2d: DELETE FROM foo WHERE a=1;
+step s1u: UPDATE foo SET a=2 WHERE a=1; <waiting ...>
+step s2c: COMMIT;
+step s1u: <... completed>
+step s1c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-2.out b/src/test/isolation/expected/partition-key-update-2.out
new file mode 100644
index 0000000000..06460a8da7
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-2.out
@@ -0,0 +1,23 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u s1c s2u s2c
+step s1u: UPDATE foo SET b='EFG' WHERE a=1;
+step s1c: COMMIT;
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1;
+step s2c: COMMIT;
+
+starting permutation: s1u s2u s1c s2c
+step s1u: UPDATE foo SET b='EFG' WHERE a=1;
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2u: <... completed>
+error in steps s1c s2u: ERROR: tuple to be locked was already moved to another partition due to concurrent update
+step s2c: COMMIT;
+
+starting permutation: s2u s1u s2c s1c
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1;
+step s1u: UPDATE foo SET b='EFG' WHERE a=1; <waiting ...>
+step s2c: COMMIT;
+step s1u: <... completed>
+error in steps s2c s1u: ERROR: tuple to be locked was already moved to another partition due to concurrent update
+step s1c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-3.out b/src/test/isolation/expected/partition-key-update-3.out
new file mode 100644
index 0000000000..1be63dfb8b
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-3.out
@@ -0,0 +1,9 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u3 s2i s1c s2c
+step s1u3: UPDATE foo_r SET a=11 WHERE a=7 AND b = 'ABC';
+step s2i: INSERT INTO bar VALUES(7); <waiting ...>
+step s1c: COMMIT;
+step s2i: <... completed>
+error in steps s1c s2i: ERROR: tuple to be locked was already moved to another partition due to concurrent update
+step s2c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-4.out b/src/test/isolation/expected/partition-key-update-4.out
new file mode 100644
index 0000000000..363de0d69c
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-4.out
@@ -0,0 +1,29 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s1u s2donothing s3donothing s1c s2c s3select s3c
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s3donothing: <... completed>
+step s2c: COMMIT;
+step s3select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+step s3c: COMMIT;
+
+starting permutation: s2donothing s1u s3donothing s1c s2c s3select s3c
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+step s2c: COMMIT;
+step s3select: SELECT * FROM foo ORDER BY a;
+a b
+
+2 initial tuple -> moved by session-1
+step s3c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-5.out b/src/test/isolation/expected/partition-key-update-5.out
new file mode 100644
index 0000000000..42dfe64ad3
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-5.out
@@ -0,0 +1,139 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s2beginrr s3beginrr s1u s2donothing s1c s2c s3donothing s3c s2select
+step s2beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s3beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s2c: COMMIT;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING;
+step s3c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2beginrr s3beginrr s1u s3donothing s1c s3c s2donothing s2c s2select
+step s2beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s3beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+error in steps s1c s3donothing: ERROR: could not serialize access due to concurrent update
+step s3c: COMMIT;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING;
+step s2c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2beginrr s3beginrr s1u s2donothing s3donothing s1c s2c s3c s2select
+step s2beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s3beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s3donothing: <... completed>
+error in steps s1c s2donothing s3donothing: ERROR: could not serialize access due to concurrent update
+step s2c: COMMIT;
+step s3c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2beginrr s3beginrr s1u s3donothing s2donothing s1c s3c s2c s2select
+step s2beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s3beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+step s2donothing: <... completed>
+error in steps s1c s3donothing s2donothing: ERROR: could not serialize access due to concurrent update
+step s3c: COMMIT;
+step s2c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2begins s3begins s1u s2donothing s1c s2c s3donothing s3c s2select
+step s2begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s3begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s2c: COMMIT;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING;
+step s3c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2begins s3begins s1u s3donothing s1c s3c s2donothing s2c s2select
+step s2begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s3begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+error in steps s1c s3donothing: ERROR: could not serialize access due to concurrent update
+step s3c: COMMIT;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING;
+step s2c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2begins s3begins s1u s2donothing s3donothing s1c s2c s3c s2select
+step s2begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s3begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s3donothing: <... completed>
+error in steps s1c s2donothing s3donothing: ERROR: could not serialize access due to concurrent update
+step s2c: COMMIT;
+step s3c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2begins s3begins s1u s3donothing s2donothing s1c s3c s2c s2select
+step s2begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s3begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+step s2donothing: <... completed>
+error in steps s1c s3donothing s2donothing: ERROR: could not serialize access due to concurrent update
+step s3c: COMMIT;
+step s2c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 74d7d59546..26f88c50b6 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -66,3 +66,8 @@ test: async-notify
test: vacuum-reltuples
test: timeouts
test: vacuum-concurrent-drop
+test: partition-key-update-1
+test: partition-key-update-2
+test: partition-key-update-3
+test: partition-key-update-4
+test: partition-key-update-5
diff --git a/src/test/isolation/specs/partition-key-update-1.spec b/src/test/isolation/specs/partition-key-update-1.spec
new file mode 100644
index 0000000000..32d555c37c
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-1.spec
@@ -0,0 +1,39 @@
+# Concurrency error from ExecUpdate and ExecDelete.
+
+# Throw an error to indicate that the targeted row has already been moved to
+# another partition, in the concurrent case where a session trying to
+# update/delete a row that is locked for a concurrent update by another
+# session causes tuple movement to another partition due to an update of the
+# partition key.
+
+setup
+{
+ CREATE TABLE foo (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'ABC');
+}
+
+teardown
+{
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1u" { UPDATE foo SET a=2 WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s2u" { UPDATE foo SET b='EFG' WHERE a=1; }
+step "s2d" { DELETE FROM foo WHERE a=1; }
+step "s2c" { COMMIT; }
+
+permutation "s1u" "s1c" "s2u" "s2c"
+permutation "s1u" "s2u" "s1c" "s2c"
+permutation "s2u" "s1u" "s2c" "s1c"
+
+permutation "s1u" "s1c" "s2d" "s2c"
+permutation "s1u" "s2d" "s1c" "s2c"
+permutation "s2d" "s1u" "s2c" "s1c"
diff --git a/src/test/isolation/specs/partition-key-update-2.spec b/src/test/isolation/specs/partition-key-update-2.spec
new file mode 100644
index 0000000000..8a952892c2
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-2.spec
@@ -0,0 +1,41 @@
+# Concurrency error from GetTupleForTrigger
+
+# Like partition-key-update-1.spec, throw an error when a session tries to
+# update a row that has been moved to another partition due to a concurrent
+# update by another session.
+
+setup
+{
+ CREATE TABLE foo (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'ABC');
+ CREATE FUNCTION func_foo_mod_a() RETURNS TRIGGER AS $$
+ BEGIN
+ NEW.a = 2; -- This is changing partition key column.
+ RETURN NEW;
+ END $$ LANGUAGE PLPGSQL;
+ CREATE TRIGGER foo_mod_a BEFORE UPDATE ON foo1
+ FOR EACH ROW EXECUTE PROCEDURE func_foo_mod_a();
+}
+
+teardown
+{
+ DROP TRIGGER foo_mod_a ON foo1;
+ DROP FUNCTION func_foo_mod_a();
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1u" { UPDATE foo SET b='EFG' WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s2u" { UPDATE foo SET b='XYZ' WHERE a=1; }
+step "s2c" { COMMIT; }
+
+permutation "s1u" "s1c" "s2u" "s2c"
+permutation "s1u" "s2u" "s1c" "s2c"
+permutation "s2u" "s1u" "s2c" "s1c"
diff --git a/src/test/isolation/specs/partition-key-update-3.spec b/src/test/isolation/specs/partition-key-update-3.spec
new file mode 100644
index 0000000000..1baa0159de
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-3.spec
@@ -0,0 +1,32 @@
+# Concurrency error from ExecLockRows
+
+# Like partition-key-update-1.spec, throw an error when a session tries to
+# lock a row that has been moved to another partition due to a concurrent
+# update by another session.
+
+setup
+{
+ CREATE TABLE foo_r (a int, b text) PARTITION BY RANGE(a);
+ CREATE TABLE foo_r1 PARTITION OF foo_r FOR VALUES FROM (1) TO (10);
+ CREATE TABLE foo_r2 PARTITION OF foo_r FOR VALUES FROM (10) TO (20);
+ INSERT INTO foo_r VALUES(7, 'ABC');
+ CREATE UNIQUE INDEX foo_r1_a_unique ON foo_r1 (a);
+ CREATE TABLE bar (a int REFERENCES foo_r1(a));
+}
+
+teardown
+{
+ DROP TABLE bar, foo_r;
+}
+
+session "s1"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1u3" { UPDATE foo_r SET a=11 WHERE a=7 AND b = 'ABC'; }
+step "s1c" { COMMIT; }
+
+session "s2"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s2i" { INSERT INTO bar VALUES(7); }
+step "s2c" { COMMIT; }
+
+permutation "s1u3" "s2i" "s1c" "s2c"
diff --git a/src/test/isolation/specs/partition-key-update-4.spec b/src/test/isolation/specs/partition-key-update-4.spec
new file mode 100644
index 0000000000..699e2e727f
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-4.spec
@@ -0,0 +1,45 @@
+# Concurrent update of a partition key and INSERT...ON CONFLICT DO NOTHING test
+#
+# This test tries to expose problems with the interaction between concurrent
+# sessions during an update of the partition key and INSERT...ON CONFLICT DO
+# NOTHING on a partitioned table.
+#
+# The convention here is that session 1 moves a row from one partition to
+# another due to an update of the partition key, session 2 always ends up
+# inserting, and session 3 always ends up doing nothing.
+#
+# Note: This test slightly resembles the insert-conflict-do-nothing test.
+
+setup
+{
+ CREATE TABLE foo (a int primary key, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'initial tuple');
+}
+
+teardown
+{
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1u" { UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s2donothing" { INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; }
+step "s2c" { COMMIT; }
+
+session "s3"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s3donothing" { INSERT INTO foo VALUES(2, 'session-3 donothing') ON CONFLICT DO NOTHING; }
+step "s3select" { SELECT * FROM foo ORDER BY a; }
+step "s3c" { COMMIT; }
+
+# Regular case where one session block-waits on another to determine if it
+# should proceed with an insert or do nothing.
+permutation "s1u" "s2donothing" "s3donothing" "s1c" "s2c" "s3select" "s3c"
+permutation "s2donothing" "s1u" "s3donothing" "s1c" "s2c" "s3select" "s3c"
diff --git a/src/test/isolation/specs/partition-key-update-5.spec b/src/test/isolation/specs/partition-key-update-5.spec
new file mode 100644
index 0000000000..a6efea1381
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-5.spec
@@ -0,0 +1,44 @@
+# Concurrent update of a partition key and INSERT...ON CONFLICT DO NOTHING
+# test on partitioned table with multiple rows in higher isolation levels.
+#
+# Note: This test resembles the insert-conflict-do-nothing-2 test
+
+setup
+{
+ CREATE TABLE foo (a int primary key, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'initial tuple');
+}
+
+teardown
+{
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1u" { UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2beginrr" { BEGIN ISOLATION LEVEL REPEATABLE READ; }
+step "s2begins" { BEGIN ISOLATION LEVEL SERIALIZABLE; }
+step "s2donothing" { INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; }
+step "s2c" { COMMIT; }
+step "s2select" { SELECT * FROM foo ORDER BY a; }
+
+session "s3"
+step "s3beginrr" { BEGIN ISOLATION LEVEL REPEATABLE READ; }
+step "s3begins" { BEGIN ISOLATION LEVEL SERIALIZABLE; }
+step "s3donothing" { INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; }
+step "s3c" { COMMIT; }
+
+permutation "s2beginrr" "s3beginrr" "s1u" "s2donothing" "s1c" "s2c" "s3donothing" "s3c" "s2select"
+permutation "s2beginrr" "s3beginrr" "s1u" "s3donothing" "s1c" "s3c" "s2donothing" "s2c" "s2select"
+permutation "s2beginrr" "s3beginrr" "s1u" "s2donothing" "s3donothing" "s1c" "s2c" "s3c" "s2select"
+permutation "s2beginrr" "s3beginrr" "s1u" "s3donothing" "s2donothing" "s1c" "s3c" "s2c" "s2select"
+permutation "s2begins" "s3begins" "s1u" "s2donothing" "s1c" "s2c" "s3donothing" "s3c" "s2select"
+permutation "s2begins" "s3begins" "s1u" "s3donothing" "s1c" "s3c" "s2donothing" "s2c" "s2select"
+permutation "s2begins" "s3begins" "s1u" "s2donothing" "s3donothing" "s1c" "s2c" "s3c" "s2select"
+permutation "s2begins" "s3begins" "s1u" "s3donothing" "s2donothing" "s1c" "s3c" "s2c" "s2select"
--
2.14.1
On Mon, Mar 12, 2018 at 6:33 PM, amul sul <sulamul@gmail.com> wrote:
On Mon, Mar 12, 2018 at 11:45 AM, amul sul <sulamul@gmail.com> wrote:
On Sat, Mar 10, 2018 at 5:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
complete CTID.
Sure, will do that.
I did the aforementioned changes in the attached patch, thanks.
--- a/src/include/storage/itemptr.h
+++ b/src/include/storage/itemptr.h
@@ -23,7 +23,9 @@
* This is a pointer to an item within a disk page of a known file
* (for example, a cross-link from an index to its parent table).
* blkid tells us which block, posid tells us which entry in the linp
- * (ItemIdData) array we want.
+ * (ItemIdData) array we want. blkid is marked InvalidBlockNumber when
+ * a tuple is moved to another partition relation due to an update of
+ * the partition key.
I think instead of updating this description in itemptr.h, you should
update it in htup_details.h where we already have a description of
t_ctid. After this patch, the t_ctid column value in the heap_page_items
function will show InvalidBlockNumber, and in the documentation
we have given a reference to htup_details.h. Other than that, the
latest version looks good to me.
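For context, the existing t_ctid notes in htup_details.h could gain wording along these lines (only a sketch of the idea; the exact text in the attached v7 patch may differ):

    /*
     * t_ctid normally links to the newer version of an updated tuple, or to
     * the tuple itself when it is the latest version.  However, when a row
     * is moved to another partition by an UPDATE of the partition key, the
     * delete on the source partition stores InvalidBlockNumber in t_ctid's
     * block field, since the new version lives in a different relation and
     * cannot be reached through a ctid link.
     */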
I have marked this patch as RFC as this is a small change, hope you
can update the patch soon.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Mar 17, 2018 at 4:32 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Mar 12, 2018 at 6:33 PM, amul sul <sulamul@gmail.com> wrote:
On Mon, Mar 12, 2018 at 11:45 AM, amul sul <sulamul@gmail.com> wrote:
On Sat, Mar 10, 2018 at 5:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
complete CTID.
Sure, will do that.
I did the aforementioned changes in the attached patch, thanks.
--- a/src/include/storage/itemptr.h
+++ b/src/include/storage/itemptr.h
@@ -23,7 +23,9 @@
 * This is a pointer to an item within a disk page of a known file
 * (for example, a cross-link from an index to its parent table).
 * blkid tells us which block, posid tells us which entry in the linp
- * (ItemIdData) array we want.
+ * (ItemIdData) array we want. blkid is marked InvalidBlockNumber when
+ * a tuple is moved to another partition relation due to an update of
+ * the partition key.
I think instead of updating this description in itemptr.h, you should
update it in htup_details.h where we already have a description of
t_ctid. After this patch, the t_ctid column value in the heap_page_items
function will show InvalidBlockNumber, and in the documentation
we have given a reference to htup_details.h. Other than that, the
latest version looks good to me.
Okay, fixed in the attached version.
I have marked this patch as RFC as this is a small change, hope you
can update the patch soon.
Thank you, updated patch attached.
Regards,
Amul
Attachments:
0001-Invalidate-ip_blkid-v7.patch (application/octet-stream)
From e4c78647c044b818ad4694e0eeb42db3aac4e9d2 Mon Sep 17 00:00:00 2001
From: Amul Sul <sulamul@gmail.com>
Date: Mon, 19 Mar 2018 11:31:18 +0530
Subject: [PATCH 1/2] Invalidate ip_blkid v7
v7: Update w.r.t Amit Kapila's review comments[12]
- Reverted the ItemPointerData comment changes made in the v2 patch.
- Updated t_ctid comment in htup_details.h.
v6:
- Changes w.r.t Amit's suggestion[11].
- AFAICS, other than the previous places (v6-wip), I didn't find any
other places where we need the check for an invalid block number.
- Minor change in the comments : 'partition key' -> 'the partition key'
- ran pgindent.
v6-wip: Update w.r.t Andres Freund review comments[8]
- Added HeapTupleHeaderValidBlockNumber, HeapTupleHeaderSetBlockNumber
and ItemPointerValidBlockNumber macro.
- Fixed comments as per Andres' suggestions.
- Also added invalid block number check in heap_get_latest_tid as
discussed in the thread[9], similar change made in heap_lock_updated_tuple_rec.
- In heap_lock_updated_tuple, rewrite_heap_tuple & EvalPlanQualFetch,
I've added check for invalid block number where ItemPointerEquals used
to conclude tuple has been updated/deleted.
TODO:
1. Yet to test changes made in the heap_lock_updated_tuple,
rewrite_heap_tuple & EvalPlanQualFetch function are valid or not.
Also, address TODO tags in the same function.
2. Also check there are other places where similar changes like
above needed(wrt Pavan Deolasee[10] concern)
v5:
- Added code in heap_mask to skip wal_consistency_checking[7]
- Fixed previous todos.
v5-wip2:
- Minor changes in RelationFindReplTupleByIndex() and
RelationFindReplTupleSeq()
- TODO;
Same as the previous
v5-wip: Update w.r.t Amit Kapila's comments[6].
- Reverted error message in nodeModifyTable.c from 'tuple to be locked'
to 'tuple to be updated'.
- TODO:
1. Yet to made a decision of having LOG/ELOG/ASSERT in the
RelationFindReplTupleByIndex() and RelationFindReplTupleSeq().
v4: Rebased on "UPDATE of partition key v35" patch[5].
v3: Update w.r.t Amit Kapila's[3] & Alvaro Herrera[4] comments
- typo in all error message and comment : "to an another" -> "to another"
- error message change : "tuple to be updated" -> "tuple to be locked"
- In ExecOnConflictUpdate(), error report converted into assert &
comments added.
v2: Updated w.r.t Robert review comments[2]
- Updated couple of comment of heap_delete argument and ItemPointerData
- Added same concurrent update error logic in ExecOnConflictUpdate,
RelationFindReplTupleByIndex and RelationFindReplTupleSeq
v1: Initial version -- as per Amit Kapila's suggestions[1]
- When a tuple is being moved to another partition, ip_blkid in the
tuple header is marked InvalidBlockNumber.
-------------
References:
-------------
1] https://postgr.es/m/CAA4eK1KEZQ%2BCyXbBzfn1jFHoEfa_OemDLhLyy7xfD1QUZLo1DQ%40mail.gmail.com
2] https://postgr.es/m/CA%2BTgmoYY98AEjh7RDtuzaLC--_0smCozXRu6bFmZTaX5Ne%3DB5Q%40mail.gmail.com
3] https://postgr.es/m/CAA4eK1LQS6TmsGaEwR9HgF-9TZTHxrdAELuX6wOZBDbbjOfDjQ@mail.gmail.com
4] https://postgr.es/m/20171124160756.eyljpmpfzwd6jmnr@alvherre.pgsql
5] https://postgr.es/m/CAJ3gD9dixkkMzNnnP1CaZ1H17-U17ok_sVbjZZo+wnB=rJH6yg@mail.gmail.com
6] https://postgr.es/m/CAA4eK1LHVnNWYF53F1gUGx6CTxuvznozvU-Lr-dfE=Qeu1gEcg@mail.gmail.com
7] https://postgr.es/m/CAAJ_b94_29wiUA83W8LQjtfjv9XNV=+PT8+ioWRPjnnFHe3eqw@mail.gmail.com
8] https://postgr.es/m/20180305232353.gpue7jldnm4bjf4i@alap3.anarazel.de
9] https://postgr.es/m/CAAJ_b97BBkRWFowGRs9VNzFykoK0ikGB1yYEsWfYK8xR5enSrw@mail.gmail.com
10] https://postgr.es/m/CABOikdPXwqkLGgTZZm2qYwTn4L69V36rCh55fFma1fAYbon7Vg@mail.gmail.com
11] https://postgr.es/m/CAAJ_b97ohg=+WfiFT3g2x14rvXsXOFXvjH43GkYbgcLZvF7k+w@mail.gmail.com
12] https://postgr.es/m/CAA4eK1LbML3=uruBA46qGA9ZE_F3Co8MsjefLrJoLz+Q_TW3vg@mail.gmail.com
---
src/backend/access/heap/heapam.c | 37 +++++++++++++++++++++++++++++++---
src/backend/access/heap/rewriteheap.c | 1 +
src/backend/commands/trigger.c | 5 +++++
src/backend/executor/execMain.c | 7 ++++++-
src/backend/executor/execReplication.c | 26 ++++++++++++++++--------
src/backend/executor/nodeLockRows.c | 5 +++++
src/backend/executor/nodeModifyTable.c | 28 +++++++++++++++++++++----
src/include/access/heapam.h | 2 +-
src/include/access/htup_details.h | 12 +++++++++--
src/include/storage/itemptr.h | 7 +++++++
10 files changed, 111 insertions(+), 19 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c08ab14c02..54e6c6bde0 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2304,6 +2304,7 @@ heap_get_latest_tid(Relation relation,
*/
if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
+ !HeapTupleHeaderValidBlockNumber(tp.t_data) ||
ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
{
UnlockReleaseBuffer(buffer);
@@ -2314,6 +2315,9 @@ heap_get_latest_tid(Relation relation,
priorXmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
UnlockReleaseBuffer(buffer);
} /* end of loop */
+
+ /* Make sure that the return value has a valid block number */
+ Assert(ItemPointerValidBlockNumber(tid));
}
@@ -3037,6 +3041,9 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
* crosscheck - if not InvalidSnapshot, also check tuple against this
* wait - true if should wait for any conflicting update to commit/abort
* hufd - output parameter, filled in failure cases (see below)
+ * changing_part - true iff the tuple is being moved to another partition
+ * table due to an update of the partition key. Otherwise,
+ * false.
*
* Normal, successful return value is HeapTupleMayBeUpdated, which
* actually means we did delete it. Failure return codes are
@@ -3052,7 +3059,7 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
HTSU_Result
heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd)
+ HeapUpdateFailureData *hufd, bool changing_part)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
@@ -3320,6 +3327,13 @@ l1:
/* Make sure there is no forward chain link in t_ctid */
tp.t_data->t_ctid = tp.t_self;
+ /*
+ * Set the block identifier to InvalidBlockNumber to indicate that this
+ * update has moved the tuple to another partition.
+ */
+ if (changing_part)
+ HeapTupleHeaderSetBlockNumber(tp.t_data, InvalidBlockNumber);
+
MarkBufferDirty(buffer);
/*
@@ -3445,7 +3459,7 @@ simple_heap_delete(Relation relation, ItemPointer tid)
result = heap_delete(relation, tid,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &hufd);
+ &hufd, false);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -5956,6 +5970,7 @@ l4:
next:
/* if we find the end of update chain, we're done. */
if (mytup.t_data->t_infomask & HEAP_XMAX_INVALID ||
+ !HeapTupleHeaderValidBlockNumber(mytup.t_data) ||
ItemPointerEquals(&mytup.t_self, &mytup.t_data->t_ctid) ||
HeapTupleHeaderIsOnlyLocked(mytup.t_data))
{
@@ -6007,7 +6022,8 @@ static HTSU_Result
heap_lock_updated_tuple(Relation rel, HeapTuple tuple, ItemPointer ctid,
TransactionId xid, LockTupleMode mode)
{
- if (!ItemPointerEquals(&tuple->t_self, ctid))
+ if (ItemPointerValidBlockNumber(ctid) &&
+ !ItemPointerEquals(&tuple->t_self, ctid))
{
/*
* If this is the first possibly-multixact-able operation in the
@@ -9322,6 +9338,21 @@ heap_mask(char *pagedata, BlockNumber blkno)
*/
if (HeapTupleHeaderIsSpeculative(page_htup))
ItemPointerSet(&page_htup->t_ctid, blkno, off);
+
+ /*
+ * For a deleted tuple, a block identifier is set to
+ * InvalidBlockNumber to indicate that the tuple has been moved to
+ * another partition due to an update of the partition key.
+ *
+ * During redo, heap_xlog_delete sets t_ctid to current block
+ * number and self offset number. It doesn't verify the tuple is
+ * deleted by usual delete/update or deleted by the update of the
+ * partition key on the master. Hence, like speculative tuple, to
+ * ignore any inconsistency set block identifier to current block
+ * number.
+ */
+ if (!HeapTupleHeaderValidBlockNumber(page_htup))
+ HeapTupleHeaderSetBlockNumber(page_htup, blkno);
}
/*
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 7d466c2588..0943d95ea1 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -424,6 +424,7 @@ rewrite_heap_tuple(RewriteState state,
*/
if (!((old_tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(old_tuple->t_data)) &&
+ HeapTupleHeaderValidBlockNumber(old_tuple->t_data) &&
!(ItemPointerEquals(&(old_tuple->t_self),
&(old_tuple->t_data->t_ctid))))
{
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index fbd176b5d0..93c1f2a51f 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -3071,6 +3071,11 @@ ltrmark:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!ItemPointerValidBlockNumber(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
if (!ItemPointerEquals(&hufd.ctid, &tuple.t_self))
{
/* it was updated, so look at the updated version */
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 91ba939bdc..884c012e5b 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2712,6 +2712,10 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!ItemPointerValidBlockNumber(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
/* Should not encounter speculative tuple on recheck */
Assert(!HeapTupleHeaderIsSpeculative(tuple.t_data));
@@ -2780,7 +2784,8 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
* As above, it should be safe to examine xmax and t_ctid without the
* buffer content lock, because they can't be changing.
*/
- if (ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
+ if (!HeapTupleHeaderValidBlockNumber(tuple.t_data) ||
+ ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
{
/* deleted, so forget about it */
ReleaseBuffer(buffer);
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 32891abbdf..8430420de7 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -190,10 +190,15 @@ retry:
case HeapTupleMayBeUpdated:
break;
case HeapTupleUpdated:
- /* XXX: Improve handling here */
- ereport(LOG,
- (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
- errmsg("concurrent update, retrying")));
+ if (!ItemPointerValidBlockNumber(&hufd.ctid))
+ ereport(LOG,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update, retrying")));
+ else
+ /* XXX: Improve handling here */
+ ereport(LOG,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("concurrent update, retrying")));
goto retry;
case HeapTupleInvisible:
elog(ERROR, "attempted to lock invisible tuple");
@@ -348,10 +353,15 @@ retry:
case HeapTupleMayBeUpdated:
break;
case HeapTupleUpdated:
- /* XXX: Improve handling here */
- ereport(LOG,
- (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
- errmsg("concurrent update, retrying")));
+ if (!ItemPointerValidBlockNumber(&hufd.ctid))
+ ereport(LOG,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update, retrying")));
+ else
+ /* XXX: Improve handling here */
+ ereport(LOG,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("concurrent update, retrying")));
goto retry;
case HeapTupleInvisible:
elog(ERROR, "attempted to lock invisible tuple");
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index b39ccf7dc1..dd4d5f25ca 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -218,6 +218,11 @@ lnext:
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!ItemPointerValidBlockNumber(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
if (ItemPointerEquals(&hufd.ctid, &tuple.t_self))
{
/* Tuple was deleted, so don't return it */
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 3332ae4bf3..784695aa1d 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -719,7 +719,8 @@ ExecDelete(ModifyTableState *mtstate,
EState *estate,
bool *tupleDeleted,
bool processReturning,
- bool canSetTag)
+ bool canSetTag,
+ bool changing_part)
{
ResultRelInfo *resultRelInfo;
Relation resultRelationDesc;
@@ -810,7 +811,8 @@ ldelete:;
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
- &hufd);
+ &hufd,
+ changing_part);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -856,6 +858,11 @@ ldelete:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!ItemPointerValidBlockNumber(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to another partition due to concurrent update")));
+
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
TupleTableSlot *epqslot;
@@ -1158,7 +1165,7 @@ lreplace:;
* processing. We want to return rows from INSERT.
*/
ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
- &tuple_deleted, false, false);
+ &tuple_deleted, false, false, true);
/*
* For some reason if DELETE didn't happen (e.g. trigger prevented
@@ -1303,6 +1310,11 @@ lreplace:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!ItemPointerValidBlockNumber(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to another partition due to concurrent update")));
+
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
TupleTableSlot *epqslot;
@@ -1473,6 +1485,14 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ /*
+ * As long as we don't support INSERT ON CONFLICT DO UPDATE for
+ * a partitioned table, we shouldn't reach a case where the tuple to
+ * be locked is moved to another partition due to a concurrent update
+ * of the partition key.
+ */
+ Assert(ItemPointerValidBlockNumber(&hufd.ctid));
+
/*
* Tell caller to try again from the very start.
*
@@ -2062,7 +2082,7 @@ ExecModifyTable(PlanState *pstate)
case CMD_DELETE:
slot = ExecDelete(node, tupleid, oldtuple, planSlot,
&node->mt_epqstate, estate,
- NULL, true, node->canSetTag);
+ NULL, true, node->canSetTag, false);
break;
default:
elog(ERROR, "unknown operation");
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4c0256b18a..e8da83c303 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -156,7 +156,7 @@ extern void heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
CommandId cid, int options, BulkInsertState bistate);
extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd);
+ HeapUpdateFailureData *hufd, bool changing_part);
extern void heap_finish_speculative(Relation relation, HeapTuple tuple);
extern void heap_abort_speculative(Relation relation, HeapTuple tuple);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 2ab1815390..964104b1d1 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -83,8 +83,10 @@
*
* A word about t_ctid: whenever a new tuple is stored on disk, its t_ctid
* is initialized with its own TID (location). If the tuple is ever updated,
- * its t_ctid is changed to point to the replacement version of the tuple.
- * Thus, a tuple is the latest version of its row iff XMAX is invalid or
+ * its t_ctid is changed to point to the replacement version of the tuple or
+ * the block number (ip_blkid) is invalidated if the tuple is moved from one
+ * partition to another partition relation due to an update of the partition
+ * key. Thus, a tuple is the latest version of its row iff XMAX is invalid or
* t_ctid points to itself (in which case, if XMAX is valid, the tuple is
* either locked or deleted). One can follow the chain of t_ctid links
* to find the newest version of the row. Beware however that VACUUM might
@@ -441,6 +443,12 @@ do { \
ItemPointerSet(&(tup)->t_ctid, token, SpecTokenOffsetNumber) \
)
+#define HeapTupleHeaderSetBlockNumber(tup, blkno) \
+ ItemPointerSetBlockNumber(&(tup)->t_ctid, blkno)
+
+#define HeapTupleHeaderValidBlockNumber(tup) \
+ ItemPointerValidBlockNumber(&(tup)->t_ctid)
+
#define HeapTupleHeaderGetDatumLength(tup) \
VARSIZE(tup)
diff --git a/src/include/storage/itemptr.h b/src/include/storage/itemptr.h
index 6c9ed3696b..b71449e712 100644
--- a/src/include/storage/itemptr.h
+++ b/src/include/storage/itemptr.h
@@ -60,6 +60,13 @@ typedef ItemPointerData *ItemPointer;
#define ItemPointerIsValid(pointer) \
((bool) (PointerIsValid(pointer) && ((pointer)->ip_posid != 0)))
+/*
+ * ItemPointerValidBlockNumber
+ * True iff the block number of the item pointer is valid.
+ */
+#define ItemPointerValidBlockNumber(pointer) \
+ ((bool) (BlockNumberIsValid(ItemPointerGetBlockNumberNoCheck(pointer))))
+
/*
* ItemPointerGetBlockNumberNoCheck
* Returns the block number of a disk item pointer.
--
2.14.1
0002-isolation-tests-v6.patch (application/octet-stream)
From 6f1699a2f9d9ab06091817b90e2728a649b72fb5 Mon Sep 17 00:00:00 2001
From: Amul Sul <sulamul@gmail.com>
Date: Tue, 13 Feb 2018 12:37:33 +0530
Subject: [PATCH 2/2] isolation tests v6
v6: Minor changes in the regression tests.
- replace 'BEGIN' setup step with 'BEGIN ISOLATION LEVEL READ COMMITTED'
v5:
- As per Andres Freund suggestion[4], added test for ON CONFLICT DO
NOTHING
- TODO:
1. Cannot add an ON CONFLICT DO UPDATE test since it's not supported
for partitioned tables; maybe possible after proposed patch[5]
v4:
- Rebased on Invalidate ip_blkid v5.
v3:
- Rebase on "UPDATE of partition key v35" patch[2] and
latest master head[3].
v2:
- Error message changed.
- Can't add isolation test[1] for
RelationFindReplTupleByIndex & RelationFindReplTupleSeq
- In ExecOnConflictUpdate, the error report is converted to assert
check.
v1:
Added isolation tests to hit an error in the following functions:
1. ExecUpdate -> specs/partition-key-update-1
2. ExecDelete -> specs/partition-key-update-1
3. GetTupleForTrigger -> specs/partition-key-update-2
4. ExecLockRows -> specs/partition-key-update-3
------------
References:
------------
1] https://postgr.es/m/CA+TgmoYsMRo2PHFTGUFifv4ZSCZ9LNJASbOyb=9it2=UA4j4vw@mail.gmail.com
2] https://postgr.es/m/CAJ3gD9dixkkMzNnnP1CaZ1H17-U17ok_sVbjZZo+wnB=rJH6yg@mail.gmail.com
3] Commit id bdb70c12b3a2e69eec6e51411df60d9f43ecc841
4] https://postgr.es/m/20180305232353.gpue7jldnm4bjf4i@alap3.anarazel.de
5] https://postgr.es/m/20180228004602.cwdyralmg5ejdqkq@alvherre.pgsql
fixup! isolation tests v5
---
.../isolation/expected/partition-key-update-1.out | 43 +++++++
.../isolation/expected/partition-key-update-2.out | 23 ++++
.../isolation/expected/partition-key-update-3.out | 9 ++
.../isolation/expected/partition-key-update-4.out | 29 +++++
.../isolation/expected/partition-key-update-5.out | 139 +++++++++++++++++++++
src/test/isolation/isolation_schedule | 5 +
.../isolation/specs/partition-key-update-1.spec | 39 ++++++
.../isolation/specs/partition-key-update-2.spec | 41 ++++++
.../isolation/specs/partition-key-update-3.spec | 32 +++++
.../isolation/specs/partition-key-update-4.spec | 45 +++++++
.../isolation/specs/partition-key-update-5.spec | 44 +++++++
11 files changed, 449 insertions(+)
create mode 100644 src/test/isolation/expected/partition-key-update-1.out
create mode 100644 src/test/isolation/expected/partition-key-update-2.out
create mode 100644 src/test/isolation/expected/partition-key-update-3.out
create mode 100644 src/test/isolation/expected/partition-key-update-4.out
create mode 100644 src/test/isolation/expected/partition-key-update-5.out
create mode 100644 src/test/isolation/specs/partition-key-update-1.spec
create mode 100644 src/test/isolation/specs/partition-key-update-2.spec
create mode 100644 src/test/isolation/specs/partition-key-update-3.spec
create mode 100644 src/test/isolation/specs/partition-key-update-4.spec
create mode 100644 src/test/isolation/specs/partition-key-update-5.spec
diff --git a/src/test/isolation/expected/partition-key-update-1.out b/src/test/isolation/expected/partition-key-update-1.out
new file mode 100644
index 0000000000..bfbeccc852
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-1.out
@@ -0,0 +1,43 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u s1c s2u s2c
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+step s2u: UPDATE foo SET b='EFG' WHERE a=1;
+step s2c: COMMIT;
+
+starting permutation: s1u s2u s1c s2c
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s2u: UPDATE foo SET b='EFG' WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2u: <... completed>
+error in steps s1c s2u: ERROR: tuple to be updated was already moved to another partition due to concurrent update
+step s2c: COMMIT;
+
+starting permutation: s2u s1u s2c s1c
+step s2u: UPDATE foo SET b='EFG' WHERE a=1;
+step s1u: UPDATE foo SET a=2 WHERE a=1; <waiting ...>
+step s2c: COMMIT;
+step s1u: <... completed>
+step s1c: COMMIT;
+
+starting permutation: s1u s1c s2d s2c
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+step s2d: DELETE FROM foo WHERE a=1;
+step s2c: COMMIT;
+
+starting permutation: s1u s2d s1c s2c
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s2d: DELETE FROM foo WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2d: <... completed>
+error in steps s1c s2d: ERROR: tuple to be updated was already moved to another partition due to concurrent update
+step s2c: COMMIT;
+
+starting permutation: s2d s1u s2c s1c
+step s2d: DELETE FROM foo WHERE a=1;
+step s1u: UPDATE foo SET a=2 WHERE a=1; <waiting ...>
+step s2c: COMMIT;
+step s1u: <... completed>
+step s1c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-2.out b/src/test/isolation/expected/partition-key-update-2.out
new file mode 100644
index 0000000000..06460a8da7
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-2.out
@@ -0,0 +1,23 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u s1c s2u s2c
+step s1u: UPDATE foo SET b='EFG' WHERE a=1;
+step s1c: COMMIT;
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1;
+step s2c: COMMIT;
+
+starting permutation: s1u s2u s1c s2c
+step s1u: UPDATE foo SET b='EFG' WHERE a=1;
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2u: <... completed>
+error in steps s1c s2u: ERROR: tuple to be locked was already moved to another partition due to concurrent update
+step s2c: COMMIT;
+
+starting permutation: s2u s1u s2c s1c
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1;
+step s1u: UPDATE foo SET b='EFG' WHERE a=1; <waiting ...>
+step s2c: COMMIT;
+step s1u: <... completed>
+error in steps s2c s1u: ERROR: tuple to be locked was already moved to another partition due to concurrent update
+step s1c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-3.out b/src/test/isolation/expected/partition-key-update-3.out
new file mode 100644
index 0000000000..1be63dfb8b
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-3.out
@@ -0,0 +1,9 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u3 s2i s1c s2c
+step s1u3: UPDATE foo_r SET a=11 WHERE a=7 AND b = 'ABC';
+step s2i: INSERT INTO bar VALUES(7); <waiting ...>
+step s1c: COMMIT;
+step s2i: <... completed>
+error in steps s1c s2i: ERROR: tuple to be locked was already moved to another partition due to concurrent update
+step s2c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-4.out b/src/test/isolation/expected/partition-key-update-4.out
new file mode 100644
index 0000000000..363de0d69c
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-4.out
@@ -0,0 +1,29 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s1u s2donothing s3donothing s1c s2c s3select s3c
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s3donothing: <... completed>
+step s2c: COMMIT;
+step s3select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+step s3c: COMMIT;
+
+starting permutation: s2donothing s1u s3donothing s1c s2c s3select s3c
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+step s2c: COMMIT;
+step s3select: SELECT * FROM foo ORDER BY a;
+a b
+
+2 initial tuple -> moved by session-1
+step s3c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-5.out b/src/test/isolation/expected/partition-key-update-5.out
new file mode 100644
index 0000000000..42dfe64ad3
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-5.out
@@ -0,0 +1,139 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s2beginrr s3beginrr s1u s2donothing s1c s2c s3donothing s3c s2select
+step s2beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s3beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s2c: COMMIT;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING;
+step s3c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2beginrr s3beginrr s1u s3donothing s1c s3c s2donothing s2c s2select
+step s2beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s3beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+error in steps s1c s3donothing: ERROR: could not serialize access due to concurrent update
+step s3c: COMMIT;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING;
+step s2c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2beginrr s3beginrr s1u s2donothing s3donothing s1c s2c s3c s2select
+step s2beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s3beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s3donothing: <... completed>
+error in steps s1c s2donothing s3donothing: ERROR: could not serialize access due to concurrent update
+step s2c: COMMIT;
+step s3c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2beginrr s3beginrr s1u s3donothing s2donothing s1c s3c s2c s2select
+step s2beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s3beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+step s2donothing: <... completed>
+error in steps s1c s3donothing s2donothing: ERROR: could not serialize access due to concurrent update
+step s3c: COMMIT;
+step s2c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2begins s3begins s1u s2donothing s1c s2c s3donothing s3c s2select
+step s2begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s3begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s2c: COMMIT;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING;
+step s3c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2begins s3begins s1u s3donothing s1c s3c s2donothing s2c s2select
+step s2begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s3begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+error in steps s1c s3donothing: ERROR: could not serialize access due to concurrent update
+step s3c: COMMIT;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING;
+step s2c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2begins s3begins s1u s2donothing s3donothing s1c s2c s3c s2select
+step s2begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s3begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s3donothing: <... completed>
+error in steps s1c s2donothing s3donothing: ERROR: could not serialize access due to concurrent update
+step s2c: COMMIT;
+step s3c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2begins s3begins s1u s3donothing s2donothing s1c s3c s2c s2select
+step s2begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s3begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+step s2donothing: <... completed>
+error in steps s1c s3donothing s2donothing: ERROR: could not serialize access due to concurrent update
+step s3c: COMMIT;
+step s2c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 74d7d59546..26f88c50b6 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -66,3 +66,8 @@ test: async-notify
test: vacuum-reltuples
test: timeouts
test: vacuum-concurrent-drop
+test: partition-key-update-1
+test: partition-key-update-2
+test: partition-key-update-3
+test: partition-key-update-4
+test: partition-key-update-5
diff --git a/src/test/isolation/specs/partition-key-update-1.spec b/src/test/isolation/specs/partition-key-update-1.spec
new file mode 100644
index 0000000000..32d555c37c
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-1.spec
@@ -0,0 +1,39 @@
+# Concurrency error from ExecUpdate and ExecDelete.
+
+# Throw an error to indicate that the targeted row has already been moved to
+# another partition, for the concurrent case where a session tries to
+# update/delete a row that is locked by another session whose concurrent
+# update moves the tuple to another partition due to an update of the
+# partition key.
+
+setup
+{
+ CREATE TABLE foo (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'ABC');
+}
+
+teardown
+{
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1u" { UPDATE foo SET a=2 WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s2u" { UPDATE foo SET b='EFG' WHERE a=1; }
+step "s2d" { DELETE FROM foo WHERE a=1; }
+step "s2c" { COMMIT; }
+
+permutation "s1u" "s1c" "s2u" "s2c"
+permutation "s1u" "s2u" "s1c" "s2c"
+permutation "s2u" "s1u" "s2c" "s1c"
+
+permutation "s1u" "s1c" "s2d" "s2c"
+permutation "s1u" "s2d" "s1c" "s2c"
+permutation "s2d" "s1u" "s2c" "s1c"
diff --git a/src/test/isolation/specs/partition-key-update-2.spec b/src/test/isolation/specs/partition-key-update-2.spec
new file mode 100644
index 0000000000..8a952892c2
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-2.spec
@@ -0,0 +1,41 @@
+# Concurrency error from GetTupleForTrigger
+
+# Like partition-key-update-1.spec, throw an error when a session tries to
+# update a row that has been moved to another partition due to a concurrent
+# update by another session.
+
+setup
+{
+ CREATE TABLE foo (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'ABC');
+ CREATE FUNCTION func_foo_mod_a() RETURNS TRIGGER AS $$
+ BEGIN
+ NEW.a = 2; -- This is changing partition key column.
+ RETURN NEW;
+ END $$ LANGUAGE PLPGSQL;
+ CREATE TRIGGER foo_mod_a BEFORE UPDATE ON foo1
+ FOR EACH ROW EXECUTE PROCEDURE func_foo_mod_a();
+}
+
+teardown
+{
+ DROP TRIGGER foo_mod_a ON foo1;
+ DROP FUNCTION func_foo_mod_a();
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1u" { UPDATE foo SET b='EFG' WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s2u" { UPDATE foo SET b='XYZ' WHERE a=1; }
+step "s2c" { COMMIT; }
+
+permutation "s1u" "s1c" "s2u" "s2c"
+permutation "s1u" "s2u" "s1c" "s2c"
+permutation "s2u" "s1u" "s2c" "s1c"
diff --git a/src/test/isolation/specs/partition-key-update-3.spec b/src/test/isolation/specs/partition-key-update-3.spec
new file mode 100644
index 0000000000..1baa0159de
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-3.spec
@@ -0,0 +1,32 @@
+# Concurrency error from ExecLockRows
+
+# Like partition-key-update-1.spec, throw an error when a session tries to
+# lock a row that has been moved to another partition due to a concurrent
+# update by another session.
+
+setup
+{
+ CREATE TABLE foo_r (a int, b text) PARTITION BY RANGE(a);
+ CREATE TABLE foo_r1 PARTITION OF foo_r FOR VALUES FROM (1) TO (10);
+ CREATE TABLE foo_r2 PARTITION OF foo_r FOR VALUES FROM (10) TO (20);
+ INSERT INTO foo_r VALUES(7, 'ABC');
+ CREATE UNIQUE INDEX foo_r1_a_unique ON foo_r1 (a);
+ CREATE TABLE bar (a int REFERENCES foo_r1(a));
+}
+
+teardown
+{
+ DROP TABLE bar, foo_r;
+}
+
+session "s1"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1u3" { UPDATE foo_r SET a=11 WHERE a=7 AND b = 'ABC'; }
+step "s1c" { COMMIT; }
+
+session "s2"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s2i" { INSERT INTO bar VALUES(7); }
+step "s2c" { COMMIT; }
+
+permutation "s1u3" "s2i" "s1c" "s2c"
diff --git a/src/test/isolation/specs/partition-key-update-4.spec b/src/test/isolation/specs/partition-key-update-4.spec
new file mode 100644
index 0000000000..699e2e727f
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-4.spec
@@ -0,0 +1,45 @@
+# Concurrent update of a partition key and INSERT...ON CONFLICT DO NOTHING test
+#
+# This test tries to expose problems with the interaction between concurrent
+# sessions during an update of the partition key and INSERT...ON CONFLICT DO
+# NOTHING on a partitioned table.
+#
+# The convention here is that session 1 moves a row from one partition to
+# another due to an update of the partition key, session 2 always ends up
+# inserting, and session 3 always ends up doing nothing.
+#
+# Note: This test slightly resembles the insert-conflict-do-nothing test.
+
+setup
+{
+ CREATE TABLE foo (a int primary key, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'initial tuple');
+}
+
+teardown
+{
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1u" { UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s2donothing" { INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; }
+step "s2c" { COMMIT; }
+
+session "s3"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s3donothing" { INSERT INTO foo VALUES(2, 'session-3 donothing') ON CONFLICT DO NOTHING; }
+step "s3select" { SELECT * FROM foo ORDER BY a; }
+step "s3c" { COMMIT; }
+
+# Regular case where one session block-waits on another to determine if it
+# should proceed with an insert or do nothing.
+permutation "s1u" "s2donothing" "s3donothing" "s1c" "s2c" "s3select" "s3c"
+permutation "s2donothing" "s1u" "s3donothing" "s1c" "s2c" "s3select" "s3c"
diff --git a/src/test/isolation/specs/partition-key-update-5.spec b/src/test/isolation/specs/partition-key-update-5.spec
new file mode 100644
index 0000000000..a6efea1381
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-5.spec
@@ -0,0 +1,44 @@
+# Concurrent update of a partition key and INSERT...ON CONFLICT DO NOTHING
+# test on partitioned table with multiple rows in higher isolation levels.
+#
+# Note: This test resembles the insert-conflict-do-nothing-2 test
+
+setup
+{
+ CREATE TABLE foo (a int primary key, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'initial tuple');
+}
+
+teardown
+{
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1u" { UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2beginrr" { BEGIN ISOLATION LEVEL REPEATABLE READ; }
+step "s2begins" { BEGIN ISOLATION LEVEL SERIALIZABLE; }
+step "s2donothing" { INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; }
+step "s2c" { COMMIT; }
+step "s2select" { SELECT * FROM foo ORDER BY a; }
+
+session "s3"
+step "s3beginrr" { BEGIN ISOLATION LEVEL REPEATABLE READ; }
+step "s3begins" { BEGIN ISOLATION LEVEL SERIALIZABLE; }
+step "s3donothing" { INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; }
+step "s3c" { COMMIT; }
+
+permutation "s2beginrr" "s3beginrr" "s1u" "s2donothing" "s1c" "s2c" "s3donothing" "s3c" "s2select"
+permutation "s2beginrr" "s3beginrr" "s1u" "s3donothing" "s1c" "s3c" "s2donothing" "s2c" "s2select"
+permutation "s2beginrr" "s3beginrr" "s1u" "s2donothing" "s3donothing" "s1c" "s2c" "s3c" "s2select"
+permutation "s2beginrr" "s3beginrr" "s1u" "s3donothing" "s2donothing" "s1c" "s3c" "s2c" "s2select"
+permutation "s2begins" "s3begins" "s1u" "s2donothing" "s1c" "s2c" "s3donothing" "s3c" "s2select"
+permutation "s2begins" "s3begins" "s1u" "s3donothing" "s1c" "s3c" "s2donothing" "s2c" "s2select"
+permutation "s2begins" "s3begins" "s1u" "s2donothing" "s3donothing" "s1c" "s2c" "s3c" "s2select"
+permutation "s2begins" "s3begins" "s1u" "s3donothing" "s2donothing" "s1c" "s3c" "s2c" "s2select"
--
2.14.1
Hi,
On 2018-03-08 13:46:53 -0500, Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Mar 8, 2018 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
FWIW, I would also vote for (1), especially if the only way to do (2)
is stuff as outright scary as this. I would far rather have (3) than
this, because IMO, what we are looking at right now is going to make
the fallout from multixacts look like a pleasant day at the beach.
Whoa. Well, that would clearly be bad, but I don't understand why you
find this so scary. Can you explain further?
Possibly I'm crying wolf; it's hard to be sure. But I recall that nobody
was particularly afraid of multixacts when that went in, and look at all
the trouble we've had with that. Breaking fundamental invariants like
"ctid points to this tuple or its update successor" is going to cause
trouble. There's a lot of code that knows that; more than knows the
details of what's in xmax, I believe.
Given, as explained nearby, we already do store transient data in the
ctid for speculative insertions (i.e. ON CONFLICT), and it hasn't caused
even a whiff of trouble, I'm currently not inclined to see a huge issue
here. It'd be great if you could expand on your concerns here a bit, we
gotta figure out a way forward.
I think the proposed patch needs some polish (I'm e.g. unhappy with the
naming of the new macros etc), but I think otherwise it's a reasonable
attempt at solving the problem.
I would've been happier about expending an infomask bit towards this
purpose. Just eyeing what we've got, I can't help noticing that
HEAP_MOVED_OFF/HEAP_MOVED_IN couldn't possibly be set in any tuple
in a partitioned table. Perhaps making these tests depend on
partitioned-ness would be unworkably messy, but it's worth thinking
about.
They previously couldn't be set together IIRC, so we could just (mask &
(HEAP_MOVED_OFF |HEAP_MOVED_IN)) == (HEAP_MOVED_OFF |HEAP_MOVED_IN) but
that'd be permanently eating two infomask bits. For something that
doesn't in general have to be able to live on tuples, just on (at?) the
"deleted tuple at end of a chain".
Greetings,
Andres Freund
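For concreteness, the combined-bit test being discussed would look roughly
like the following standalone sketch; HEAP_MOVED_PARTITIONS and
tuple_moved_partitions are made-up names for illustration and appear in no
posted patch, while the bit values mirror those in access/htup_details.h:

#include <stdint.h>
#include <stdbool.h>

/* Pre-9.0 VACUUM FULL bits in t_infomask, kept only for binary upgrades. */
#define HEAP_MOVED_OFF			0x4000
#define HEAP_MOVED_IN			0x8000
/* Hypothetical marker: both bits set at once, which never happens today. */
#define HEAP_MOVED_PARTITIONS	(HEAP_MOVED_OFF | HEAP_MOVED_IN)

static bool
tuple_moved_partitions(uint16_t infomask)
{
	return (infomask & HEAP_MOVED_PARTITIONS) == HEAP_MOVED_PARTITIONS;
}

int
main(void)
{
	/* A tuple with only one of the bits set is not treated as moved. */
	return (tuple_moved_partitions(HEAP_MOVED_OFF | HEAP_MOVED_IN) &&
			!tuple_moved_partitions(HEAP_MOVED_OFF)) ? 0 : 1;
}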
Andres Freund <andres@anarazel.de> writes:
On 2018-03-08 13:46:53 -0500, Tom Lane wrote:
... Breaking fundamental invariants like
"ctid points to this tuple or its update successor" is going to cause
trouble. There's a lot of code that knows that; more than knows the
details of what's in xmax, I believe.
Given, as explained nearby, we already do store transient data in the
ctid for speculative insertions (i.e. ON CONFLICT), and it hasn't caused
even a whiff of trouble, I'm currently not inclined to see a huge issue
here. It'd be great if you could expand on your concerns here a bit, we
gotta figure out a way forward.
Just what I said. There's a lot of code that knows how to follow tuple
update chains, probably not all of it in core, and this will break it.
But only in seldom-exercised corner cases, which is the worst of all
possible worlds from a reliability standpoint. (I don't think ON CONFLICT
is a counterexample because, IIUC, it's not a persistent state.)
Given that there are other ways we could attack it, I think throwing away
this particular invariant is an unnecessarily risky solution.
I would've been happier about expending an infomask bit towards this
purpose. Just eyeing what we've got, I can't help noticing that
HEAP_MOVED_OFF/HEAP_MOVED_IN couldn't possibly be set in any tuple
in a partitioned table. Perhaps making these tests depend on
partitioned-ness would be unworkably messy, but it's worth thinking
about.
They previously couldn't be set together IIRC, so we could just (mask &
(HEAP_MOVED_OFF |HEAP_MOVED_IN)) == (HEAP_MOVED_OFF |HEAP_MOVED_IN) but
that'd be permanently eating two infomask bits.
Hmm. That objection only matters if we have realistic intentions of
reclaiming those bits in future, which I've not heard anyone making
serious effort towards. Rather than messing with the definition of ctid,
I'd be happier with saying that they're never going to be reclaimed, but
at least we're getting one bit's worth of real use out of them.
regards, tom lane
Hi,
On 2018-03-28 13:52:24 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
Given, as explained nearby, we already do store transient data in the
ctid for speculative insertions (i.e. ON CONFLICT), and it hasn't caused
even a whiff of trouble, I'm currently not inclined to see a huge issue
here. It'd be great if you could expand on your concerns here a bit, we
gotta figure out a way forward.
Just what I said. There's a lot of code that knows how to follow tuple
update chains, probably not all of it in core, and this will break it.
But only in seldom-exercised corner cases, which is the worst of all
possible worlds from a reliability standpoint.
How will it break it? They'll see an invalid ctid and conclude that the
tuple is dead? Without any changes that's already something that can
happen if a later tuple in the chain has been pruned away. Sure, that
code won't realize it should error out because the tuple is now in a
different partition, but neither would a infomask bit.
I think my big problem is that I just don't see what the worst that can
happen is. We'd potentially see a broken ctid chain, something that very
commonly happens, and consider the tuple to be invisible. That seems
pretty sane behaviour for unadapted code, and not any worse than other
potential solutions.
(I don't think ON CONFLICT is a counterexample because, IIUC, it's not
a persistent state.)
Hm, it can be persistent if we error out at the right moment. But it's
not super common to encounter that over a long time, I grant you
that. Not that this'd be super persistent either, vacuum/pruning would
normally remove the tuple as well, it's dead after all.
I would've been happier about expending an infomask bit towards this
purpose. Just eyeing what we've got, I can't help noticing that
HEAP_MOVED_OFF/HEAP_MOVED_IN couldn't possibly be set in any tuple
in a partitioned table. Perhaps making these tests depend on
partitioned-ness would be unworkably messy, but it's worth thinking
about.
They previously couldn't be set together IIRC, so we could just (mask &
(HEAP_MOVED_OFF |HEAP_MOVED_IN)) == (HEAP_MOVED_OFF |HEAP_MOVED_IN) but
that'd be permanently eating two infomask bits.
Hmm. That objection only matters if we have realistic intentions of
reclaiming those bits in future, which I've not heard anyone making
serious effort towards.
I plan to submit a patch early in v12 that keeps track of the last time a
table has been fully scanned (and when it was created). With part of the
goal being debuggability and part being able to reclaim things like
these bits.
Greetings,
Andres Freund
On Wed, Mar 28, 2018 at 2:12 PM, Andres Freund <andres@anarazel.de> wrote:
How will it break it? They'll see an invalid ctid and conclude that the
tuple is dead? Without any changes that's already something that can
happen if a later tuple in the chain has been pruned away. Sure, that
code won't realize it should error out because the tuple is now in a
different partition, but neither would a infomask bit.
I think my big problem is that I just don't see what the worst that can
happen is. We'd potentially see a broken ctid chain, something that very
commonly happens, and consider the tuple to be invisible. That seems
pretty sane behaviour for unadapted code, and not any worse than other
potential solutions.
This is more or less my feeling as well. I think it's better to
conserve our limited supply of infomask bits as much as we can, and I
do think that we should try to reclaimed HEAP_MOVED_IN and
HEAP_MOVED_OFF in the future instead of defining the combination of
the two of them to mean something now.
The only scenario in which I can see this patch really leading to
disaster is if there's some previous release out there where the bit
pattern chosen by this patch has some other, incompatible meaning. As
far as we know, that's not the case: this bit pattern was previously
unused. Code seeing that bit pattern could potentially therefore (1)
barf on the valid CTID, but the whole point of this is to throw an
ERROR anyway, so if that happens then we're getting basically the
right behavior with the wrong error message or (2) just treat it as a
broken CTID link, in which case the result should be pretty much the
same as if this patch hadn't been committed in the first place.
Where the multixact patch really caused us a lot of trouble is that
the implications weren't just for the heap itself -- the relevant
SLRUs became subject to new retention requirements which in turn
affected vacuum, autovacuum, and checkpoint behavior. There is no
similar problem here -- the flag indicating the problematic situation,
however it ends up being stored, doesn't point to any external data.
Now, that doesn't mean there can't be some other kind of problem with
this patch, but I don't think that we should block the patch on the
theory that it might have an undiscovered problem that destroys the
entire PostgreSQL ecosystem with no theory as to what that problem
might actually be. Modulo implementation quality, I think the risk
level of this patch is somewhat but not vastly higher than
37484ad2aacef5ec794f4dd3d5cf814475180a78, which similarly defined a
previously-unused bit pattern in the tuple header. The reason I think
this one might be somewhat riskier is because AFAICS it's not so easy
to make sure we've found all the code, even in core, that might care,
as it was in that case; and also because updates happen more than
freezing.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2018-03-06 19:57:03 +0530, Amit Kapila wrote:
On Tue, Mar 6, 2018 at 4:53 AM, Andres Freund <andres@anarazel.de> wrote:
Hi,
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 7961b4be6a..b07b7092de 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -218,6 +218,11 @@ lnext:
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
Why are we using ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE rather than
ERRCODE_T_R_SERIALIZATION_FAILURE? A lot of frameworks have builtin
logic to retry serialization failures, and this kind of thing is going
to be resolved by retrying, no?
I think it depends, in some cases retry can help in deleting the
required tuple, but in other cases like when the user tries to perform
delete on a particular partition table, it won't be successful as the
tuple would have been moved.
So? In that case the retry will not find the tuple, which'll also
resolve the issue. Preventing frameworks from dealing with this seems
like a way worse issue than that.
Greetings,
Andres Freund
On Wed, Apr 4, 2018 at 4:31 AM, Andres Freund <andres@anarazel.de> wrote:
On 2018-03-06 19:57:03 +0530, Amit Kapila wrote:
On Tue, Mar 6, 2018 at 4:53 AM, Andres Freund <andres@anarazel.de> wrote:
Hi,
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 7961b4be6a..b07b7092de 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -218,6 +218,11 @@ lnext:
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
Why are we using ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE rather than
ERRCODE_T_R_SERIALIZATION_FAILURE? A lot of frameworks have builtin
logic to retry serialization failures, and this kind of thing is going
to be resolved by retrying, no?
I think it depends, in some cases retry can help in deleting the
required tuple, but in other cases like when the user tries to perform
delete on a particular partition table, it won't be successful as the
tuple would have been moved.
So? In that case the retry will not find the tuple, which'll also
resolve the issue. Preventing frameworks from dealing with this seems
like a way worse issue than that.
The idea was just that the apps should get an error so that they can
take appropriate action (either retry or whatever they want); we don't
want to silently make it a no-delete op. The current error the patch is
throwing appears similar to what we already do in a delete/update
operation, with the difference that here we are trying to delete a moved
tuple.
heap_delete()
{
..
if (result == HeapTupleInvisible)
{
UnlockReleaseBuffer(buffer);
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("attempted to delete invisible tuple")));
}
..
}
I think if we want users to always retry on this operation, then
ERRCODE_T_R_SERIALIZATION_FAILURE is a better error code.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
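For concreteness, the "frameworks retry" argument refers to client code that
keys on SQLSTATE 40001 (serialization_failure, i.e.
ERRCODE_T_R_SERIALIZATION_FAILURE), whereas
ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE maps to 55000 and would not be
retried. A minimal libpq sketch, assuming a hypothetical helper named
exec_with_retry and glossing over the fact that a real framework would
re-issue the whole transaction after a rollback:

#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

static int
exec_with_retry(PGconn *conn, const char *sql, int max_retries)
{
	for (int attempt = 0; attempt <= max_retries; attempt++)
	{
		PGresult   *res = PQexec(conn, sql);
		ExecStatusType st = PQresultStatus(res);

		if (st == PGRES_COMMAND_OK || st == PGRES_TUPLES_OK)
		{
			PQclear(res);
			return 0;			/* done */
		}

		/* On error, retry only for SQLSTATE 40001 (serialization_failure). */
		{
			const char *sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
			int			retryable = sqlstate && strcmp(sqlstate, "40001") == 0;

			fprintf(stderr, "attempt %d: %s", attempt, PQerrorMessage(conn));
			PQclear(res);
			if (!retryable)
				return -1;		/* e.g. class 55 errors land here, no retry */
		}
	}
	return -1;
}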
Hi,
On 2018-04-02 11:26:38 -0400, Robert Haas wrote:
On Wed, Mar 28, 2018 at 2:12 PM, Andres Freund <andres@anarazel.de> wrote:
How will it break it? They'll see an invalid ctid and conclude that the
tuple is dead? Without any changes that's already something that can
happen if a later tuple in the chain has been pruned away. Sure, that
code won't realize it should error out because the tuple is now in a
different partition, but neither would a infomask bit.
I think my big problem is that I just don't see what the worst that can
happen is. We'd potentially see a broken ctid chain, something that very
commonly happens, and consider the tuple to be invisible. That seems
pretty sane behaviour for unadapted code, and not any worse than other
potential solutions.
This is more or less my feeling as well. I think it's better to
conserve our limited supply of infomask bits as much as we can, and I
do think that we should try to reclaimed HEAP_MOVED_IN and
HEAP_MOVED_OFF in the future instead of defining the combination of
the two of them to mean something now.
Yep.
It'd also make locking more complicated or require keeping more
information around in HeapUpdateFailureData. In a number of places we
currently release the buffer pin before switching over heap_lock_tuple
etc results, or there's not even a way to get at the infomask currently
(heap_update failing).
Modulo implementation quality, I think the risk
level of this patch is somewhat but not vastly higher than
37484ad2aacef5ec794f4dd3d5cf814475180a78, which similarly defined a
previously-unused bit pattern in the tuple header.
Personally I think that change was vastly riskier, because it affected
freezing and wraparounds. Which is something we've repeatedly gotten
wrong.
The reason I think this one might be somewhat riskier is because
AFAICS it's not so easy to make sure we've found all the code, even in
core, that might care, as it was in that case; and also because
updates happen more than freezing.
But the consequences of not catching a changed piece of code are fairly
harmless. And I'd say things that happen more often are actually easier
to validate than something that with default settings requires hours of
testing...
I've attached a noticeably editorialized patch:
- I'm uncomfortable with the "moved" information not being crash-safe /
replicated. Thus I added a new flag to preserve it, and removed the
masking of the moved bit in the ctid from heap_mask().
- renamed macros to not mention valid / invalid block numbers, but
rather
HeapTupleHeaderSetMovedPartitions / HeapTupleHeaderIndicatesMovedPartitions
and
ItemPointerSetMovedPartitions / ItemPointerIndicatesMovedPartitions
I'm not wedded to these names, but I'll be adamant that they're not
talking about invalid block numbers. Makes code harder to understand
imo.
- removed new assertion from heap_get_latest_tid(), it's wrong for the
case where all row versions are invisible.
- editorialized comments a bit
- added a few more assertions
I went through the existing code to make sure that
a) no checks were missed
b) to evaluate what the consequences when chasing chains would be
c) to evaluate what the consequences when we miss erroring out
WRT b), it's usually just superfluous extra work if the new checks
weren't there. I went through all callers accessing xmax (via GetRawXmax
and GetUpdateXid):
b)
- heap rewrites will keep a tuple in hashtable till end of run, then
reset the ctid to self. No real corruption, but we'd not detect
further errors when attempting to follow chain.
- EvalPlanQualFetch would fail to abort the loop and attempt to fetch the
tuple. This'll extend the relation by a single page, because P_NEW ==
InvalidBlockNumber (see the sketch after this list).
- heap_prune_chain - no changes needed (delete isn't recursed through)
- heap_get_root_tuples - same
- heap_hot_search_buffer - only continues over hot updates
- heap_lock_tuple (and subsidiary routines) - same as EvalPlanQualFetch,
would then return HeapTupleUpdated.
c)
- GetTupleForTrigger - the proper error wouldn't be raised, instead a
NULL tuple would be passed to the trigger
- EvalPlanQualFetch - a NULL tuple would be returned after the
consequences above
- RelationFindReplTupleBy* - wrong error message
- ExecLockRows - no error would be raised, continue normally
- ExecDelete() - tuple ignored without error
- ExecUpdate() - same
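The P_NEW point in the EvalPlanQualFetch item above can be restated as a tiny
standalone sketch; the two definitions are copied from storage/block.h and
storage/bufmgr.h, and the rest is illustration only:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;

#define InvalidBlockNumber	((BlockNumber) 0xFFFFFFFF)
#define P_NEW				InvalidBlockNumber	/* ReadBuffer: extend the relation */

int
main(void)
{
	/* What the invalidated ip_blkid of a "moved" ctid reads back as. */
	BlockNumber blkno = InvalidBlockNumber;

	/* An unguarded caller passing this to ReadBuffer() asks for a new page. */
	if (blkno == P_NEW)
		printf("unadapted ReadBuffer caller would extend the relation\n");
	return 0;
}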
Questions:
- I'm not perfectly happy with
"tuple to be locked was already moved to another partition due to concurrent update"
as the error message. If somebody has a better suggestion.
- should heap_get_latest_tid() error out when the chain ends in a moved
tuple? I personally think it doesn't matter much, the functionality
is so bonkers and underspecified that it doesn't matter anyway ;)
- I'm not that happy with the number of added spec test files with
number postfixes. Can't we combine them into a single file?
- as remarked elsewhere on this thread, I think the used errcode should
be a serialization failure
Greetings,
Andres Freund
Attachments:
v8-0001-Raise-error-when-affecting-tuple-moved-into-diffe.patch (text/x-diff; charset=us-ascii)
From 49108d22baad33f1aae253e7c45ac18a2c41ab33 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 4 Apr 2018 18:43:36 -0700
Subject: [PATCH v8] Raise error when affecting tuple moved into different
partition.
---
src/backend/access/heap/heapam.c | 39 ++++-
src/backend/access/heap/pruneheap.c | 6 +
src/backend/access/heap/rewriteheap.c | 1 +
src/backend/commands/trigger.c | 5 +
src/backend/executor/execMain.c | 7 +-
src/backend/executor/execReplication.c | 22 ++-
src/backend/executor/nodeLockRows.c | 5 +
src/backend/executor/nodeMerge.c | 2 +-
src/backend/executor/nodeModifyTable.c | 27 +++-
src/include/access/heapam.h | 2 +-
src/include/access/heapam_xlog.h | 1 +
src/include/access/htup_details.h | 12 +-
src/include/executor/nodeModifyTable.h | 3 +-
src/include/storage/itemptr.h | 16 ++
src/test/isolation/expected/merge-update.out | 25 ++++
.../expected/partition-key-update-1.out | 43 ++++++
.../expected/partition-key-update-2.out | 23 +++
.../expected/partition-key-update-3.out | 9 ++
.../expected/partition-key-update-4.out | 29 ++++
.../expected/partition-key-update-5.out | 139 ++++++++++++++++++
src/test/isolation/isolation_schedule | 5 +
src/test/isolation/specs/merge-update.spec | 3 +-
.../specs/partition-key-update-1.spec | 39 +++++
.../specs/partition-key-update-2.spec | 41 ++++++
.../specs/partition-key-update-3.spec | 32 ++++
.../specs/partition-key-update-4.spec | 45 ++++++
.../specs/partition-key-update-5.spec | 44 ++++++
27 files changed, 602 insertions(+), 23 deletions(-)
create mode 100644 src/test/isolation/expected/partition-key-update-1.out
create mode 100644 src/test/isolation/expected/partition-key-update-2.out
create mode 100644 src/test/isolation/expected/partition-key-update-3.out
create mode 100644 src/test/isolation/expected/partition-key-update-4.out
create mode 100644 src/test/isolation/expected/partition-key-update-5.out
create mode 100644 src/test/isolation/specs/partition-key-update-1.spec
create mode 100644 src/test/isolation/specs/partition-key-update-2.spec
create mode 100644 src/test/isolation/specs/partition-key-update-3.spec
create mode 100644 src/test/isolation/specs/partition-key-update-4.spec
create mode 100644 src/test/isolation/specs/partition-key-update-5.spec
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f96567f5d51..8ffbf6471ca 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2308,6 +2308,7 @@ heap_get_latest_tid(Relation relation,
*/
if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
+ HeapTupleHeaderIndicatesMovedPartitions(tp.t_data) ||
ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
{
UnlockReleaseBuffer(buffer);
@@ -3041,6 +3042,8 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
* crosscheck - if not InvalidSnapshot, also check tuple against this
* wait - true if should wait for any conflicting update to commit/abort
* hufd - output parameter, filled in failure cases (see below)
+ * changing_part - true iff the tuple is being moved to another partition
+ * table due to an update of the partition key. Otherwise, false.
*
* Normal, successful return value is HeapTupleMayBeUpdated, which
* actually means we did delete it. Failure return codes are
@@ -3056,7 +3059,7 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
HTSU_Result
heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd)
+ HeapUpdateFailureData *hufd, bool changing_part)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
@@ -3325,6 +3328,10 @@ l1:
/* Make sure there is no forward chain link in t_ctid */
tp.t_data->t_ctid = tp.t_self;
+ /* Signal that this is actually a move into another partition */
+ if (changing_part)
+ HeapTupleHeaderSetMovedPartitions(tp.t_data);
+
MarkBufferDirty(buffer);
/*
@@ -3342,7 +3349,11 @@ l1:
if (RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, &tp);
- xlrec.flags = all_visible_cleared ? XLH_DELETE_ALL_VISIBLE_CLEARED : 0;
+ xlrec.flags = 0;
+ if (all_visible_cleared)
+ xlrec.flags |= XLH_DELETE_ALL_VISIBLE_CLEARED;
+ if (changing_part)
+ xlrec.flags |= XLH_DELETE_IS_PARTITION_MOVE;
xlrec.infobits_set = compute_infobits(tp.t_data->t_infomask,
tp.t_data->t_infomask2);
xlrec.offnum = ItemPointerGetOffsetNumber(&tp.t_self);
@@ -3450,7 +3461,7 @@ simple_heap_delete(Relation relation, ItemPointer tid)
result = heap_delete(relation, tid,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &hufd);
+ &hufd, false /* changing_part */);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -6051,6 +6062,7 @@ l4:
next:
/* if we find the end of update chain, we're done. */
if (mytup.t_data->t_infomask & HEAP_XMAX_INVALID ||
+ HeapTupleHeaderIndicatesMovedPartitions(mytup.t_data) ||
ItemPointerEquals(&mytup.t_self, &mytup.t_data->t_ctid) ||
HeapTupleHeaderIsOnlyLocked(mytup.t_data))
{
@@ -6102,7 +6114,12 @@ static HTSU_Result
heap_lock_updated_tuple(Relation rel, HeapTuple tuple, ItemPointer ctid,
TransactionId xid, LockTupleMode mode)
{
- if (!ItemPointerEquals(&tuple->t_self, ctid))
+ /*
+ * If the tuple has not been updated, or has moved into another partition
+ * (effectively a delete) stop here.
+ */
+ if (!HeapTupleHeaderIndicatesMovedPartitions(tuple->t_data) &&
+ !ItemPointerEquals(&tuple->t_self, ctid))
{
/*
* If this is the first possibly-multixact-able operation in the
@@ -8495,8 +8512,11 @@ heap_xlog_delete(XLogReaderState *record)
if (xlrec->flags & XLH_DELETE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
- /* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = target_tid;
+ /* Make sure t_ctid is set correctly */
+ if (xlrec->flags & XLH_DELETE_IS_PARTITION_MOVE)
+ HeapTupleHeaderSetMovedPartitions(htup);
+ else
+ htup->t_ctid = target_tid;
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
@@ -9417,6 +9437,13 @@ heap_mask(char *pagedata, BlockNumber blkno)
*/
if (HeapTupleHeaderIsSpeculative(page_htup))
ItemPointerSet(&page_htup->t_ctid, blkno, off);
+
+ /*
+ * NB: Not ignoring ctid changes due to the tuple having moved
+ * (i.e. HeapTupleHeaderIndicatesMovedPartitions), because that's
+ * important information that needs to be in-sync between primary
+ * and standby, and thus is WAL logged.
+ */
}
/*
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index f67d7d15df1..c2f5343dac8 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -552,6 +552,9 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
if (!HeapTupleHeaderIsHotUpdated(htup))
break;
+ /* HOT implies it can't have moved to different partition */
+ Assert(!HeapTupleHeaderIndicatesMovedPartitions(htup));
+
/*
* Advance to next chain member.
*/
@@ -823,6 +826,9 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
if (!HeapTupleHeaderIsHotUpdated(htup))
break;
+ /* HOT implies it can't have moved to different partition */
+ Assert(!HeapTupleHeaderIndicatesMovedPartitions(htup));
+
nextoffnum = ItemPointerGetOffsetNumber(&htup->t_ctid);
priorXmax = HeapTupleHeaderGetUpdateXid(htup);
}
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 7d466c2588c..8d3c861a330 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -424,6 +424,7 @@ rewrite_heap_tuple(RewriteState state,
*/
if (!((old_tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(old_tuple->t_data)) &&
+ !HeapTupleHeaderIndicatesMovedPartitions(old_tuple->t_data) &&
!(ItemPointerEquals(&(old_tuple->t_self),
&(old_tuple->t_data->t_ctid))))
{
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index e71f921fda1..c263f3a149a 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -3308,6 +3308,11 @@ ltrmark:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
if (!ItemPointerEquals(&hufd.ctid, &tuple.t_self))
{
/* it was updated, so look at the updated version */
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index e4d9b0b3f88..69a839c9c60 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2739,6 +2739,10 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
/* Should not encounter speculative tuple on recheck */
Assert(!HeapTupleHeaderIsSpeculative(tuple.t_data));
@@ -2807,7 +2811,8 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
* As above, it should be safe to examine xmax and t_ctid without the
* buffer content lock, because they can't be changing.
*/
- if (ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
+ if (HeapTupleHeaderIndicatesMovedPartitions(tuple.t_data) ||
+ ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
{
/* deleted, so forget about it */
ReleaseBuffer(buffer);
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 971f92a938a..c90db13f9ca 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -191,9 +191,14 @@ retry:
break;
case HeapTupleUpdated:
/* XXX: Improve handling here */
- ereport(LOG,
- (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
- errmsg("concurrent update, retrying")));
+ if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
+ ereport(LOG,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update, retrying")));
+ else
+ ereport(LOG,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("concurrent update, retrying")));
goto retry;
case HeapTupleInvisible:
elog(ERROR, "attempted to lock invisible tuple");
@@ -349,9 +354,14 @@ retry:
break;
case HeapTupleUpdated:
/* XXX: Improve handling here */
- ereport(LOG,
- (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
- errmsg("concurrent update, retrying")));
+ if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
+ ereport(LOG,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update, retrying")));
+ else
+ ereport(LOG,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("concurrent update, retrying")));
goto retry;
case HeapTupleInvisible:
elog(ERROR, "attempted to lock invisible tuple");
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index b39ccf7dc13..cfe8e630d38 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -218,6 +218,11 @@ lnext:
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
if (ItemPointerEquals(&hufd.ctid, &tuple.t_self))
{
/* Tuple was deleted, so don't return it */
diff --git a/src/backend/executor/nodeMerge.c b/src/backend/executor/nodeMerge.c
index 0e0d0795d4d..7dd354dde2f 100644
--- a/src/backend/executor/nodeMerge.c
+++ b/src/backend/executor/nodeMerge.c
@@ -222,7 +222,7 @@ lmerge_matched:;
slot = ExecDelete(mtstate, tupleid, NULL,
slot, epqstate, estate,
&tuple_deleted, false, &hufd, action,
- mtstate->canSetTag);
+ mtstate->canSetTag, false /* changingPart */);
break;
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index b03db64e8e1..68d95774607 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -650,7 +650,8 @@ ExecDelete(ModifyTableState *mtstate,
bool processReturning,
HeapUpdateFailureData *hufdp,
MergeActionState *actionState,
- bool canSetTag)
+ bool canSetTag,
+ bool changingPart)
{
ResultRelInfo *resultRelInfo;
Relation resultRelationDesc;
@@ -749,7 +750,8 @@ ldelete:;
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
- &hufd);
+ &hufd,
+ changingPart);
/*
* Copy the necessary information, if the caller has asked for it. We
@@ -808,6 +810,10 @@ ldelete:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to another partition due to concurrent update")));
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
@@ -1162,7 +1168,7 @@ lreplace:;
*/
ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate,
estate, &tuple_deleted, false, hufdp, NULL,
- false);
+ false /* canSetTag */, true /* changingPart */);
/*
* For some reason if DELETE didn't happen (e.g. trigger prevented
@@ -1338,6 +1344,10 @@ lreplace:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to another partition due to concurrent update")));
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
@@ -1527,6 +1537,14 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ /*
+ * As long as we don't support an UPDATE of INSERT ON CONFLICT for
+ * a partitioned table, we shouldn't reach a case where the tuple to
+ * be locked has been moved to another partition due to a concurrent update
+ * of the partition key.
+ */
+ Assert(!ItemPointerIndicatesMovedPartitions(&hufd.ctid));
+
/*
* Tell caller to try again from the very start.
*
@@ -2269,7 +2287,8 @@ ExecModifyTable(PlanState *pstate)
case CMD_DELETE:
slot = ExecDelete(node, tupleid, oldtuple, planSlot,
&node->mt_epqstate, estate,
- NULL, true, NULL, NULL, node->canSetTag);
+ NULL, true, NULL, NULL, node->canSetTag,
+ false /* changingPart */);
break;
default:
elog(ERROR, "unknown operation");
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 608f50b0616..048d6317f79 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -167,7 +167,7 @@ extern void heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
CommandId cid, int options, BulkInsertState bistate);
extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd);
+ HeapUpdateFailureData *hufd, bool changing_part);
extern void heap_finish_speculative(Relation relation, HeapTuple tuple);
extern void heap_abort_speculative(Relation relation, HeapTuple tuple);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 700e25c36a1..3c9214da6f5 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -93,6 +93,7 @@
#define XLH_DELETE_CONTAINS_OLD_TUPLE (1<<1)
#define XLH_DELETE_CONTAINS_OLD_KEY (1<<2)
#define XLH_DELETE_IS_SUPER (1<<3)
+#define XLH_DELETE_IS_PARTITION_MOVE (1<<4)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_DELETE_CONTAINS_OLD \
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index cebaea097d1..cf56d4ace43 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -83,8 +83,10 @@
*
* A word about t_ctid: whenever a new tuple is stored on disk, its t_ctid
* is initialized with its own TID (location). If the tuple is ever updated,
- * its t_ctid is changed to point to the replacement version of the tuple.
- * Thus, a tuple is the latest version of its row iff XMAX is invalid or
+ * its t_ctid is changed to point to the replacement version of the tuple or
+ * the block number (ip_blkid) is invalidated if the tuple is moved from one
+ * partition to another partition relation due to an update of the partition
+ * key. Thus, a tuple is the latest version of its row iff XMAX is invalid or
* t_ctid points to itself (in which case, if XMAX is valid, the tuple is
* either locked or deleted). One can follow the chain of t_ctid links
* to find the newest version of the row. Beware however that VACUUM might
@@ -445,6 +447,12 @@ do { \
ItemPointerSet(&(tup)->t_ctid, token, SpecTokenOffsetNumber) \
)
+#define HeapTupleHeaderSetMovedPartitions(tup) \
+ ItemPointerSetMovedPartitions(&(tup)->t_ctid)
+
+#define HeapTupleHeaderIndicatesMovedPartitions(tup) \
+ ItemPointerIndicatesMovedPartitions(&tup->t_ctid)
+
#define HeapTupleHeaderGetDatumLength(tup) \
VARSIZE(tup)
diff --git a/src/include/executor/nodeModifyTable.h b/src/include/executor/nodeModifyTable.h
index 686cfa61710..182506ea5fd 100644
--- a/src/include/executor/nodeModifyTable.h
+++ b/src/include/executor/nodeModifyTable.h
@@ -27,7 +27,8 @@ extern TupleTableSlot *ExecDelete(ModifyTableState *mtstate,
ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *planSlot,
EPQState *epqstate, EState *estate, bool *tupleDeleted,
bool processReturning, HeapUpdateFailureData *hufdp,
- MergeActionState *actionState, bool canSetTag);
+ MergeActionState *actionState, bool canSetTag,
+ bool changingPart);
extern TupleTableSlot *ExecUpdate(ModifyTableState *mtstate,
ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
TupleTableSlot *planSlot, EPQState *epqstate, EState *estate,
diff --git a/src/include/storage/itemptr.h b/src/include/storage/itemptr.h
index 6c9ed3696b7..626c98f9691 100644
--- a/src/include/storage/itemptr.h
+++ b/src/include/storage/itemptr.h
@@ -154,6 +154,22 @@ typedef ItemPointerData *ItemPointer;
(pointer)->ip_posid = InvalidOffsetNumber \
)
+/*
+ * ItemPointerIndicatesMovedPartitions
+ * True iff the block number indicates the tuple has moved to another
+ * partition.
+ */
+#define ItemPointerIndicatesMovedPartitions(pointer) \
+ !BlockNumberIsValid(ItemPointerGetBlockNumberNoCheck(pointer))
+
+/*
+ * ItemPointerSetMovedPartitions
+ * Indicate that the item referenced by the itempointer has moved into a
+ * different partition.
+ */
+#define ItemPointerSetMovedPartitions(pointer) \
+ ItemPointerSetBlockNumber((pointer), InvalidBlockNumber)
+
/* ----------------
* externs
* ----------------
diff --git a/src/test/isolation/expected/merge-update.out b/src/test/isolation/expected/merge-update.out
index 60ae42ebd0f..321063b1a44 100644
--- a/src/test/isolation/expected/merge-update.out
+++ b/src/test/isolation/expected/merge-update.out
@@ -204,6 +204,31 @@ step pa_merge2a:
<waiting ...>
step c1: COMMIT;
step pa_merge2a: <... completed>
+error in steps c1 pa_merge2a: ERROR: tuple to be updated was already moved to another partition due to concurrent update
+step pa_select2: SELECT * FROM pa_target;
+ERROR: current transaction is aborted, commands ignored until end of transaction block
+step c2: COMMIT;
+
+starting permutation: pa_merge2 c1 pa_merge2a pa_select2 c2
+step pa_merge2:
+ MERGE INTO pa_target t
+ USING (SELECT 1 as key, 'pa_merge1' as val) s
+ ON s.key = t.key
+ WHEN NOT MATCHED THEN
+ INSERT VALUES (s.key, s.val)
+ WHEN MATCHED THEN
+ UPDATE set key = t.key + 1, val = t.val || ' updated by ' || s.val;
+
+step c1: COMMIT;
+step pa_merge2a:
+ MERGE INTO pa_target t
+ USING (SELECT 1 as key, 'pa_merge2a' as val) s
+ ON s.key = t.key
+ WHEN NOT MATCHED THEN
+ INSERT VALUES (s.key, s.val)
+ WHEN MATCHED THEN
+ UPDATE set key = t.key + 1, val = t.val || ' updated by ' || s.val;
+
step pa_select2: SELECT * FROM pa_target;
key val
diff --git a/src/test/isolation/expected/partition-key-update-1.out b/src/test/isolation/expected/partition-key-update-1.out
new file mode 100644
index 00000000000..bfbeccc852d
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-1.out
@@ -0,0 +1,43 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u s1c s2u s2c
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+step s2u: UPDATE foo SET b='EFG' WHERE a=1;
+step s2c: COMMIT;
+
+starting permutation: s1u s2u s1c s2c
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s2u: UPDATE foo SET b='EFG' WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2u: <... completed>
+error in steps s1c s2u: ERROR: tuple to be updated was already moved to another partition due to concurrent update
+step s2c: COMMIT;
+
+starting permutation: s2u s1u s2c s1c
+step s2u: UPDATE foo SET b='EFG' WHERE a=1;
+step s1u: UPDATE foo SET a=2 WHERE a=1; <waiting ...>
+step s2c: COMMIT;
+step s1u: <... completed>
+step s1c: COMMIT;
+
+starting permutation: s1u s1c s2d s2c
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+step s2d: DELETE FROM foo WHERE a=1;
+step s2c: COMMIT;
+
+starting permutation: s1u s2d s1c s2c
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s2d: DELETE FROM foo WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2d: <... completed>
+error in steps s1c s2d: ERROR: tuple to be updated was already moved to another partition due to concurrent update
+step s2c: COMMIT;
+
+starting permutation: s2d s1u s2c s1c
+step s2d: DELETE FROM foo WHERE a=1;
+step s1u: UPDATE foo SET a=2 WHERE a=1; <waiting ...>
+step s2c: COMMIT;
+step s1u: <... completed>
+step s1c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-2.out b/src/test/isolation/expected/partition-key-update-2.out
new file mode 100644
index 00000000000..06460a8da76
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-2.out
@@ -0,0 +1,23 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u s1c s2u s2c
+step s1u: UPDATE foo SET b='EFG' WHERE a=1;
+step s1c: COMMIT;
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1;
+step s2c: COMMIT;
+
+starting permutation: s1u s2u s1c s2c
+step s1u: UPDATE foo SET b='EFG' WHERE a=1;
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2u: <... completed>
+error in steps s1c s2u: ERROR: tuple to be locked was already moved to another partition due to concurrent update
+step s2c: COMMIT;
+
+starting permutation: s2u s1u s2c s1c
+step s2u: UPDATE foo SET b='XYZ' WHERE a=1;
+step s1u: UPDATE foo SET b='EFG' WHERE a=1; <waiting ...>
+step s2c: COMMIT;
+step s1u: <... completed>
+error in steps s2c s1u: ERROR: tuple to be locked was already moved to another partition due to concurrent update
+step s1c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-3.out b/src/test/isolation/expected/partition-key-update-3.out
new file mode 100644
index 00000000000..1be63dfb8be
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-3.out
@@ -0,0 +1,9 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1u3 s2i s1c s2c
+step s1u3: UPDATE foo_r SET a=11 WHERE a=7 AND b = 'ABC';
+step s2i: INSERT INTO bar VALUES(7); <waiting ...>
+step s1c: COMMIT;
+step s2i: <... completed>
+error in steps s1c s2i: ERROR: tuple to be locked was already moved to another partition due to concurrent update
+step s2c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-4.out b/src/test/isolation/expected/partition-key-update-4.out
new file mode 100644
index 00000000000..363de0d69c2
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-4.out
@@ -0,0 +1,29 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s1u s2donothing s3donothing s1c s2c s3select s3c
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s3donothing: <... completed>
+step s2c: COMMIT;
+step s3select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+step s3c: COMMIT;
+
+starting permutation: s2donothing s1u s3donothing s1c s2c s3select s3c
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+step s2c: COMMIT;
+step s3select: SELECT * FROM foo ORDER BY a;
+a b
+
+2 initial tuple -> moved by session-1
+step s3c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-5.out b/src/test/isolation/expected/partition-key-update-5.out
new file mode 100644
index 00000000000..42dfe64ad31
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-5.out
@@ -0,0 +1,139 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s2beginrr s3beginrr s1u s2donothing s1c s2c s3donothing s3c s2select
+step s2beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s3beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s2c: COMMIT;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING;
+step s3c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2beginrr s3beginrr s1u s3donothing s1c s3c s2donothing s2c s2select
+step s2beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s3beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+error in steps s1c s3donothing: ERROR: could not serialize access due to concurrent update
+step s3c: COMMIT;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING;
+step s2c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2beginrr s3beginrr s1u s2donothing s3donothing s1c s2c s3c s2select
+step s2beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s3beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s3donothing: <... completed>
+error in steps s1c s2donothing s3donothing: ERROR: could not serialize access due to concurrent update
+step s2c: COMMIT;
+step s3c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2beginrr s3beginrr s1u s3donothing s2donothing s1c s3c s2c s2select
+step s2beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s3beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+step s2donothing: <... completed>
+error in steps s1c s3donothing s2donothing: ERROR: could not serialize access due to concurrent update
+step s3c: COMMIT;
+step s2c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2begins s3begins s1u s2donothing s1c s2c s3donothing s3c s2select
+step s2begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s3begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s2c: COMMIT;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING;
+step s3c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2begins s3begins s1u s3donothing s1c s3c s2donothing s2c s2select
+step s2begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s3begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+error in steps s1c s3donothing: ERROR: could not serialize access due to concurrent update
+step s3c: COMMIT;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING;
+step s2c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2begins s3begins s1u s2donothing s3donothing s1c s2c s3c s2select
+step s2begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s3begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s3donothing: <... completed>
+error in steps s1c s2donothing s3donothing: ERROR: could not serialize access due to concurrent update
+step s2c: COMMIT;
+step s3c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2begins s3begins s1u s3donothing s2donothing s1c s3c s2c s2select
+step s2begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s3begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+step s2donothing: <... completed>
+error in steps s1c s3donothing s2donothing: ERROR: could not serialize access due to concurrent update
+step s3c: COMMIT;
+step s2c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 99dd7c6bdbf..14b9f2e7122 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -72,3 +72,8 @@ test: timeouts
test: vacuum-concurrent-drop
test: predicate-gist
test: predicate-gin
+test: partition-key-update-1
+test: partition-key-update-2
+test: partition-key-update-3
+test: partition-key-update-4
+test: partition-key-update-5
diff --git a/src/test/isolation/specs/merge-update.spec b/src/test/isolation/specs/merge-update.spec
index 64e849966ec..625b477eb9f 100644
--- a/src/test/isolation/specs/merge-update.spec
+++ b/src/test/isolation/specs/merge-update.spec
@@ -129,4 +129,5 @@ permutation "merge1" "merge2a" "a1" "select2" "c2"
permutation "merge1" "merge2b" "c1" "select2" "c2"
permutation "merge1" "merge2c" "c1" "select2" "c2"
permutation "pa_merge1" "pa_merge2a" "c1" "pa_select2" "c2"
-permutation "pa_merge2" "pa_merge2a" "c1" "pa_select2" "c2"
+permutation "pa_merge2" "pa_merge2a" "c1" "pa_select2" "c2" # fails
+permutation "pa_merge2" "c1" "pa_merge2a" "pa_select2" "c2" # succeeds
diff --git a/src/test/isolation/specs/partition-key-update-1.spec b/src/test/isolation/specs/partition-key-update-1.spec
new file mode 100644
index 00000000000..32d555c37cd
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-1.spec
@@ -0,0 +1,39 @@
+# Concurrency error from ExecUpdate and ExecDelete.
+
+# Throw an error to indicate that the targeted row has already been moved to
+# another partition, in the case where a session tries to update/delete a row
+# that is locked for a concurrent update by another session, and that
+# concurrent update moves the tuple to another partition due to an update of
+# the partition key.
+
+setup
+{
+ CREATE TABLE foo (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'ABC');
+}
+
+teardown
+{
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1u" { UPDATE foo SET a=2 WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s2u" { UPDATE foo SET b='EFG' WHERE a=1; }
+step "s2d" { DELETE FROM foo WHERE a=1; }
+step "s2c" { COMMIT; }
+
+permutation "s1u" "s1c" "s2u" "s2c"
+permutation "s1u" "s2u" "s1c" "s2c"
+permutation "s2u" "s1u" "s2c" "s1c"
+
+permutation "s1u" "s1c" "s2d" "s2c"
+permutation "s1u" "s2d" "s1c" "s2c"
+permutation "s2d" "s1u" "s2c" "s1c"
diff --git a/src/test/isolation/specs/partition-key-update-2.spec b/src/test/isolation/specs/partition-key-update-2.spec
new file mode 100644
index 00000000000..8a952892c28
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-2.spec
@@ -0,0 +1,41 @@
+# Concurrency error from GetTupleForTrigger
+
+# Like partition-key-update-1.spec, throw an error when a session tries to
+# update a row that has been moved to another partition due to a concurrent
+# update by another session.
+
+setup
+{
+ CREATE TABLE foo (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'ABC');
+ CREATE FUNCTION func_foo_mod_a() RETURNS TRIGGER AS $$
+ BEGIN
+ NEW.a = 2; -- This is changing partition key column.
+ RETURN NEW;
+ END $$ LANGUAGE PLPGSQL;
+ CREATE TRIGGER foo_mod_a BEFORE UPDATE ON foo1
+ FOR EACH ROW EXECUTE PROCEDURE func_foo_mod_a();
+}
+
+teardown
+{
+ DROP TRIGGER foo_mod_a ON foo1;
+ DROP FUNCTION func_foo_mod_a();
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1u" { UPDATE foo SET b='EFG' WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s2u" { UPDATE foo SET b='XYZ' WHERE a=1; }
+step "s2c" { COMMIT; }
+
+permutation "s1u" "s1c" "s2u" "s2c"
+permutation "s1u" "s2u" "s1c" "s2c"
+permutation "s2u" "s1u" "s2c" "s1c"
diff --git a/src/test/isolation/specs/partition-key-update-3.spec b/src/test/isolation/specs/partition-key-update-3.spec
new file mode 100644
index 00000000000..1baa0159de1
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-3.spec
@@ -0,0 +1,32 @@
+# Concurrency error from ExecLockRows
+
+# Like partition-key-update-1.spec, throw an error when a session tries to
+# lock a row that has been moved to another partition due to a concurrent
+# update by another session.
+
+setup
+{
+ CREATE TABLE foo_r (a int, b text) PARTITION BY RANGE(a);
+ CREATE TABLE foo_r1 PARTITION OF foo_r FOR VALUES FROM (1) TO (10);
+ CREATE TABLE foo_r2 PARTITION OF foo_r FOR VALUES FROM (10) TO (20);
+ INSERT INTO foo_r VALUES(7, 'ABC');
+ CREATE UNIQUE INDEX foo_r1_a_unique ON foo_r1 (a);
+ CREATE TABLE bar (a int REFERENCES foo_r1(a));
+}
+
+teardown
+{
+ DROP TABLE bar, foo_r;
+}
+
+session "s1"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1u3" { UPDATE foo_r SET a=11 WHERE a=7 AND b = 'ABC'; }
+step "s1c" { COMMIT; }
+
+session "s2"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s2i" { INSERT INTO bar VALUES(7); }
+step "s2c" { COMMIT; }
+
+permutation "s1u3" "s2i" "s1c" "s2c"
diff --git a/src/test/isolation/specs/partition-key-update-4.spec b/src/test/isolation/specs/partition-key-update-4.spec
new file mode 100644
index 00000000000..699e2e727f7
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-4.spec
@@ -0,0 +1,45 @@
+# Concurrent update of a partition key and INSERT...ON CONFLICT DO NOTHING test
+#
+# This test tries to expose problems with the interaction between concurrent
+# sessions during an update of the partition key and INSERT...ON CONFLICT DO
+# NOTHING on a partitioned table.
+#
+# The convention here is that session 1 moves a row from one partition to
+# another due to an update of the partition key, session 2 always ends up
+# inserting, and session 3 always ends up doing nothing.
+#
+# Note: This test closely resembles the insert-conflict-do-nothing test.
+
+setup
+{
+ CREATE TABLE foo (a int primary key, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'initial tuple');
+}
+
+teardown
+{
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1u" { UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s2donothing" { INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; }
+step "s2c" { COMMIT; }
+
+session "s3"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s3donothing" { INSERT INTO foo VALUES(2, 'session-3 donothing') ON CONFLICT DO NOTHING; }
+step "s3select" { SELECT * FROM foo ORDER BY a; }
+step "s3c" { COMMIT; }
+
+# Regular case where one session block-waits on another to determine if it
+# should proceed with an insert or do nothing.
+permutation "s1u" "s2donothing" "s3donothing" "s1c" "s2c" "s3select" "s3c"
+permutation "s2donothing" "s1u" "s3donothing" "s1c" "s2c" "s3select" "s3c"
diff --git a/src/test/isolation/specs/partition-key-update-5.spec b/src/test/isolation/specs/partition-key-update-5.spec
new file mode 100644
index 00000000000..a6efea13817
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-5.spec
@@ -0,0 +1,44 @@
+# Concurrent update of a partition key and INSERT...ON CONFLICT DO NOTHING
+# test on a partitioned table with multiple rows at higher isolation levels.
+#
+# Note: This test resembles the insert-conflict-do-nothing-2 test
+
+setup
+{
+ CREATE TABLE foo (a int primary key, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'initial tuple');
+}
+
+teardown
+{
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1u" { UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2beginrr" { BEGIN ISOLATION LEVEL REPEATABLE READ; }
+step "s2begins" { BEGIN ISOLATION LEVEL SERIALIZABLE; }
+step "s2donothing" { INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; }
+step "s2c" { COMMIT; }
+step "s2select" { SELECT * FROM foo ORDER BY a; }
+
+session "s3"
+step "s3beginrr" { BEGIN ISOLATION LEVEL REPEATABLE READ; }
+step "s3begins" { BEGIN ISOLATION LEVEL SERIALIZABLE; }
+step "s3donothing" { INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; }
+step "s3c" { COMMIT; }
+
+permutation "s2beginrr" "s3beginrr" "s1u" "s2donothing" "s1c" "s2c" "s3donothing" "s3c" "s2select"
+permutation "s2beginrr" "s3beginrr" "s1u" "s3donothing" "s1c" "s3c" "s2donothing" "s2c" "s2select"
+permutation "s2beginrr" "s3beginrr" "s1u" "s2donothing" "s3donothing" "s1c" "s2c" "s3c" "s2select"
+permutation "s2beginrr" "s3beginrr" "s1u" "s3donothing" "s2donothing" "s1c" "s3c" "s2c" "s2select"
+permutation "s2begins" "s3begins" "s1u" "s2donothing" "s1c" "s2c" "s3donothing" "s3c" "s2select"
+permutation "s2begins" "s3begins" "s1u" "s3donothing" "s1c" "s3c" "s2donothing" "s2c" "s2select"
+permutation "s2begins" "s3begins" "s1u" "s2donothing" "s3donothing" "s1c" "s2c" "s3c" "s2select"
+permutation "s2begins" "s3begins" "s1u" "s3donothing" "s2donothing" "s1c" "s3c" "s2c" "s2select"
--
2.17.0.rc1.dirty
On Thu, Apr 5, 2018 at 7:14 AM, Andres Freund <andres@anarazel.de> wrote:
I've attached a noticeably editorialized patch:
- I'm uncomfortable with the "moved" information not being crash-safe /
replicated. Thus I added a new flag to preserve it, and removed the
masking of the moved bit in the ctid from heap_mask().
- renamed macros to not mention valid / invalid block numbers, but
rather
HeapTupleHeaderSetMovedPartitions / HeapTupleHeaderIndicatesMovedPartitions
and
ItemPointerSetMovedPartitions / ItemPointerIndicatesMovedPartitions
I'm not wedded to these names, but I'll be adamant that they're not
talking about invalid block numbers. Makes code harder to understand
imo.
The new names for macros make the code easier to understand.
- removed new assertion from heap_get_latest_tid(), it's wrong for the
case where all row versions are invisible.
Why? tid is both an input and output parameter. The input tid is
valid and is verified at the top of the function, now if no row
version is visible, then it should have the same value as passed Tid.
I am not saying that it was super important to have that assertion,
but if it is valid then it can catch a case where we might have missed
checking a tuple that has an invalid block number (essentially the case
introduced by the patch).
I assume you are talking about the assertion below:
+
+ /* Make sure that the return value has a valid block number */
+ Assert(ItemPointerValidBlockNumber(tid));
Questions:
- I'm not perfectly happy with
"tuple to be locked was already moved to another partition due to concurrent update"
as the error message. If somebody has a better suggestion.
I don't have any better suggestion, but I have noticed a small
inconsistency in the message. In case of delete, the message is
"tuple to be updated was ...". I think here it should be "tuple to be
deleted was ...".
- should heap_get_latest_tid() error out when the chain ends in a moved
tuple?
Won't the same question apply to the similar usage in
EvalPlanQualFetch and heap_lock_updated_tuple_rec. In
EvalPlanQualFetch, we consider such a tuple to be deleted and will
silently miss/skip it which seems contradictory to the places where we
have detected such a situation and raised an error. In
heap_lock_updated_tuple_rec, we will skip locking the versions of a
tuple after we encounter a tuple version that is moved to another
partition.
- I'm not that happy with the number of added spec test files with
number postfixes. Can't we combine them into a single file?
+1 for doing so.
- as remarked elsewhere on this thread, I think the used errcode should
be a serialization failure
No problem. I was saying upthread that the used error code has some
precedent in the code for similar usage, but we have precedent for the
serialization failure error code as well, so it should be okay to use
it.
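For instance (only a sketch), the check the patch adds in ExecUpdate would
then read:

    if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
        ereport(ERROR,
                (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
                 errmsg("tuple to be updated was already moved to another partition due to concurrent update")));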
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wednesday, April 4, 2018, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Apr 5, 2018 at 7:14 AM, Andres Freund <andres@anarazel.de> wrote:
Questions:
- I'm not perfectly happy with
"tuple to be locked was already moved to another partition due toconcurrent update"
as the error message. If somebody has a better suggestion.
I don't have any better suggestion, but I have noticed a small
inconsistency in the message. In case of delete, the message is
"tuple to be updated was ...". I think here it should be "tuple to be
deleted was ..."
The whole "moved to another partition" explains why and seems better placed
in the errdetail. The error itself should indicate which attempted action
failed. And the attempted action for the end user usually isn't the scope
of "locked tuple" - it's the insert or update, the locking is a side effect
(why).
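To illustrate, something along these lines (the message wording here is only
an example, not a concrete proposal):

    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("could not update row"),
             errdetail("The row was moved to another partition by a concurrent update.")));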
David J.
On Thu, Apr 5, 2018 at 10:40 AM, David G. Johnston
<david.g.johnston@gmail.com> wrote:
On Wednesday, April 4, 2018, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Apr 5, 2018 at 7:14 AM, Andres Freund <andres@anarazel.de> wrote:
Questions:
- I'm not perfectly happy with
"tuple to be locked was already moved to another partition due to
concurrent update"
as the error message. If somebody has a better suggestion.
I don't have any better suggestion, but I have noticed a small
inconsistency in the message. In case of delete, the message is
"tuple to be updated was ...". I think here it should be "tuple to be
deleted was ..."The whole "moved to another partition" explains why and seems better placed
in the errdetail. The error itself should indicate which attempted action
failed. And the attempted action for the end user usually isn't the scope
of "locked tuple" - it's the insert or update, the locking is a side effect
(why).
I don't think locking is just a side effect, it will be used when the
user tries to lock tuple via "Select .. For Key Share"
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Apr 5, 2018 at 7:14 AM, Andres Freund <andres@anarazel.de> wrote:
I've attached a noticeably editorialized patch:
+ /*
+ * As long as we don't support an UPDATE of INSERT ON CONFLICT for
+ * a partitioned table, we shouldn't reach a case where the tuple to
+ * be locked has been moved to another partition due to a concurrent
+ * update of the partition key.
+ */
+ Assert(!ItemPointerIndicatesMovedPartitions(&hufd.ctid));
+
This is no longer true; at least not entirely. We still don't support ON
CONFLICT DO UPDATE to move a row to a different partition, but otherwise it
works now. See 555ee77a9668e3f1b03307055b5027e13bf1a715.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2018-04-04 22:10:06 -0700, David G. Johnston wrote:
On Wednesday, April 4, 2018, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Apr 5, 2018 at 7:14 AM, Andres Freund <andres@anarazel.de> wrote:
Questions:
- I'm not perfectly happy with
"tuple to be locked was already moved to another partition due toconcurrent update"
as the error message. If somebody has a better suggestions.
I don't have any better suggestion, but I have noticed a small
inconsistency in the message. In case of delete, the message is
"tuple to be updated was ...". I think here it should be "tuple to be
deleted was ..."The whole "moved to another partition" explains why and seems better placed
in the errdetail. The error itself should indicate which attempted action
failed. And the attempted action for the end user usually isn't the scope
of "locked tuple" - it's the insert or update, the locking is a side effect
(why).
Well, update/delete have their own messages, don't think you can get
this for inserts (there'd be no tuple to follow across EPQ). The case I
copied from above, was locking a tuple, hence the reference to that.
I don't agree with moving "moved to another partition" to errdetail,
that's *the* crucial detail. If there's anything in the error message,
it should be that.
Greetings,
Andres Freund
Pavan Deolasee wrote:
On Thu, Apr 5, 2018 at 7:14 AM, Andres Freund <andres@anarazel.de> wrote:
+ /*
+ * As long as we don't support an UPDATE of INSERT ON CONFLICT for
+ * a partitioned table, we shouldn't reach a case where the tuple to
+ * be locked has been moved to another partition due to a concurrent
+ * update of the partition key.
+ */
+ Assert(!ItemPointerIndicatesMovedPartitions(&hufd.ctid));
+
This is no longer true; at least not entirely. We still don't support ON
CONFLICT DO UPDATE to move a row to a different partition, but otherwise it
works now. See 555ee77a9668e3f1b03307055b5027e13bf1a715.
Right. So I think the assert() should remain, but the comment should
say "As long as we don't update moving a tuple to a different partition
during INSERT ON CONFLICT DO UPDATE on a partitioned table, ..."
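Spelled out, roughly (a sketch of the wording only, keeping the existing
Assert):

    /*
     * As long as we don't support moving a tuple to a different partition
     * during INSERT ON CONFLICT DO UPDATE on a partitioned table, we
     * shouldn't reach a case where the tuple to be locked has been moved
     * to another partition due to a concurrent update of the partition key.
     */
    Assert(!ItemPointerIndicatesMovedPartitions(&hufd.ctid));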
FWIW I think the code flow is easier to read with the renamed macros.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-04-05 10:17:59 +0530, Amit Kapila wrote:
On Thu, Apr 5, 2018 at 7:14 AM, Andres Freund <andres@anarazel.de> wrote:
Why? tid is both an input and output parameter. The input tid is
valid and is verified at the top of the function, now if no row
version is visible, then it should have the same value as passed Tid.
I am not telling that it was super important to have that assertion,
but if it is valid then it can catch a case where we might have missed
checking a tuple that has an invalid block number (essentially the case
introduced by the patch).
You're right. It's bonkers that the output parameter isn't set to an
invalid value if the tuple isn't found. Makes the whole function
entirely useless.
- I'm not perfectly happy with
"tuple to be locked was already moved to another partition due to concurrent update"
as the error message. If somebody has a better suggestion.
I don't have any better suggestion, but I have noticed a small
inconsistency in the message. In case of delete, the message is
"tuple to be updated was ...". I think here it should be "tuple to be
deleted was ...".
Yea, I noticed that too. Note that the message a few lines up is
similarly wrong:
ereport(ERROR,
(errcode(ERRCODE_TRIGGERED_DATA_CHANGE_VIOLATION),
errmsg("tuple to be updated was already modified by an operation triggered by the current command"),
errhint("Consider using an AFTER trigger instead of a BEFORE trigger to propagate changes to other rows.")));
- should heap_get_latest_tid() error out when the chain ends in a moved
tuple?
Won't the same question apply to the similar usage in
EvalPlanQualFetch and heap_lock_updated_tuple_rec.
I don't think so?
In EvalPlanQualFetch, we consider such a tuple to be deleted and will
silently miss/skip it which seems contradictory to the places where we
have detected such a situation and raised an error.
if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
In heap_lock_updated_tuple_rec, we will skip locking the versions of a
tuple after we encounter a tuple version that is moved to another
partition.
I don't think that's true? We'll not lock *any* tuple in that case, but
return HeapTupleUpdated. Which callers then interpret in whatever way
they need to?
Greetings,
Andres Freund
On Fri, Apr 6, 2018 at 1:13 AM, Andres Freund <andres@anarazel.de> wrote:
On 2018-04-05 10:17:59 +0530, Amit Kapila wrote:
On Thu, Apr 5, 2018 at 7:14 AM, Andres Freund <andres@anarazel.de> wrote:
Why? tid is both an input and output parameter. The input tid is
valid and is verified at the top of the function, now if no row
version is visible, then it should have the same value as passed Tid.
I am not telling that it was super important to have that assertion,
but if it is valid then it can catch a case where we might have missed
checking the tuple which has invalid block number (essentialy the case
introduced by the patch).You're right. It's bonkers that the output parameter isn't set to an
invalid value if the tuple isn't found. Makes the whole function
entirely useless.
Yeah, kind of, but I think the same is noted in the function comments and
the caller is prepared to deal with it. See the comments atop
heap_get_latest_tid ("Note that it will not be changed if no version of
the row passes the snapshot test."). The caller (TidNext) will just
ignore such a TID and move on to the next one. I think you have a point that
one might have designed it differently by setting the output value to an
invalid value, which would let the caller detect it easily. In short,
it's just a matter of choice whether we want to keep the Assert as Amul
has it in his patch or leave it out. It should be okay either way.
- should heap_get_latest_tid() error out when the chain ends in a moved
tuple?
Won't the same question apply to the similar usage in
EvalPlanQualFetch and heap_lock_updated_tuple_rec.
I don't think so?
In EvalPlanQualFetch, we consider such a tuple to be deleted and will
silently miss/skip it which seems contradictory to the places where we
have detected such a situation and raised an error.
if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
I was talking about the case where the tuple version is not visible, i.e.
the code below:
/*
* If we get here, the tuple was found but failed SnapshotDirty.
..
*/
if (HeapTupleHeaderIndicatesMovedPartitions(tuple.t_data) ||
ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
{
/* deleted, so forget about it */
ReleaseBuffer(buffer);
return NULL;
}
Normally, if the tuple had been updated such that it landed in the
same partition, the chain would have continued; but because the tuple
has moved to another partition, we end the chain without letting the
user know about it. See below for the repercussion of this.
In heap_lock_updated_tuple_rec, we will skip locking the versions of a
tuple after we encounter a tuple version that is moved to another
partition.
I don't think that's true? We'll not lock *any* tuple in that case,
I think what will happen is that we will end up locking some versions
in the chain and then silently skip others. See the example below:
Setup
-----------
postgres=# create table t1(c1 int, c2 varchar) partition by range(c1);
CREATE TABLE
postgres=# create table t1_part_1 partition of t1 for values from (1) to (100);
CREATE TABLE
postgres=# create table t1_part_2 partition of t1 for values from
(100) to (200);
CREATE TABLE
postgres=# insert into t1 values(1, 'aaa');
INSERT 0 1
Session-1
---------------
postgres=# begin;
BEGIN
postgres=# update t1 set c2='bbb' where c1=1;
UPDATE 1
postgres=# update t1 set c2='ccc' where c1=1;
UPDATE 1
postgres=# update t1 set c1=102 where c1=1;
UPDATE 1
Session-2
----------------
postgres=# begin;
BEGIN
postgres=# select * from t1 where c1=1 for key share;
Here, the Session-2 will lock one of the tuple versions and then wait
for Session-1 to end (as there is a conflicting update). Now, commit
the transaction in Session-1.
Session-1
---------------
Commit;
Now the Session-2 will skip the latest version of a tuple as it is
moved to another partition.
Session-2
----------------
postgres=# select * from t1 where c1=1 for key share;
c1 | c2
----+----
(0 rows)
The end result is that Session-2 locks one of the versions of the
tuple and silently skips locking the latest version. I feel that is
slightly confusing behavior compared to the current behavior when
tuple updates land in the same partition.
I think if we return an error in EvalPlanQualFetch at the place
mentioned above, the behavior will be sane.
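To be concrete, something along these lines in that branch of
EvalPlanQualFetch (just a sketch; the errcode and wording can follow whatever
we settle on elsewhere in this thread):

    if (HeapTupleHeaderIndicatesMovedPartitions(tuple.t_data))
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("tuple to be locked was already moved to another partition due to concurrent update")));

    if (ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
    {
        /* deleted, so forget about it */
        ReleaseBuffer(buffer);
        return NULL;
    }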
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Apr 5, 2018 at 10:17 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Apr 5, 2018 at 7:14 AM, Andres Freund <andres@anarazel.de> wrote:
[...]
Questions:
- I'm not perfectly happy with
"tuple to be locked was already moved to another partition due to concurrent update"
as the error message. If somebody has a better suggestion.
I don't have any better suggestion, but I have noticed a small
inconsistency in the message. In case of delete, the message is
"tuple to be updated was ...". I think here it should be "tuple to be
deleted was ...".
+1, will do the error message change in ExecDelete.
- should heap_get_latest_tid() error out when the chain ends in a moved
tuple?
Won't the same question apply to the similar usage in
EvalPlanQualFetch and heap_lock_updated_tuple_rec. In
EvalPlanQualFetch, we consider such a tuple to be deleted and will
silently miss/skip it which seems contradictory to the places where we
have detected such a situation and raised an error. In
heap_lock_updated_tuple_rec, we will skip locking the versions of a
tuple after we encounter a tuple version that is moved to another
partition.
- I'm not that happy with the number of added spec test files with
number postfixes. Can't we combine them into a single file?
+1 for doing so.
Agree, we could combine specs 1/2/3 into a single file, since they all do the
error check; for specs 4/5, imho, let them be, as they check a different
scenario (ON CONFLICT DO NOTHING on the moved tuple) and also
resemble the existing ON CONFLICT isolation tests.
Will post a rebased version of Andres' patch[1] including the aforementioned
changes within an hour, thanks.
1] /messages/by-id/20180405014439.fbezvbjrmcw64vjc@alap3.anarazel.de
Regards,
Amul
On Fri, Apr 6, 2018 at 12:07 PM, amul sul <sulamul@gmail.com> wrote:
On Thu, Apr 5, 2018 at 10:17 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Apr 5, 2018 at 7:14 AM, Andres Freund <andres@anarazel.de> wrote:
[...]
Questions:
- I'm not perfectly happy with
"tuple to be locked was already moved to another partition due to concurrent update"
as the error message. If somebody has a better suggestions.I don't have any better suggestion, but I have noticed a small
inconsistency in the message. In case of delete, the message is
"tuple to be updated was ...". I think here it should be "tuple to be
deleted was ...".+1, will do the error message change in ExecDelete.
- should heap_get_latest_tid() error out when the chain ends in a moved
tuple?Won't the same question applies to the similar usage in
EvalPlanQualFetch and heap_lock_updated_tuple_rec. In
EvalPlanQualFetch, we consider such a tuple to be deleted and will
silently miss/skip it which seems contradictory to the places where we
have detected such a situation and raised an error. In
heap_lock_updated_tuple_rec, we will skip locking the versions of a
tuple after we encounter a tuple version that is moved to another
partition.- I'm not that happy with the number of added spec test files with
number postfixes. Can't we combine them into a single file?+1 for doing so.
Agree, we could combine specs-1/2/3 into a single file which is doing the error
check and for the specs-4/5, imho, let it be, as it is checking different the
scenario of ON CONFLICT DO NOTHING on the moved tuple and also
it resembles the existing ON CONFLICT isolation tests.Will post rebase version of Andres' patch[1] including aforementioned
changes within an hour, thanks1] /messages/by-id/20180405014439.fbezvbjrmcw64vjc@alap3.anarazel.de
Updated patch attached.
Regards,
Amul
Attachments:
v9-0001-Raise-error-when-affecting-tuple-moved-into.patchapplication/octet-stream; name=v9-0001-Raise-error-when-affecting-tuple-moved-into.patchDownload
From db535593fb5ee7eac6b683a59454e30732983fb8 Mon Sep 17 00:00:00 2001
From: Amul Sul <sulamul@gmail.com>
Date: Fri, 6 Apr 2018 09:41:28 +0530
Subject: [PATCH] [PATCH v9] Raise error when affecting tuple moved into
different partition.
== CHANGES ==
It's the rebased version of Andres Freund's patch v8[1] with the
following additional changes
1. Error message changes in ExecDelete as per Amit Kapila's
suggestion[2]
2. Combine isolation test specs1/2 and 3 in the specs1
3. Argument changing_part of heap_delete renamed to ChangingPart to be
consistent with ExecDelete
== REF ==
1] https://postgr.es/m/20180405014439.fbezvbjrmcw64vjc@alap3.anarazel.de
2] https://postgr.es/m/CAAJ_b97H1hecfogVyLUZoCr_EXeTOWg5%2B2N-FUyJdcp48yXv9g%40mail.gmail.com
---
src/backend/access/heap/heapam.c | 39 +++++-
src/backend/access/heap/pruneheap.c | 6 +
src/backend/access/heap/rewriteheap.c | 1 +
src/backend/commands/trigger.c | 5 +
src/backend/executor/execMain.c | 7 +-
src/backend/executor/execMerge.c | 3 +-
src/backend/executor/execReplication.c | 22 +++-
src/backend/executor/nodeLockRows.c | 5 +
src/backend/executor/nodeModifyTable.c | 27 +++-
src/include/access/heapam.h | 2 +-
src/include/access/heapam_xlog.h | 1 +
src/include/access/htup_details.h | 12 +-
src/include/executor/nodeModifyTable.h | 3 +-
src/include/storage/itemptr.h | 16 +++
src/test/isolation/expected/merge-update.out | 25 ++++
.../isolation/expected/partition-key-update-1.out | 66 ++++++++++
.../isolation/expected/partition-key-update-2.out | 29 +++++
.../isolation/expected/partition-key-update-3.out | 139 +++++++++++++++++++++
src/test/isolation/isolation_schedule | 3 +
src/test/isolation/specs/merge-update.spec | 3 +-
.../isolation/specs/partition-key-update-1.spec | 78 ++++++++++++
.../isolation/specs/partition-key-update-2.spec | 45 +++++++
.../isolation/specs/partition-key-update-3.spec | 44 +++++++
23 files changed, 558 insertions(+), 23 deletions(-)
create mode 100644 src/test/isolation/expected/partition-key-update-1.out
create mode 100644 src/test/isolation/expected/partition-key-update-2.out
create mode 100644 src/test/isolation/expected/partition-key-update-3.out
create mode 100644 src/test/isolation/specs/partition-key-update-1.spec
create mode 100644 src/test/isolation/specs/partition-key-update-2.spec
create mode 100644 src/test/isolation/specs/partition-key-update-3.spec
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f96567f5d5..28776dbff3 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2308,6 +2308,7 @@ heap_get_latest_tid(Relation relation,
*/
if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
+ HeapTupleHeaderIndicatesMovedPartitions(tp.t_data) ||
ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
{
UnlockReleaseBuffer(buffer);
@@ -3041,6 +3042,8 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
* crosscheck - if not InvalidSnapshot, also check tuple against this
* wait - true if should wait for any conflicting update to commit/abort
* hufd - output parameter, filled in failure cases (see below)
+ * changingPart - true iff the tuple is being moved to another partition
+ * table due to an update of the partition key. Otherwise, false.
*
* Normal, successful return value is HeapTupleMayBeUpdated, which
* actually means we did delete it. Failure return codes are
@@ -3056,7 +3059,7 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
HTSU_Result
heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd)
+ HeapUpdateFailureData *hufd, bool changingPart)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
@@ -3325,6 +3328,10 @@ l1:
/* Make sure there is no forward chain link in t_ctid */
tp.t_data->t_ctid = tp.t_self;
+ /* Signal that this is actually a move into another partition */
+ if (changingPart)
+ HeapTupleHeaderSetMovedPartitions(tp.t_data);
+
MarkBufferDirty(buffer);
/*
@@ -3342,7 +3349,11 @@ l1:
if (RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, &tp);
- xlrec.flags = all_visible_cleared ? XLH_DELETE_ALL_VISIBLE_CLEARED : 0;
+ xlrec.flags = 0;
+ if (all_visible_cleared)
+ xlrec.flags |= XLH_DELETE_ALL_VISIBLE_CLEARED;
+ if (changingPart)
+ xlrec.flags |= XLH_DELETE_IS_PARTITION_MOVE;
xlrec.infobits_set = compute_infobits(tp.t_data->t_infomask,
tp.t_data->t_infomask2);
xlrec.offnum = ItemPointerGetOffsetNumber(&tp.t_self);
@@ -3450,7 +3461,7 @@ simple_heap_delete(Relation relation, ItemPointer tid)
result = heap_delete(relation, tid,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &hufd);
+ &hufd, false /* changingPart */);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -6051,6 +6062,7 @@ l4:
next:
/* if we find the end of update chain, we're done. */
if (mytup.t_data->t_infomask & HEAP_XMAX_INVALID ||
+ HeapTupleHeaderIndicatesMovedPartitions(mytup.t_data) ||
ItemPointerEquals(&mytup.t_self, &mytup.t_data->t_ctid) ||
HeapTupleHeaderIsOnlyLocked(mytup.t_data))
{
@@ -6102,7 +6114,12 @@ static HTSU_Result
heap_lock_updated_tuple(Relation rel, HeapTuple tuple, ItemPointer ctid,
TransactionId xid, LockTupleMode mode)
{
- if (!ItemPointerEquals(&tuple->t_self, ctid))
+ /*
+ * If the tuple has not been updated, or has moved into another partition
+ * (effectively a delete) stop here.
+ */
+ if (!HeapTupleHeaderIndicatesMovedPartitions(tuple->t_data) &&
+ !ItemPointerEquals(&tuple->t_self, ctid))
{
/*
* If this is the first possibly-multixact-able operation in the
@@ -8495,8 +8512,11 @@ heap_xlog_delete(XLogReaderState *record)
if (xlrec->flags & XLH_DELETE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
- /* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = target_tid;
+ /* Make sure t_ctid is set correctly */
+ if (xlrec->flags & XLH_DELETE_IS_PARTITION_MOVE)
+ HeapTupleHeaderSetMovedPartitions(htup);
+ else
+ htup->t_ctid = target_tid;
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
@@ -9417,6 +9437,13 @@ heap_mask(char *pagedata, BlockNumber blkno)
*/
if (HeapTupleHeaderIsSpeculative(page_htup))
ItemPointerSet(&page_htup->t_ctid, blkno, off);
+
+ /*
+ * NB: Not ignoring ctid changes due to the tuple having moved
+ * (i.e. HeapTupleHeaderIndicatesMovedPartitions), because that's
+ * important information that needs to be in-sync between primary
+ * and standby, and thus is WAL logged.
+ */
}
/*
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index f67d7d15df..c2f5343dac 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -552,6 +552,9 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
if (!HeapTupleHeaderIsHotUpdated(htup))
break;
+ /* HOT implies it can't have moved to different partition */
+ Assert(!HeapTupleHeaderIndicatesMovedPartitions(htup));
+
/*
* Advance to next chain member.
*/
@@ -823,6 +826,9 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
if (!HeapTupleHeaderIsHotUpdated(htup))
break;
+ /* HOT implies it can't have moved to different partition */
+ Assert(!HeapTupleHeaderIndicatesMovedPartitions(htup));
+
nextoffnum = ItemPointerGetOffsetNumber(&htup->t_ctid);
priorXmax = HeapTupleHeaderGetUpdateXid(htup);
}
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 7d466c2588..8d3c861a33 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -424,6 +424,7 @@ rewrite_heap_tuple(RewriteState state,
*/
if (!((old_tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(old_tuple->t_data)) &&
+ !HeapTupleHeaderIndicatesMovedPartitions(old_tuple->t_data) &&
!(ItemPointerEquals(&(old_tuple->t_self),
&(old_tuple->t_data->t_ctid))))
{
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index a189356cad..2aab3ce77a 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -3314,6 +3314,11 @@ ltrmark:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
if (!ItemPointerEquals(&hufd.ctid, &tuple.t_self))
{
/* it was updated, so look at the updated version */
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index e4d9b0b3f8..69a839c9c6 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2739,6 +2739,10 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
/* Should not encounter speculative tuple on recheck */
Assert(!HeapTupleHeaderIsSpeculative(tuple.t_data));
@@ -2807,7 +2811,8 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
* As above, it should be safe to examine xmax and t_ctid without the
* buffer content lock, because they can't be changing.
*/
- if (ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
+ if (HeapTupleHeaderIndicatesMovedPartitions(tuple.t_data) ||
+ ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
{
/* deleted, so forget about it */
ReleaseBuffer(buffer);
diff --git a/src/backend/executor/execMerge.c b/src/backend/executor/execMerge.c
index d39ddd3034..d75d7e5ab2 100644
--- a/src/backend/executor/execMerge.c
+++ b/src/backend/executor/execMerge.c
@@ -324,7 +324,8 @@ lmerge_matched:;
slot = ExecDelete(mtstate, tupleid, NULL,
slot, epqstate, estate,
&tuple_deleted, false, &hufd, action,
- mtstate->canSetTag);
+ mtstate->canSetTag,
+ false /* changingPart */);
break;
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 971f92a938..c90db13f9c 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -191,9 +191,14 @@ retry:
break;
case HeapTupleUpdated:
/* XXX: Improve handling here */
- ereport(LOG,
- (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
- errmsg("concurrent update, retrying")));
+ if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
+ ereport(LOG,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update, retrying")));
+ else
+ ereport(LOG,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("concurrent update, retrying")));
goto retry;
case HeapTupleInvisible:
elog(ERROR, "attempted to lock invisible tuple");
@@ -349,9 +354,14 @@ retry:
break;
case HeapTupleUpdated:
/* XXX: Improve handling here */
- ereport(LOG,
- (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
- errmsg("concurrent update, retrying")));
+ if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
+ ereport(LOG,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update, retrying")));
+ else
+ ereport(LOG,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("concurrent update, retrying")));
goto retry;
case HeapTupleInvisible:
elog(ERROR, "attempted to lock invisible tuple");
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index b39ccf7dc1..cfe8e630d3 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -218,6 +218,11 @@ lnext:
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
if (ItemPointerEquals(&hufd.ctid, &tuple.t_self))
{
/* Tuple was deleted, so don't return it */
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 0ebf37bd24..af2d473ee3 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -645,7 +645,8 @@ ExecDelete(ModifyTableState *mtstate,
bool processReturning,
HeapUpdateFailureData *hufdp,
MergeActionState *actionState,
- bool canSetTag)
+ bool canSetTag,
+ bool changingPart)
{
ResultRelInfo *resultRelInfo;
Relation resultRelationDesc;
@@ -744,7 +745,8 @@ ldelete:;
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
- &hufd);
+ &hufd,
+ changingPart);
/*
* Copy the necessary information, if the caller has asked for it. We
@@ -803,6 +805,10 @@ ldelete:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be deleted was already moved to another partition due to concurrent update")));
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
@@ -1157,7 +1163,7 @@ lreplace:;
*/
ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate,
estate, &tuple_deleted, false, hufdp, NULL,
- false);
+ false /* canSetTag */, true /* changingPart */);
/*
* For some reason if DELETE didn't happen (e.g. trigger prevented
@@ -1333,6 +1339,10 @@ lreplace:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to another partition due to concurrent update")));
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
@@ -1522,6 +1532,14 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ /*
+ * As long as we don't support INSERT ON CONFLICT DO UPDATE for a
+ * partitioned table, we shouldn't reach a case where the tuple to
+ * be locked has been moved to another partition due to a concurrent
+ * update of the partition key.
+ */
+ Assert(!ItemPointerIndicatesMovedPartitions(&hufd.ctid));
+
/*
* Tell caller to try again from the very start.
*
@@ -2264,7 +2282,8 @@ ExecModifyTable(PlanState *pstate)
case CMD_DELETE:
slot = ExecDelete(node, tupleid, oldtuple, planSlot,
&node->mt_epqstate, estate,
- NULL, true, NULL, NULL, node->canSetTag);
+ NULL, true, NULL, NULL, node->canSetTag,
+ false /* changingPart */);
break;
default:
elog(ERROR, "unknown operation");
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 608f50b061..7d756f20b0 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -167,7 +167,7 @@ extern void heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
CommandId cid, int options, BulkInsertState bistate);
extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd);
+ HeapUpdateFailureData *hufd, bool changingPart);
extern void heap_finish_speculative(Relation relation, HeapTuple tuple);
extern void heap_abort_speculative(Relation relation, HeapTuple tuple);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 700e25c36a..3c9214da6f 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -93,6 +93,7 @@
#define XLH_DELETE_CONTAINS_OLD_TUPLE (1<<1)
#define XLH_DELETE_CONTAINS_OLD_KEY (1<<2)
#define XLH_DELETE_IS_SUPER (1<<3)
+#define XLH_DELETE_IS_PARTITION_MOVE (1<<4)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_DELETE_CONTAINS_OLD \
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index cebaea097d..cf56d4ace4 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -83,8 +83,10 @@
*
* A word about t_ctid: whenever a new tuple is stored on disk, its t_ctid
* is initialized with its own TID (location). If the tuple is ever updated,
- * its t_ctid is changed to point to the replacement version of the tuple.
- * Thus, a tuple is the latest version of its row iff XMAX is invalid or
+ * its t_ctid is changed to point to the replacement version of the tuple or
+ * the block number (ip_blkid) is invalidated if the tuple is moved from one
+ * partition to another partition relation due to an update of the partition
+ * key. Thus, a tuple is the latest version of its row iff XMAX is invalid or
* t_ctid points to itself (in which case, if XMAX is valid, the tuple is
* either locked or deleted). One can follow the chain of t_ctid links
* to find the newest version of the row. Beware however that VACUUM might
@@ -445,6 +447,12 @@ do { \
ItemPointerSet(&(tup)->t_ctid, token, SpecTokenOffsetNumber) \
)
+#define HeapTupleHeaderSetMovedPartitions(tup) \
+ ItemPointerSetMovedPartitions(&(tup)->t_ctid)
+
+#define HeapTupleHeaderIndicatesMovedPartitions(tup) \
+ ItemPointerIndicatesMovedPartitions(&tup->t_ctid)
+
#define HeapTupleHeaderGetDatumLength(tup) \
VARSIZE(tup)
diff --git a/src/include/executor/nodeModifyTable.h b/src/include/executor/nodeModifyTable.h
index 94fd60c38c..7e9ab3cb6b 100644
--- a/src/include/executor/nodeModifyTable.h
+++ b/src/include/executor/nodeModifyTable.h
@@ -27,7 +27,8 @@ extern TupleTableSlot *ExecDelete(ModifyTableState *mtstate,
ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *planSlot,
EPQState *epqstate, EState *estate, bool *tupleDeleted,
bool processReturning, HeapUpdateFailureData *hufdp,
- MergeActionState *actionState, bool canSetTag);
+ MergeActionState *actionState, bool canSetTag,
+ bool changingPart);
extern TupleTableSlot *ExecUpdate(ModifyTableState *mtstate,
ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
TupleTableSlot *planSlot, EPQState *epqstate, EState *estate,
diff --git a/src/include/storage/itemptr.h b/src/include/storage/itemptr.h
index 6c9ed3696b..626c98f969 100644
--- a/src/include/storage/itemptr.h
+++ b/src/include/storage/itemptr.h
@@ -154,6 +154,22 @@ typedef ItemPointerData *ItemPointer;
(pointer)->ip_posid = InvalidOffsetNumber \
)
+/*
+ * ItemPointerIndicatesMovedPartitions
+ * True iff the block number indicates the tuple has moved to another
+ * partition.
+ */
+#define ItemPointerIndicatesMovedPartitions(pointer) \
+ !BlockNumberIsValid(ItemPointerGetBlockNumberNoCheck(pointer))
+
+/*
+ * ItemPointerSetMovedPartitions
+ * Indicate that the item referenced by the itempointer has moved into a
+ * different partition.
+ */
+#define ItemPointerSetMovedPartitions(pointer) \
+ ItemPointerSetBlockNumber((pointer), InvalidBlockNumber)
+
/* ----------------
* externs
* ----------------
diff --git a/src/test/isolation/expected/merge-update.out b/src/test/isolation/expected/merge-update.out
index 60ae42ebd0..00069a3e45 100644
--- a/src/test/isolation/expected/merge-update.out
+++ b/src/test/isolation/expected/merge-update.out
@@ -204,6 +204,31 @@ step pa_merge2a:
<waiting ...>
step c1: COMMIT;
step pa_merge2a: <... completed>
+error in steps c1 pa_merge2a: ERROR: tuple to be deleted was already moved to another partition due to concurrent update
+step pa_select2: SELECT * FROM pa_target;
+ERROR: current transaction is aborted, commands ignored until end of transaction block
+step c2: COMMIT;
+
+starting permutation: pa_merge2 c1 pa_merge2a pa_select2 c2
+step pa_merge2:
+ MERGE INTO pa_target t
+ USING (SELECT 1 as key, 'pa_merge1' as val) s
+ ON s.key = t.key
+ WHEN NOT MATCHED THEN
+ INSERT VALUES (s.key, s.val)
+ WHEN MATCHED THEN
+ UPDATE set key = t.key + 1, val = t.val || ' updated by ' || s.val;
+
+step c1: COMMIT;
+step pa_merge2a:
+ MERGE INTO pa_target t
+ USING (SELECT 1 as key, 'pa_merge2a' as val) s
+ ON s.key = t.key
+ WHEN NOT MATCHED THEN
+ INSERT VALUES (s.key, s.val)
+ WHEN MATCHED THEN
+ UPDATE set key = t.key + 1, val = t.val || ' updated by ' || s.val;
+
step pa_select2: SELECT * FROM pa_target;
key val
diff --git a/src/test/isolation/expected/partition-key-update-1.out b/src/test/isolation/expected/partition-key-update-1.out
new file mode 100644
index 0000000000..af92fbe1f7
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-1.out
@@ -0,0 +1,66 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1b s2b s1u s1c s2d s2c
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+step s2d: DELETE FROM foo WHERE a=1;
+step s2c: COMMIT;
+
+starting permutation: s1b s2b s1u s2d s1c s2c
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s2d: DELETE FROM foo WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2d: <... completed>
+error in steps s1c s2d: ERROR: tuple to be deleted was already moved to another partition due to concurrent update
+step s2c: COMMIT;
+
+starting permutation: s1b s2b s2d s1u s2c s1c
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2d: DELETE FROM foo WHERE a=1;
+step s1u: UPDATE foo SET a=2 WHERE a=1; <waiting ...>
+step s2c: COMMIT;
+step s1u: <... completed>
+step s1c: COMMIT;
+
+starting permutation: s1b s2b s1u2 s1c s2u2 s2c
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s1u2: UPDATE footrg SET b='EFG' WHERE a=1;
+step s1c: COMMIT;
+step s2u2: UPDATE footrg SET b='XYZ' WHERE a=1;
+step s2c: COMMIT;
+
+starting permutation: s1b s2b s1u2 s2u2 s1c s2c
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s1u2: UPDATE footrg SET b='EFG' WHERE a=1;
+step s2u2: UPDATE footrg SET b='XYZ' WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2u2: <... completed>
+error in steps s1c s2u2: ERROR: tuple to be locked was already moved to another partition due to concurrent update
+step s2c: COMMIT;
+
+starting permutation: s1b s2b s2u2 s1u2 s2c s1c
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2u2: UPDATE footrg SET b='XYZ' WHERE a=1;
+step s1u2: UPDATE footrg SET b='EFG' WHERE a=1; <waiting ...>
+step s2c: COMMIT;
+step s1u2: <... completed>
+error in steps s2c s1u2: ERROR: tuple to be locked was already moved to another partition due to concurrent update
+step s1c: COMMIT;
+
+starting permutation: s1b s2b s1u3 s2i s1c s2c
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s1u3: UPDATE foo_rang_parted SET a=11 WHERE a=7 AND b = 'ABC';
+step s2i: INSERT INTO bar VALUES(7); <waiting ...>
+step s1c: COMMIT;
+step s2i: <... completed>
+error in steps s1c s2i: ERROR: tuple to be locked was already moved to another partition due to concurrent update
+step s2c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-2.out b/src/test/isolation/expected/partition-key-update-2.out
new file mode 100644
index 0000000000..363de0d69c
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-2.out
@@ -0,0 +1,29 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s1u s2donothing s3donothing s1c s2c s3select s3c
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s3donothing: <... completed>
+step s2c: COMMIT;
+step s3select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+step s3c: COMMIT;
+
+starting permutation: s2donothing s1u s3donothing s1c s2c s3select s3c
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+step s2c: COMMIT;
+step s3select: SELECT * FROM foo ORDER BY a;
+a b
+
+2 initial tuple -> moved by session-1
+step s3c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-3.out b/src/test/isolation/expected/partition-key-update-3.out
new file mode 100644
index 0000000000..42dfe64ad3
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-3.out
@@ -0,0 +1,139 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s2beginrr s3beginrr s1u s2donothing s1c s2c s3donothing s3c s2select
+step s2beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s3beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s2c: COMMIT;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING;
+step s3c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2beginrr s3beginrr s1u s3donothing s1c s3c s2donothing s2c s2select
+step s2beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s3beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+error in steps s1c s3donothing: ERROR: could not serialize access due to concurrent update
+step s3c: COMMIT;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING;
+step s2c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2beginrr s3beginrr s1u s2donothing s3donothing s1c s2c s3c s2select
+step s2beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s3beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s3donothing: <... completed>
+error in steps s1c s2donothing s3donothing: ERROR: could not serialize access due to concurrent update
+step s2c: COMMIT;
+step s3c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2beginrr s3beginrr s1u s3donothing s2donothing s1c s3c s2c s2select
+step s2beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s3beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+step s2donothing: <... completed>
+error in steps s1c s3donothing s2donothing: ERROR: could not serialize access due to concurrent update
+step s3c: COMMIT;
+step s2c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2begins s3begins s1u s2donothing s1c s2c s3donothing s3c s2select
+step s2begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s3begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s2c: COMMIT;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING;
+step s3c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2begins s3begins s1u s3donothing s1c s3c s2donothing s2c s2select
+step s2begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s3begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+error in steps s1c s3donothing: ERROR: could not serialize access due to concurrent update
+step s3c: COMMIT;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING;
+step s2c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2begins s3begins s1u s2donothing s3donothing s1c s2c s3c s2select
+step s2begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s3begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s3donothing: <... completed>
+error in steps s1c s2donothing s3donothing: ERROR: could not serialize access due to concurrent update
+step s2c: COMMIT;
+step s3c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2begins s3begins s1u s3donothing s2donothing s1c s3c s2c s2select
+step s2begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s3begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+step s2donothing: <... completed>
+error in steps s1c s3donothing s2donothing: ERROR: could not serialize access due to concurrent update
+step s3c: COMMIT;
+step s2c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 31900cb920..fdff58deb9 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -72,6 +72,9 @@ test: timeouts
test: vacuum-concurrent-drop
test: predicate-gist
test: predicate-gin
+test: partition-key-update-1
+test: partition-key-update-2
+test: partition-key-update-3
# The checksum_enable suite will enable checksums for the cluster so should
# not run before anything expecting the cluster to have checksums turned off
test: checksum_cancel
diff --git a/src/test/isolation/specs/merge-update.spec b/src/test/isolation/specs/merge-update.spec
index 64e849966e..625b477eb9 100644
--- a/src/test/isolation/specs/merge-update.spec
+++ b/src/test/isolation/specs/merge-update.spec
@@ -129,4 +129,5 @@ permutation "merge1" "merge2a" "a1" "select2" "c2"
permutation "merge1" "merge2b" "c1" "select2" "c2"
permutation "merge1" "merge2c" "c1" "select2" "c2"
permutation "pa_merge1" "pa_merge2a" "c1" "pa_select2" "c2"
-permutation "pa_merge2" "pa_merge2a" "c1" "pa_select2" "c2"
+permutation "pa_merge2" "pa_merge2a" "c1" "pa_select2" "c2" # fails
+permutation "pa_merge2" "c1" "pa_merge2a" "pa_select2" "c2" # succeeds
diff --git a/src/test/isolation/specs/partition-key-update-1.spec b/src/test/isolation/specs/partition-key-update-1.spec
new file mode 100644
index 0000000000..2cccffa7b4
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-1.spec
@@ -0,0 +1,78 @@
+# Throw an error to indicate that the targeted row has already been moved to
+# another partition, for the concurrent case where a session trying to
+# lock/update/delete a row blocks on another session whose concurrent update
+# of the partition key moves the tuple to another partition.
+
+setup
+{
+ --
+ -- Setup to test an error from ExecUpdate and ExecDelete.
+ --
+ CREATE TABLE foo (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'ABC');
+
+ --
+ -- Setup to test an error from GetTupleForTrigger
+ --
+ CREATE TABLE footrg (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE footrg1 PARTITION OF footrg FOR VALUES IN (1);
+ CREATE TABLE footrg2 PARTITION OF footrg FOR VALUES IN (2);
+ INSERT INTO footrg VALUES (1, 'ABC');
+ CREATE FUNCTION func_footrg_mod_a() RETURNS TRIGGER AS $$
+ BEGIN
+ NEW.a = 2; -- This is changing partition key column.
+ RETURN NEW;
+ END $$ LANGUAGE PLPGSQL;
+ CREATE TRIGGER footrg_mod_a BEFORE UPDATE ON footrg1
+ FOR EACH ROW EXECUTE PROCEDURE func_footrg_mod_a();
+
+ --
+ -- Setup to test an error from ExecLockRows
+ --
+ CREATE TABLE foo_rang_parted (a int, b text) PARTITION BY RANGE(a);
+ CREATE TABLE foo_rang_parted1 PARTITION OF foo_rang_parted FOR VALUES FROM (1) TO (10);
+ CREATE TABLE foo_rang_parted2 PARTITION OF foo_rang_parted FOR VALUES FROM (10) TO (20);
+ INSERT INTO foo_rang_parted VALUES(7, 'ABC');
+ CREATE UNIQUE INDEX foo_rang_parted1_a_unique ON foo_rang_parted1 (a);
+ CREATE TABLE bar (a int REFERENCES foo_rang_parted1(a));
+}
+
+teardown
+{
+ DROP TABLE foo;
+ DROP TRIGGER footrg_mod_a ON footrg1;
+ DROP FUNCTION func_footrg_mod_a();
+ DROP TABLE footrg;
+ DROP TABLE bar, foo_rang_parted;
+}
+
+session "s1"
+step "s1b" { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1u" { UPDATE foo SET a=2 WHERE a=1; }
+step "s1u2" { UPDATE footrg SET b='EFG' WHERE a=1; }
+step "s1u3" { UPDATE foo_rang_parted SET a=11 WHERE a=7 AND b = 'ABC'; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2b" { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s2u" { UPDATE foo SET b='EFG' WHERE a=1; }
+step "s2u2" { UPDATE footrg SET b='XYZ' WHERE a=1; }
+step "s2i" { INSERT INTO bar VALUES(7); }
+step "s2d" { DELETE FROM foo WHERE a=1; }
+step "s2c" { COMMIT; }
+
+# Concurrency error from ExecUpdate and ExecDelete.
+permutation "s1b" "s2b" "s1u" "s1c" "s2d" "s2c"
+permutation "s1b" "s2b" "s1u" "s2d" "s1c" "s2c"
+permutation "s1b" "s2b" "s2d" "s1u" "s2c" "s1c"
+
+# Concurrency error from GetTupleForTrigger
+permutation "s1b" "s2b" "s1u2" "s1c" "s2u2" "s2c"
+permutation "s1b" "s2b" "s1u2" "s2u2" "s1c" "s2c"
+permutation "s1b" "s2b" "s2u2" "s1u2" "s2c" "s1c"
+
+# Concurrency error from ExecLockRows
+permutation "s1b" "s2b" "s1u3" "s2i" "s1c" "s2c"
diff --git a/src/test/isolation/specs/partition-key-update-2.spec b/src/test/isolation/specs/partition-key-update-2.spec
new file mode 100644
index 0000000000..699e2e727f
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-2.spec
@@ -0,0 +1,45 @@
+# Concurrent update of a partition key and INSERT...ON CONFLICT DO NOTHING test
+#
+# This test tries to expose problems with the interaction between concurrent
+# sessions during an update of the partition key and INSERT...ON CONFLICT DO
+# NOTHING on a partitioned table.
+#
+# The convention here is that session 1 moves a row from one partition to
+# another due to an update of the partition key, session 2 always ends up
+# inserting, and session 3 always ends up doing nothing.
+#
+# Note: This test closely resembles the insert-conflict-do-nothing test.
+
+setup
+{
+ CREATE TABLE foo (a int primary key, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'initial tuple');
+}
+
+teardown
+{
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1u" { UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s2donothing" { INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; }
+step "s2c" { COMMIT; }
+
+session "s3"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s3donothing" { INSERT INTO foo VALUES(2, 'session-3 donothing') ON CONFLICT DO NOTHING; }
+step "s3select" { SELECT * FROM foo ORDER BY a; }
+step "s3c" { COMMIT; }
+
+# Regular case where one session block-waits on another to determine if it
+# should proceed with an insert or do nothing.
+permutation "s1u" "s2donothing" "s3donothing" "s1c" "s2c" "s3select" "s3c"
+permutation "s2donothing" "s1u" "s3donothing" "s1c" "s2c" "s3select" "s3c"
diff --git a/src/test/isolation/specs/partition-key-update-3.spec b/src/test/isolation/specs/partition-key-update-3.spec
new file mode 100644
index 0000000000..a6efea1381
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-3.spec
@@ -0,0 +1,44 @@
+# Concurrent update of a partition key and INSERT...ON CONFLICT DO NOTHING
+# test on a partitioned table with multiple rows at higher isolation levels.
+#
+# Note: This test resembles the insert-conflict-do-nothing-2 test.
+
+setup
+{
+ CREATE TABLE foo (a int primary key, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'initial tuple');
+}
+
+teardown
+{
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1u" { UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2beginrr" { BEGIN ISOLATION LEVEL REPEATABLE READ; }
+step "s2begins" { BEGIN ISOLATION LEVEL SERIALIZABLE; }
+step "s2donothing" { INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; }
+step "s2c" { COMMIT; }
+step "s2select" { SELECT * FROM foo ORDER BY a; }
+
+session "s3"
+step "s3beginrr" { BEGIN ISOLATION LEVEL REPEATABLE READ; }
+step "s3begins" { BEGIN ISOLATION LEVEL SERIALIZABLE; }
+step "s3donothing" { INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; }
+step "s3c" { COMMIT; }
+
+permutation "s2beginrr" "s3beginrr" "s1u" "s2donothing" "s1c" "s2c" "s3donothing" "s3c" "s2select"
+permutation "s2beginrr" "s3beginrr" "s1u" "s3donothing" "s1c" "s3c" "s2donothing" "s2c" "s2select"
+permutation "s2beginrr" "s3beginrr" "s1u" "s2donothing" "s3donothing" "s1c" "s2c" "s3c" "s2select"
+permutation "s2beginrr" "s3beginrr" "s1u" "s3donothing" "s2donothing" "s1c" "s3c" "s2c" "s2select"
+permutation "s2begins" "s3begins" "s1u" "s2donothing" "s1c" "s2c" "s3donothing" "s3c" "s2select"
+permutation "s2begins" "s3begins" "s1u" "s3donothing" "s1c" "s3c" "s2donothing" "s2c" "s2select"
+permutation "s2begins" "s3begins" "s1u" "s2donothing" "s3donothing" "s1c" "s2c" "s3c" "s2select"
+permutation "s2begins" "s3begins" "s1u" "s3donothing" "s2donothing" "s1c" "s3c" "s2c" "s2select"
--
2.14.1
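To make the new behaviour concrete, here is a rough psql sketch of the second permutation in partition-key-update-1.spec above (same foo table as in that spec's setup; prompts and blocking indication abbreviated), showing the error the blocked DELETE now raises once session 1 commits:
----------- session 1 -----------
begin;
update foo set a = 2 where a = 1;   -- row moves from partition foo1 to foo2
----------- session 2 -----------
begin;
delete from foo where a = 1;
….. blocks waiting for session 1 …..
----------- session 1 -----------
commit;
----------- session 2 -----------
ERROR:  tuple to be deleted was already moved to another partition due to concurrent update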
On Thu, Apr 5, 2018 at 7:14 AM, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2018-04-02 11:26:38 -0400, Robert Haas wrote:
On Wed, Mar 28, 2018 at 2:12 PM, Andres Freund <andres@anarazel.de> wrote:
[....]
I've attached a noticeably editorialized patch:
- I'm uncomfortable with the "moved" information not being crash-safe /
  replicated. Thus I added a new flag to preserve it, and removed the
  masking of the moved bit in the ctid from heap_mask().
- renamed macros to not mention valid / invalid block numbers, but rather
  HeapTupleHeaderSetMovedPartitions / HeapTupleHeaderIndicatesMovedPartitions
  and
  ItemPointerSetMovedPartitions / ItemPointerIndicatesMovedPartitions
I'm not wedded to these names, but I'll be adamant that they're not
talking about invalid block numbers. Makes code harder to understand
imo.
These names are much better than before, thanks.
One concern -- instead of xxxMovedPartitions, can we have
xxxPartitionChanged or xxxChangedPartition?
xxxMovedPartitions reads (at least to me) as if partitions themselves are
moved. In other databases there is a maintenance command to move a partition
from one tablespace to another; the current naming is fine as long as we
don't support the same, but if we do, these names will be confusing.
Comments/thoughts?
Regards,
Amul
On Fri, Apr 6, 2018 at 12:50 PM, amul sul <sulamul@gmail.com> wrote:
Updated patch attached.
+ if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition
due to concurrent update")));
As suggested by Andres, I think you should change the error code to a
serialization failure, i.e. ERRCODE_T_R_SERIALIZATION_FAILURE.
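As a side note on what that buys clients: ERRCODE_T_R_SERIALIZATION_FAILURE is reported as SQLSTATE 40001 (condition name serialization_failure), which existing retry-on-serialization-failure logic already recognizes, whereas ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE maps to 55000. A hypothetical psql sketch of how the difference would show up, assuming the behaviour of the v10 patch later in this thread (output abbreviated, LOCATION line omitted):
\set VERBOSITY verbose
delete from foo where a = 1;
-- blocks behind the concurrent partition-key update, then after the other session commits:
ERROR:  40001: tuple to be deleted was already moved to another partition due to concurrent update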
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Apr 6, 2018 at 1:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Apr 6, 2018 at 12:50 PM, amul sul <sulamul@gmail.com> wrote:
Updated patch attached.
+ if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
As suggested by Andres, I think you should change the error code to a
serialization failure, i.e. ERRCODE_T_R_SERIALIZATION_FAILURE.
Thanks for the reminder -- fixed in the attached version.
Regards,
Amul
Attachments:
v10-0001-Raise-error-when-affecting-tuple-moved-int.patch (application/octet-stream)
From 6e1946314082778fb2a6534d3beec3a0db64a089 Mon Sep 17 00:00:00 2001
From: Amul Sul <sulamul@gmail.com>
Date: Fri, 6 Apr 2018 09:41:28 +0530
Subject: [PATCH] Raise error when affecting tuple moved into different
partition.
== CHANGES ==
v10:
1. Replaced ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE with
ERRCODE_T_R_SERIALIZATION_FAILURE as per Andres' suggestion[1]
v9:
It's a rebased version of Andres Freund's v8 patch[1] with the
following additional changes:
1. Error message changes in ExecDelete as per Amit Kapila's
suggestion[2]
2. Combined isolation test specs 1/2 and 3 into spec 1
3. Argument changing_part of heap_delete renamed to changingPart to be
consistent with ExecDelete
== REF ==
1] https://postgr.es/m/20180405014439.fbezvbjrmcw64vjc@alap3.anarazel.de
2] https://postgr.es/m/CAAJ_b97H1hecfogVyLUZoCr_EXeTOWg5%2B2N-FUyJdcp48yXv9g%40mail.gmail.com
---
src/backend/access/heap/heapam.c | 39 +++++-
src/backend/access/heap/pruneheap.c | 6 +
src/backend/access/heap/rewriteheap.c | 1 +
src/backend/commands/trigger.c | 5 +
src/backend/executor/execMain.c | 7 +-
src/backend/executor/execMerge.c | 3 +-
src/backend/executor/execReplication.c | 22 +++-
src/backend/executor/nodeLockRows.c | 5 +
src/backend/executor/nodeModifyTable.c | 27 +++-
src/include/access/heapam.h | 2 +-
src/include/access/heapam_xlog.h | 1 +
src/include/access/htup_details.h | 12 +-
src/include/executor/nodeModifyTable.h | 3 +-
src/include/storage/itemptr.h | 16 +++
src/test/isolation/expected/merge-update.out | 25 ++++
.../isolation/expected/partition-key-update-1.out | 66 ++++++++++
.../isolation/expected/partition-key-update-2.out | 29 +++++
.../isolation/expected/partition-key-update-3.out | 139 +++++++++++++++++++++
src/test/isolation/isolation_schedule | 3 +
src/test/isolation/specs/merge-update.spec | 3 +-
.../isolation/specs/partition-key-update-1.spec | 78 ++++++++++++
.../isolation/specs/partition-key-update-2.spec | 45 +++++++
.../isolation/specs/partition-key-update-3.spec | 44 +++++++
23 files changed, 558 insertions(+), 23 deletions(-)
create mode 100644 src/test/isolation/expected/partition-key-update-1.out
create mode 100644 src/test/isolation/expected/partition-key-update-2.out
create mode 100644 src/test/isolation/expected/partition-key-update-3.out
create mode 100644 src/test/isolation/specs/partition-key-update-1.spec
create mode 100644 src/test/isolation/specs/partition-key-update-2.spec
create mode 100644 src/test/isolation/specs/partition-key-update-3.spec
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f96567f5d5..28776dbff3 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2308,6 +2308,7 @@ heap_get_latest_tid(Relation relation,
*/
if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
+ HeapTupleHeaderIndicatesMovedPartitions(tp.t_data) ||
ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
{
UnlockReleaseBuffer(buffer);
@@ -3041,6 +3042,8 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
* crosscheck - if not InvalidSnapshot, also check tuple against this
* wait - true if should wait for any conflicting update to commit/abort
* hufd - output parameter, filled in failure cases (see below)
+ * changingPart - true iff the tuple is being moved to another partition
+ * table due to an update of the partition key. Otherwise, false.
*
* Normal, successful return value is HeapTupleMayBeUpdated, which
* actually means we did delete it. Failure return codes are
@@ -3056,7 +3059,7 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
HTSU_Result
heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd)
+ HeapUpdateFailureData *hufd, bool changingPart)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
@@ -3325,6 +3328,10 @@ l1:
/* Make sure there is no forward chain link in t_ctid */
tp.t_data->t_ctid = tp.t_self;
+ /* Signal that this is actually a move into another partition */
+ if (changingPart)
+ HeapTupleHeaderSetMovedPartitions(tp.t_data);
+
MarkBufferDirty(buffer);
/*
@@ -3342,7 +3349,11 @@ l1:
if (RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, &tp);
- xlrec.flags = all_visible_cleared ? XLH_DELETE_ALL_VISIBLE_CLEARED : 0;
+ xlrec.flags = 0;
+ if (all_visible_cleared)
+ xlrec.flags |= XLH_DELETE_ALL_VISIBLE_CLEARED;
+ if (changingPart)
+ xlrec.flags |= XLH_DELETE_IS_PARTITION_MOVE;
xlrec.infobits_set = compute_infobits(tp.t_data->t_infomask,
tp.t_data->t_infomask2);
xlrec.offnum = ItemPointerGetOffsetNumber(&tp.t_self);
@@ -3450,7 +3461,7 @@ simple_heap_delete(Relation relation, ItemPointer tid)
result = heap_delete(relation, tid,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
- &hufd);
+ &hufd, false /* changingPart */);
switch (result)
{
case HeapTupleSelfUpdated:
@@ -6051,6 +6062,7 @@ l4:
next:
/* if we find the end of update chain, we're done. */
if (mytup.t_data->t_infomask & HEAP_XMAX_INVALID ||
+ HeapTupleHeaderIndicatesMovedPartitions(mytup.t_data) ||
ItemPointerEquals(&mytup.t_self, &mytup.t_data->t_ctid) ||
HeapTupleHeaderIsOnlyLocked(mytup.t_data))
{
@@ -6102,7 +6114,12 @@ static HTSU_Result
heap_lock_updated_tuple(Relation rel, HeapTuple tuple, ItemPointer ctid,
TransactionId xid, LockTupleMode mode)
{
- if (!ItemPointerEquals(&tuple->t_self, ctid))
+ /*
+ * If the tuple has not been updated, or has moved into another partition
+ * (effectively a delete) stop here.
+ */
+ if (!HeapTupleHeaderIndicatesMovedPartitions(tuple->t_data) &&
+ !ItemPointerEquals(&tuple->t_self, ctid))
{
/*
* If this is the first possibly-multixact-able operation in the
@@ -8495,8 +8512,11 @@ heap_xlog_delete(XLogReaderState *record)
if (xlrec->flags & XLH_DELETE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
- /* Make sure there is no forward chain link in t_ctid */
- htup->t_ctid = target_tid;
+ /* Make sure t_ctid is set correctly */
+ if (xlrec->flags & XLH_DELETE_IS_PARTITION_MOVE)
+ HeapTupleHeaderSetMovedPartitions(htup);
+ else
+ htup->t_ctid = target_tid;
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
@@ -9417,6 +9437,13 @@ heap_mask(char *pagedata, BlockNumber blkno)
*/
if (HeapTupleHeaderIsSpeculative(page_htup))
ItemPointerSet(&page_htup->t_ctid, blkno, off);
+
+ /*
+ * NB: Not ignoring ctid changes due to the tuple having moved
+ * (i.e. HeapTupleHeaderIndicatesMovedPartitions), because that's
+ * important information that needs to be in-sync between primary
+ * and standby, and thus is WAL logged.
+ */
}
/*
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index f67d7d15df..c2f5343dac 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -552,6 +552,9 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
if (!HeapTupleHeaderIsHotUpdated(htup))
break;
+ /* HOT implies it can't have moved to different partition */
+ Assert(!HeapTupleHeaderIndicatesMovedPartitions(htup));
+
/*
* Advance to next chain member.
*/
@@ -823,6 +826,9 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
if (!HeapTupleHeaderIsHotUpdated(htup))
break;
+ /* HOT implies it can't have moved to different partition */
+ Assert(!HeapTupleHeaderIndicatesMovedPartitions(htup));
+
nextoffnum = ItemPointerGetOffsetNumber(&htup->t_ctid);
priorXmax = HeapTupleHeaderGetUpdateXid(htup);
}
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 7d466c2588..8d3c861a33 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -424,6 +424,7 @@ rewrite_heap_tuple(RewriteState state,
*/
if (!((old_tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(old_tuple->t_data)) &&
+ !HeapTupleHeaderIndicatesMovedPartitions(old_tuple->t_data) &&
!(ItemPointerEquals(&(old_tuple->t_self),
&(old_tuple->t_data->t_ctid))))
{
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index a189356cad..a08f46ecce 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -3314,6 +3314,11 @@ ltrmark:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
if (!ItemPointerEquals(&hufd.ctid, &tuple.t_self))
{
/* it was updated, so look at the updated version */
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index e4d9b0b3f8..842b294699 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -2739,6 +2739,10 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
/* Should not encounter speculative tuple on recheck */
Assert(!HeapTupleHeaderIsSpeculative(tuple.t_data));
@@ -2807,7 +2811,8 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
* As above, it should be safe to examine xmax and t_ctid without the
* buffer content lock, because they can't be changing.
*/
- if (ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
+ if (HeapTupleHeaderIndicatesMovedPartitions(tuple.t_data) ||
+ ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
{
/* deleted, so forget about it */
ReleaseBuffer(buffer);
diff --git a/src/backend/executor/execMerge.c b/src/backend/executor/execMerge.c
index d39ddd3034..d75d7e5ab2 100644
--- a/src/backend/executor/execMerge.c
+++ b/src/backend/executor/execMerge.c
@@ -324,7 +324,8 @@ lmerge_matched:;
slot = ExecDelete(mtstate, tupleid, NULL,
slot, epqstate, estate,
&tuple_deleted, false, &hufd, action,
- mtstate->canSetTag);
+ mtstate->canSetTag,
+ false /* changingPart */);
break;
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 971f92a938..c063d2f63b 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -191,9 +191,14 @@ retry:
break;
case HeapTupleUpdated:
/* XXX: Improve handling here */
- ereport(LOG,
- (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
- errmsg("concurrent update, retrying")));
+ if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
+ ereport(LOG,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update, retrying")));
+ else
+ ereport(LOG,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("concurrent update, retrying")));
goto retry;
case HeapTupleInvisible:
elog(ERROR, "attempted to lock invisible tuple");
@@ -349,9 +354,14 @@ retry:
break;
case HeapTupleUpdated:
/* XXX: Improve handling here */
- ereport(LOG,
- (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
- errmsg("concurrent update, retrying")));
+ if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
+ ereport(LOG,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update, retrying")));
+ else
+ ereport(LOG,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("concurrent update, retrying")));
goto retry;
case HeapTupleInvisible:
elog(ERROR, "attempted to lock invisible tuple");
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index b39ccf7dc1..ace126cbf2 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -218,6 +218,11 @@ lnext:
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
+
if (ItemPointerEquals(&hufd.ctid, &tuple.t_self))
{
/* Tuple was deleted, so don't return it */
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 0ebf37bd24..0b2c7826bf 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -645,7 +645,8 @@ ExecDelete(ModifyTableState *mtstate,
bool processReturning,
HeapUpdateFailureData *hufdp,
MergeActionState *actionState,
- bool canSetTag)
+ bool canSetTag,
+ bool changingPart)
{
ResultRelInfo *resultRelInfo;
Relation resultRelationDesc;
@@ -744,7 +745,8 @@ ldelete:;
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
- &hufd);
+ &hufd,
+ changingPart);
/*
* Copy the necessary information, if the caller has asked for it. We
@@ -803,6 +805,10 @@ ldelete:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("tuple to be deleted was already moved to another partition due to concurrent update")));
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
@@ -1157,7 +1163,7 @@ lreplace:;
*/
ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate,
estate, &tuple_deleted, false, hufdp, NULL,
- false);
+ false /* canSetTag */, true /* changingPart */);
/*
* For some reason if DELETE didn't happen (e.g. trigger prevented
@@ -1333,6 +1339,10 @@ lreplace:;
ereport(ERROR,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("tuple to be updated was already moved to another partition due to concurrent update")));
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
@@ -1522,6 +1532,14 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
errmsg("could not serialize access due to concurrent update")));
+ /*
+ * As long as we don't support INSERT ON CONFLICT DO UPDATE for a
+ * partitioned table, we shouldn't reach a case where the tuple to
+ * be locked has been moved to another partition due to a concurrent
+ * update of the partition key.
+ */
+ Assert(!ItemPointerIndicatesMovedPartitions(&hufd.ctid));
+
/*
* Tell caller to try again from the very start.
*
@@ -2264,7 +2282,8 @@ ExecModifyTable(PlanState *pstate)
case CMD_DELETE:
slot = ExecDelete(node, tupleid, oldtuple, planSlot,
&node->mt_epqstate, estate,
- NULL, true, NULL, NULL, node->canSetTag);
+ NULL, true, NULL, NULL, node->canSetTag,
+ false /* changingPart */);
break;
default:
elog(ERROR, "unknown operation");
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 608f50b061..7d756f20b0 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -167,7 +167,7 @@ extern void heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
CommandId cid, int options, BulkInsertState bistate);
extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd);
+ HeapUpdateFailureData *hufd, bool changingPart);
extern void heap_finish_speculative(Relation relation, HeapTuple tuple);
extern void heap_abort_speculative(Relation relation, HeapTuple tuple);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 700e25c36a..3c9214da6f 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -93,6 +93,7 @@
#define XLH_DELETE_CONTAINS_OLD_TUPLE (1<<1)
#define XLH_DELETE_CONTAINS_OLD_KEY (1<<2)
#define XLH_DELETE_IS_SUPER (1<<3)
+#define XLH_DELETE_IS_PARTITION_MOVE (1<<4)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_DELETE_CONTAINS_OLD \
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index cebaea097d..cf56d4ace4 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -83,8 +83,10 @@
*
* A word about t_ctid: whenever a new tuple is stored on disk, its t_ctid
* is initialized with its own TID (location). If the tuple is ever updated,
- * its t_ctid is changed to point to the replacement version of the tuple.
- * Thus, a tuple is the latest version of its row iff XMAX is invalid or
+ * its t_ctid is changed to point to the replacement version of the tuple or
+ * the block number (ip_blkid) is invalidated if the tuple is moved from one
+ * partition to another partition relation due to an update of the partition
+ * key. Thus, a tuple is the latest version of its row iff XMAX is invalid or
* t_ctid points to itself (in which case, if XMAX is valid, the tuple is
* either locked or deleted). One can follow the chain of t_ctid links
* to find the newest version of the row. Beware however that VACUUM might
@@ -445,6 +447,12 @@ do { \
ItemPointerSet(&(tup)->t_ctid, token, SpecTokenOffsetNumber) \
)
+#define HeapTupleHeaderSetMovedPartitions(tup) \
+ ItemPointerSetMovedPartitions(&(tup)->t_ctid)
+
+#define HeapTupleHeaderIndicatesMovedPartitions(tup) \
+ ItemPointerIndicatesMovedPartitions(&tup->t_ctid)
+
#define HeapTupleHeaderGetDatumLength(tup) \
VARSIZE(tup)
diff --git a/src/include/executor/nodeModifyTable.h b/src/include/executor/nodeModifyTable.h
index 94fd60c38c..7e9ab3cb6b 100644
--- a/src/include/executor/nodeModifyTable.h
+++ b/src/include/executor/nodeModifyTable.h
@@ -27,7 +27,8 @@ extern TupleTableSlot *ExecDelete(ModifyTableState *mtstate,
ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *planSlot,
EPQState *epqstate, EState *estate, bool *tupleDeleted,
bool processReturning, HeapUpdateFailureData *hufdp,
- MergeActionState *actionState, bool canSetTag);
+ MergeActionState *actionState, bool canSetTag,
+ bool changingPart);
extern TupleTableSlot *ExecUpdate(ModifyTableState *mtstate,
ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
TupleTableSlot *planSlot, EPQState *epqstate, EState *estate,
diff --git a/src/include/storage/itemptr.h b/src/include/storage/itemptr.h
index 6c9ed3696b..626c98f969 100644
--- a/src/include/storage/itemptr.h
+++ b/src/include/storage/itemptr.h
@@ -154,6 +154,22 @@ typedef ItemPointerData *ItemPointer;
(pointer)->ip_posid = InvalidOffsetNumber \
)
+/*
+ * ItemPointerIndicatesMovedPartitions
+ * True iff the block number indicates the tuple has moved to another
+ * partition.
+ */
+#define ItemPointerIndicatesMovedPartitions(pointer) \
+ !BlockNumberIsValid(ItemPointerGetBlockNumberNoCheck(pointer))
+
+/*
+ * ItemPointerSetMovedPartitions
+ * Indicate that the item referenced by the itempointer has moved into a
+ * different partition.
+ */
+#define ItemPointerSetMovedPartitions(pointer) \
+ ItemPointerSetBlockNumber((pointer), InvalidBlockNumber)
+
/* ----------------
* externs
* ----------------
diff --git a/src/test/isolation/expected/merge-update.out b/src/test/isolation/expected/merge-update.out
index 60ae42ebd0..00069a3e45 100644
--- a/src/test/isolation/expected/merge-update.out
+++ b/src/test/isolation/expected/merge-update.out
@@ -204,6 +204,31 @@ step pa_merge2a:
<waiting ...>
step c1: COMMIT;
step pa_merge2a: <... completed>
+error in steps c1 pa_merge2a: ERROR: tuple to be deleted was already moved to another partition due to concurrent update
+step pa_select2: SELECT * FROM pa_target;
+ERROR: current transaction is aborted, commands ignored until end of transaction block
+step c2: COMMIT;
+
+starting permutation: pa_merge2 c1 pa_merge2a pa_select2 c2
+step pa_merge2:
+ MERGE INTO pa_target t
+ USING (SELECT 1 as key, 'pa_merge1' as val) s
+ ON s.key = t.key
+ WHEN NOT MATCHED THEN
+ INSERT VALUES (s.key, s.val)
+ WHEN MATCHED THEN
+ UPDATE set key = t.key + 1, val = t.val || ' updated by ' || s.val;
+
+step c1: COMMIT;
+step pa_merge2a:
+ MERGE INTO pa_target t
+ USING (SELECT 1 as key, 'pa_merge2a' as val) s
+ ON s.key = t.key
+ WHEN NOT MATCHED THEN
+ INSERT VALUES (s.key, s.val)
+ WHEN MATCHED THEN
+ UPDATE set key = t.key + 1, val = t.val || ' updated by ' || s.val;
+
step pa_select2: SELECT * FROM pa_target;
key val
diff --git a/src/test/isolation/expected/partition-key-update-1.out b/src/test/isolation/expected/partition-key-update-1.out
new file mode 100644
index 0000000000..af92fbe1f7
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-1.out
@@ -0,0 +1,66 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1b s2b s1u s1c s2d s2c
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s1c: COMMIT;
+step s2d: DELETE FROM foo WHERE a=1;
+step s2c: COMMIT;
+
+starting permutation: s1b s2b s1u s2d s1c s2c
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s1u: UPDATE foo SET a=2 WHERE a=1;
+step s2d: DELETE FROM foo WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2d: <... completed>
+error in steps s1c s2d: ERROR: tuple to be deleted was already moved to another partition due to concurrent update
+step s2c: COMMIT;
+
+starting permutation: s1b s2b s2d s1u s2c s1c
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2d: DELETE FROM foo WHERE a=1;
+step s1u: UPDATE foo SET a=2 WHERE a=1; <waiting ...>
+step s2c: COMMIT;
+step s1u: <... completed>
+step s1c: COMMIT;
+
+starting permutation: s1b s2b s1u2 s1c s2u2 s2c
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s1u2: UPDATE footrg SET b='EFG' WHERE a=1;
+step s1c: COMMIT;
+step s2u2: UPDATE footrg SET b='XYZ' WHERE a=1;
+step s2c: COMMIT;
+
+starting permutation: s1b s2b s1u2 s2u2 s1c s2c
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s1u2: UPDATE footrg SET b='EFG' WHERE a=1;
+step s2u2: UPDATE footrg SET b='XYZ' WHERE a=1; <waiting ...>
+step s1c: COMMIT;
+step s2u2: <... completed>
+error in steps s1c s2u2: ERROR: tuple to be locked was already moved to another partition due to concurrent update
+step s2c: COMMIT;
+
+starting permutation: s1b s2b s2u2 s1u2 s2c s1c
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2u2: UPDATE footrg SET b='XYZ' WHERE a=1;
+step s1u2: UPDATE footrg SET b='EFG' WHERE a=1; <waiting ...>
+step s2c: COMMIT;
+step s1u2: <... completed>
+error in steps s2c s1u2: ERROR: tuple to be locked was already moved to another partition due to concurrent update
+step s1c: COMMIT;
+
+starting permutation: s1b s2b s1u3 s2i s1c s2c
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s1u3: UPDATE foo_rang_parted SET a=11 WHERE a=7 AND b = 'ABC';
+step s2i: INSERT INTO bar VALUES(7); <waiting ...>
+step s1c: COMMIT;
+step s2i: <... completed>
+error in steps s1c s2i: ERROR: tuple to be locked was already moved to another partition due to concurrent update
+step s2c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-2.out b/src/test/isolation/expected/partition-key-update-2.out
new file mode 100644
index 0000000000..363de0d69c
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-2.out
@@ -0,0 +1,29 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s1u s2donothing s3donothing s1c s2c s3select s3c
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s3donothing: <... completed>
+step s2c: COMMIT;
+step s3select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+step s3c: COMMIT;
+
+starting permutation: s2donothing s1u s3donothing s1c s2c s3select s3c
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+step s2c: COMMIT;
+step s3select: SELECT * FROM foo ORDER BY a;
+a b
+
+2 initial tuple -> moved by session-1
+step s3c: COMMIT;
diff --git a/src/test/isolation/expected/partition-key-update-3.out b/src/test/isolation/expected/partition-key-update-3.out
new file mode 100644
index 0000000000..42dfe64ad3
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-3.out
@@ -0,0 +1,139 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s2beginrr s3beginrr s1u s2donothing s1c s2c s3donothing s3c s2select
+step s2beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s3beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s2c: COMMIT;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING;
+step s3c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2beginrr s3beginrr s1u s3donothing s1c s3c s2donothing s2c s2select
+step s2beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s3beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+error in steps s1c s3donothing: ERROR: could not serialize access due to concurrent update
+step s3c: COMMIT;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING;
+step s2c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2beginrr s3beginrr s1u s2donothing s3donothing s1c s2c s3c s2select
+step s2beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s3beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s3donothing: <... completed>
+error in steps s1c s2donothing s3donothing: ERROR: could not serialize access due to concurrent update
+step s2c: COMMIT;
+step s3c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2beginrr s3beginrr s1u s3donothing s2donothing s1c s3c s2c s2select
+step s2beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s3beginrr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+step s2donothing: <... completed>
+error in steps s1c s3donothing s2donothing: ERROR: could not serialize access due to concurrent update
+step s3c: COMMIT;
+step s2c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2begins s3begins s1u s2donothing s1c s2c s3donothing s3c s2select
+step s2begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s3begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s2c: COMMIT;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING;
+step s3c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2begins s3begins s1u s3donothing s1c s3c s2donothing s2c s2select
+step s2begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s3begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+error in steps s1c s3donothing: ERROR: could not serialize access due to concurrent update
+step s3c: COMMIT;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING;
+step s2c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2begins s3begins s1u s2donothing s3donothing s1c s2c s3c s2select
+step s2begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s3begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s2donothing: <... completed>
+step s3donothing: <... completed>
+error in steps s1c s2donothing s3donothing: ERROR: could not serialize access due to concurrent update
+step s2c: COMMIT;
+step s3c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
+
+starting permutation: s2begins s3begins s1u s3donothing s2donothing s1c s3c s2c s2select
+step s2begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s3begins: BEGIN ISOLATION LEVEL SERIALIZABLE;
+step s1u: UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1;
+step s3donothing: INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; <waiting ...>
+step s2donothing: INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; <waiting ...>
+step s1c: COMMIT;
+step s3donothing: <... completed>
+step s2donothing: <... completed>
+error in steps s1c s3donothing s2donothing: ERROR: could not serialize access due to concurrent update
+step s3c: COMMIT;
+step s2c: COMMIT;
+step s2select: SELECT * FROM foo ORDER BY a;
+a b
+
+1 session-2 donothing
+2 initial tuple -> moved by session-1
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 31900cb920..fdff58deb9 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -72,6 +72,9 @@ test: timeouts
test: vacuum-concurrent-drop
test: predicate-gist
test: predicate-gin
+test: partition-key-update-1
+test: partition-key-update-2
+test: partition-key-update-3
# The checksum_enable suite will enable checksums for the cluster so should
# not run before anything expecting the cluster to have checksums turned off
test: checksum_cancel
diff --git a/src/test/isolation/specs/merge-update.spec b/src/test/isolation/specs/merge-update.spec
index 64e849966e..625b477eb9 100644
--- a/src/test/isolation/specs/merge-update.spec
+++ b/src/test/isolation/specs/merge-update.spec
@@ -129,4 +129,5 @@ permutation "merge1" "merge2a" "a1" "select2" "c2"
permutation "merge1" "merge2b" "c1" "select2" "c2"
permutation "merge1" "merge2c" "c1" "select2" "c2"
permutation "pa_merge1" "pa_merge2a" "c1" "pa_select2" "c2"
-permutation "pa_merge2" "pa_merge2a" "c1" "pa_select2" "c2"
+permutation "pa_merge2" "pa_merge2a" "c1" "pa_select2" "c2" # fails
+permutation "pa_merge2" "c1" "pa_merge2a" "pa_select2" "c2" # succeeds
diff --git a/src/test/isolation/specs/partition-key-update-1.spec b/src/test/isolation/specs/partition-key-update-1.spec
new file mode 100644
index 0000000000..2cccffa7b4
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-1.spec
@@ -0,0 +1,78 @@
+# Throw an error to indicate that the targeted row has already been moved to
+# another partition, for the concurrent case where a session trying to
+# lock/update/delete a row that is locked by another session is affected by
+# that other session's concurrent update of the partition key, which moves
+# the row to another partition.
+
+setup
+{
+ --
+ -- Setup to test an error from ExecUpdate and ExecDelete.
+ --
+ CREATE TABLE foo (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'ABC');
+
+ --
+ -- Setup to test an error from GetTupleForTrigger
+ --
+ CREATE TABLE footrg (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE footrg1 PARTITION OF footrg FOR VALUES IN (1);
+ CREATE TABLE footrg2 PARTITION OF footrg FOR VALUES IN (2);
+ INSERT INTO footrg VALUES (1, 'ABC');
+ CREATE FUNCTION func_footrg_mod_a() RETURNS TRIGGER AS $$
+ BEGIN
+ NEW.a = 2; -- This is changing partition key column.
+ RETURN NEW;
+ END $$ LANGUAGE PLPGSQL;
+ CREATE TRIGGER footrg_mod_a BEFORE UPDATE ON footrg1
+ FOR EACH ROW EXECUTE PROCEDURE func_footrg_mod_a();
+
+ --
+ -- Setup to test an error from ExecLockRows
+ --
+ CREATE TABLE foo_rang_parted (a int, b text) PARTITION BY RANGE(a);
+ CREATE TABLE foo_rang_parted1 PARTITION OF foo_rang_parted FOR VALUES FROM (1) TO (10);
+ CREATE TABLE foo_rang_parted2 PARTITION OF foo_rang_parted FOR VALUES FROM (10) TO (20);
+ INSERT INTO foo_rang_parted VALUES(7, 'ABC');
+ CREATE UNIQUE INDEX foo_rang_parted1_a_unique ON foo_rang_parted1 (a);
+ CREATE TABLE bar (a int REFERENCES foo_rang_parted1(a));
+}
+
+teardown
+{
+ DROP TABLE foo;
+ DROP TRIGGER footrg_mod_a ON footrg1;
+ DROP FUNCTION func_footrg_mod_a();
+ DROP TABLE footrg;
+ DROP TABLE bar, foo_rang_parted;
+}
+
+session "s1"
+step "s1b" { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1u" { UPDATE foo SET a=2 WHERE a=1; }
+step "s1u2" { UPDATE footrg SET b='EFG' WHERE a=1; }
+step "s1u3" { UPDATE foo_rang_parted SET a=11 WHERE a=7 AND b = 'ABC'; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2b" { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s2u" { UPDATE foo SET b='EFG' WHERE a=1; }
+step "s2u2" { UPDATE footrg SET b='XYZ' WHERE a=1; }
+step "s2i" { INSERT INTO bar VALUES(7); }
+step "s2d" { DELETE FROM foo WHERE a=1; }
+step "s2c" { COMMIT; }
+
+# Concurrency error from ExecUpdate and ExecDelete.
+permutation "s1b" "s2b" "s1u" "s1c" "s2d" "s2c"
+permutation "s1b" "s2b" "s1u" "s2d" "s1c" "s2c"
+permutation "s1b" "s2b" "s2d" "s1u" "s2c" "s1c"
+
+# Concurrency error from GetTupleForTrigger
+permutation "s1b" "s2b" "s1u2" "s1c" "s2u2" "s2c"
+permutation "s1b" "s2b" "s1u2" "s2u2" "s1c" "s2c"
+permutation "s1b" "s2b" "s2u2" "s1u2" "s2c" "s1c"
+
+# Concurrency error from ExecLockRows
+permutation "s1b" "s2b" "s1u3" "s2i" "s1c" "s2c"
diff --git a/src/test/isolation/specs/partition-key-update-2.spec b/src/test/isolation/specs/partition-key-update-2.spec
new file mode 100644
index 0000000000..699e2e727f
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-2.spec
@@ -0,0 +1,45 @@
+# Concurrent update of a partition key and INSERT...ON CONFLICT DO NOTHING test
+#
+# This test tries to expose problems with the interaction between concurrent
+# sessions during an update of the partition key and INSERT...ON CONFLICT DO
+# NOTHING on a partitioned table.
+#
+# The convention here is that session 1 moves a row from one partition to
+# another due to an update of the partition key, session 2 always ends up
+# inserting, and session 3 always ends up doing nothing.
+#
+# Note: This test slightly resembles the insert-conflict-do-nothing test.
+
+setup
+{
+ CREATE TABLE foo (a int primary key, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'initial tuple');
+}
+
+teardown
+{
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1u" { UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s2donothing" { INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; }
+step "s2c" { COMMIT; }
+
+session "s3"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s3donothing" { INSERT INTO foo VALUES(2, 'session-3 donothing') ON CONFLICT DO NOTHING; }
+step "s3select" { SELECT * FROM foo ORDER BY a; }
+step "s3c" { COMMIT; }
+
+# Regular case where one session block-waits on another to determine if it
+# should proceed with an insert or do nothing.
+permutation "s1u" "s2donothing" "s3donothing" "s1c" "s2c" "s3select" "s3c"
+permutation "s2donothing" "s1u" "s3donothing" "s1c" "s2c" "s3select" "s3c"
diff --git a/src/test/isolation/specs/partition-key-update-3.spec b/src/test/isolation/specs/partition-key-update-3.spec
new file mode 100644
index 0000000000..a6efea1381
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-3.spec
@@ -0,0 +1,44 @@
+# Concurrent update of a partition key and INSERT...ON CONFLICT DO NOTHING
+# test on partitioned table with multiple rows in higher isolation levels.
+#
+# Note: This test resembles the insert-conflict-do-nothing-2 test.
+
+setup
+{
+ CREATE TABLE foo (a int primary key, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'initial tuple');
+}
+
+teardown
+{
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1u" { UPDATE foo SET a=2, b=b || ' -> moved by session-1' WHERE a=1; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2beginrr" { BEGIN ISOLATION LEVEL REPEATABLE READ; }
+step "s2begins" { BEGIN ISOLATION LEVEL SERIALIZABLE; }
+step "s2donothing" { INSERT INTO foo VALUES(1, 'session-2 donothing') ON CONFLICT DO NOTHING; }
+step "s2c" { COMMIT; }
+step "s2select" { SELECT * FROM foo ORDER BY a; }
+
+session "s3"
+step "s3beginrr" { BEGIN ISOLATION LEVEL REPEATABLE READ; }
+step "s3begins" { BEGIN ISOLATION LEVEL SERIALIZABLE; }
+step "s3donothing" { INSERT INTO foo VALUES(2, 'session-3 donothing'), (2, 'session-3 donothing2') ON CONFLICT DO NOTHING; }
+step "s3c" { COMMIT; }
+
+permutation "s2beginrr" "s3beginrr" "s1u" "s2donothing" "s1c" "s2c" "s3donothing" "s3c" "s2select"
+permutation "s2beginrr" "s3beginrr" "s1u" "s3donothing" "s1c" "s3c" "s2donothing" "s2c" "s2select"
+permutation "s2beginrr" "s3beginrr" "s1u" "s2donothing" "s3donothing" "s1c" "s2c" "s3c" "s2select"
+permutation "s2beginrr" "s3beginrr" "s1u" "s3donothing" "s2donothing" "s1c" "s3c" "s2c" "s2select"
+permutation "s2begins" "s3begins" "s1u" "s2donothing" "s1c" "s2c" "s3donothing" "s3c" "s2select"
+permutation "s2begins" "s3begins" "s1u" "s3donothing" "s1c" "s3c" "s2donothing" "s2c" "s2select"
+permutation "s2begins" "s3begins" "s1u" "s2donothing" "s3donothing" "s1c" "s2c" "s3c" "s2select"
+permutation "s2begins" "s3begins" "s1u" "s3donothing" "s2donothing" "s1c" "s3c" "s2c" "s2select"
--
2.14.1
Hi Tom, All,
On 2018-04-06 14:19:02 +0530, amul sul wrote:
Thanks for the reminder -- fixed in the attached version.
Tom, this seems to be the best approach for fixing the visibility issues
around this. I've spent a good chunk of time looking at corruption
issues like the ones you feared (see [1]) and I'm not particularly
concerned. I'm currently planning to go ahead with this, do you want to
"veto" that (informally, not formally)?
I'll go through this again tomorrow morning.
[1]: /messages/by-id/20180405014439.fbezvbjrmcw64vjc@alap3.anarazel.de
v9:
It's the rebased version of Andres Freund's patch v8[1] with the
following additional changes:
3. Argument changing_part of heap_delete renamed to ChangingPart to be
consistent with ExecDelete

FWIW, I'd left it as it was before because the two functions have a bit
different coding style, and the capitalization seemed more fitting in
the surrounding context.
+test: partition-key-update-1
+test: partition-key-update-2
+test: partition-key-update-3
Can you give these more descriptive names please (or further combine them)?
Greetings,
Andres Freund
On Sat, Apr 7, 2018 at 9:08 AM, Andres Freund <andres@anarazel.de> wrote:
+test: partition-key-update-1
+test: partition-key-update-2
+test: partition-key-update-3

Can you give these more descriptive names please (or further combine them)?
As I explained above, further combining might not be a good option. As for
more descriptive names, I have the following suggestions, but I am afraid
of the length of the test case names:
+test: concurrent-partition-key-update.out
This test does the serialization failure check.
+test: concurrent-partition-key-update-and-insert-conflict-do-nothing-1
+test: concurrent-partition-key-update-and-insert-conflict-do-nothing-2
Both are testing partition key update behaviour with the insert on
conflict do nothing case.
Attached is a patch that does the renaming of these tests -- it needs to be
applied on top of the v10 patch[1].
Regards,
Amul
1] /messages/by-id/CAAJ_b94X5Y_zdTN=BGdZie+hM4p6qW70-XCJhFYaCUO0OfF=aQ@mail.gmail.com
Attachments:
v10-0002-Rename-isolation-test-name.patchapplication/octet-stream; name=v10-0002-Rename-isolation-test-name.patchDownload
From 04cf425811423384a3bc9806ad238109169520be Mon Sep 17 00:00:00 2001
From: Amul Sul <sulamul@gmail.com>
Date: Sat, 7 Apr 2018 20:06:13 +0530
Subject: [PATCH 2/2] Rename isolation test name
---
...rrent-partition-key-update-and-insert-conflict-do-nothing-1.out} | 0
...rrent-partition-key-update-and-insert-conflict-do-nothing-2.out} | 0
...ion-key-update-1.out => concurrent-partition-key-update.out.out} | 0
src/test/isolation/isolation_schedule | 6 +++---
...rent-partition-key-update-and-insert-conflict-do-nothing-1.spec} | 0
...rent-partition-key-update-and-insert-conflict-do-nothing-2.spec} | 0
...n-key-update-1.spec => concurrent-partition-key-update.out.spec} | 0
7 files changed, 3 insertions(+), 3 deletions(-)
rename src/test/isolation/expected/{partition-key-update-2.out => concurrent-partition-key-update-and-insert-conflict-do-nothing-1.out} (100%)
rename src/test/isolation/expected/{partition-key-update-3.out => concurrent-partition-key-update-and-insert-conflict-do-nothing-2.out} (100%)
rename src/test/isolation/expected/{partition-key-update-1.out => concurrent-partition-key-update.out.out} (100%)
rename src/test/isolation/specs/{partition-key-update-2.spec => concurrent-partition-key-update-and-insert-conflict-do-nothing-1.spec} (100%)
rename src/test/isolation/specs/{partition-key-update-3.spec => concurrent-partition-key-update-and-insert-conflict-do-nothing-2.spec} (100%)
rename src/test/isolation/specs/{partition-key-update-1.spec => concurrent-partition-key-update.out.spec} (100%)
diff --git a/src/test/isolation/expected/partition-key-update-2.out b/src/test/isolation/expected/concurrent-partition-key-update-and-insert-conflict-do-nothing-1.out
similarity index 100%
rename from src/test/isolation/expected/partition-key-update-2.out
rename to src/test/isolation/expected/concurrent-partition-key-update-and-insert-conflict-do-nothing-1.out
diff --git a/src/test/isolation/expected/partition-key-update-3.out b/src/test/isolation/expected/concurrent-partition-key-update-and-insert-conflict-do-nothing-2.out
similarity index 100%
rename from src/test/isolation/expected/partition-key-update-3.out
rename to src/test/isolation/expected/concurrent-partition-key-update-and-insert-conflict-do-nothing-2.out
diff --git a/src/test/isolation/expected/partition-key-update-1.out b/src/test/isolation/expected/concurrent-partition-key-update.out.out
similarity index 100%
rename from src/test/isolation/expected/partition-key-update-1.out
rename to src/test/isolation/expected/concurrent-partition-key-update.out.out
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index fdff58deb9..7a36cb4591 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -72,9 +72,9 @@ test: timeouts
test: vacuum-concurrent-drop
test: predicate-gist
test: predicate-gin
-test: partition-key-update-1
-test: partition-key-update-2
-test: partition-key-update-3
+test: concurrent-partition-key-update.out
+test: concurrent-partition-key-update-and-insert-conflict-do-nothing-1
+test: concurrent-partition-key-update-and-insert-conflict-do-nothing-2
# The checksum_enable suite will enable checksums for the cluster so should
# not run before anything expecting the cluster to have checksums turned off
test: checksum_cancel
diff --git a/src/test/isolation/specs/partition-key-update-2.spec b/src/test/isolation/specs/concurrent-partition-key-update-and-insert-conflict-do-nothing-1.spec
similarity index 100%
rename from src/test/isolation/specs/partition-key-update-2.spec
rename to src/test/isolation/specs/concurrent-partition-key-update-and-insert-conflict-do-nothing-1.spec
diff --git a/src/test/isolation/specs/partition-key-update-3.spec b/src/test/isolation/specs/concurrent-partition-key-update-and-insert-conflict-do-nothing-2.spec
similarity index 100%
rename from src/test/isolation/specs/partition-key-update-3.spec
rename to src/test/isolation/specs/concurrent-partition-key-update-and-insert-conflict-do-nothing-2.spec
diff --git a/src/test/isolation/specs/partition-key-update-1.spec b/src/test/isolation/specs/concurrent-partition-key-update.out.spec
similarity index 100%
rename from src/test/isolation/specs/partition-key-update-1.spec
rename to src/test/isolation/specs/concurrent-partition-key-update.out.spec
--
2.14.1
amul sul wrote:
On Sat, Apr 7, 2018 at 9:08 AM, Andres Freund <andres@anarazel.de> wrote:
+test: partition-key-update-1
+test: partition-key-update-2
+test: partition-key-update-3

Can you give these more descriptive names please (or further combine them)?

As I explained above, further combining might not be a good option. As for
more descriptive names, I have the following suggestions, but I am afraid
of the length of the test case names:

+test: concurrent-partition-key-update.out

This test does the serialization failure check.

+test: concurrent-partition-key-update-and-insert-conflict-do-nothing-1
+test: concurrent-partition-key-update-and-insert-conflict-do-nothing-2
Yikes. I'd rather have the original name, and each test's purpose
stated in a comment in the spec file itself.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-04-06 09:41:07 +0530, Amit Kapila wrote:
Won't the same question apply to the similar usage in
EvalPlanQualFetch and heap_lock_updated_tuple_rec?

I don't think so?

In EvalPlanQualFetch, we consider such a tuple to be deleted and will
silently miss/skip it, which seems contradictory to the places where we
have detected such a situation and raised an error:

if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("tuple to be locked was already moved to another partition due to concurrent update")));

I was talking about the case when the tuple version is not visible, aka
the below code:

I think if we return an error in EvalPlanQualFetch at the place
mentioned above, the behavior will be sane.
I think you're right. I've adapted the code, added a bunch of tests.
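Roughly, the check being discussed ends up looking something like this inside
EvalPlanQualFetch's handling of a concurrently updated tuple -- a minimal
sketch only, reusing the hufd naming, the ItemPointerIndicatesMovedPartitions
macro, and the error messages quoted above; the exact placement and
surrounding code in the committed patch may differ:

    /* Sketch: inside EvalPlanQualFetch, when the tuple was concurrently updated */
    if (IsolationUsesXactSnapshot())
        ereport(ERROR,
                (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
                 errmsg("could not serialize access due to concurrent update")));

    /*
     * If the old version's ctid marks a cross-partition move rather than
     * pointing at a replacement tuple, raise an error instead of silently
     * treating the row as deleted.
     */
    if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("tuple to be locked was already moved to another partition due to concurrent update")));

    /* otherwise follow hufd.ctid to the next tuple version as before */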
Greetings,
Andres Freund
On 2018-04-07 20:13:36 +0530, amul sul wrote:
Attached is the patch does the renaming of this tests -- need to apply
to the top of v10 patch[1].
These indeed are a bit too long, so I went with the numbers. I've
pushed the patch now. Two changes:
- I've added one more error path to EvalPlanQualFetch, as suggested by
Amit. Added tests for that too.
- renamed '*_rang_*' table names in tests to range.
Thanks!
- Andres
On Sun, Apr 8, 2018 at 2:04 AM, Andres Freund <andres@anarazel.de> wrote:
Thanks for the review and commit -- I appreciate your and Amit Kapila's help
in pushing the patch to a committable stage. Thanks to all those who looked
into the patch and provided valuable input.
FWIW, here is the commit:
/messages/by-id/E1f4uWV-00057x-0a@gemulon.postgresql.org
Regards,
Amul