Concurrency bug in UPDATE of partition-key
(Mail subject changed; original thread : [1])
On 8 March 2018 at 11:57, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On 8 March 2018 at 09:15, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
CREATE TABLE pa_target (key integer, val text)
PARTITION BY LIST (key);
CREATE TABLE part1 PARTITION OF pa_target FOR VALUES IN (1);
CREATE TABLE part2 PARTITION OF pa_target FOR VALUES IN (2);
INSERT INTO pa_target VALUES (1, 'initial1');

session1:
BEGIN;
UPDATE pa_target SET val = val || ' updated by update1' WHERE key = 1;
UPDATE 1
postgres=# SELECT * FROM pa_target ;
key | val
-----+-----------------------------
1 | initial1 updated by update1
(1 row)

session2:
UPDATE pa_target SET val = val || ' updated by update2', key = key + 1 WHERE
key = 1
<blocks>

session1:
postgres=# COMMIT;
COMMIT

<session2 unblocks and completes its UPDATE>
postgres=# SELECT * FROM pa_target ;
key | val
-----+-----------------------------
2 | initial1 updated by update2
(1 row)

Ouch. The committed updates by session1 are overwritten by session2. This
clearly violates the rules that the rest of the system obeys and is not
acceptable IMHO.

Clearly, ExecUpdate() while moving rows between partitions is missing out on
re-constructing the to-be-updated tuple, based on the latest tuple in the
update chain. Instead, it's simply deleting the latest tuple and inserting a
new tuple in the new partition based on the old tuple. That's simply wrong.

You are right. This needs to be fixed. This is a different issue than
the particular one that is being worked upon in this thread, and both
these issues have different fixes.
As discussed above, there is a concurrency bug where during UPDATE of
a partition-key, another transaction T2 updates the same row.
Like you said, the tuple needs to be reconstructed when ExecDelete()
finds that the row has been updated by another transaction. We should
send back this information from ExecDelete() (I think tupleid
parameter gets updated in this case), and then in ExecUpdate() we
should goto lreplace, so that the row is fetched back similar to
how it happens when heap_update() knows that the tuple was updated.
The attached WIP patch is how the fix is turning out to be.
ExecDelete() passes back the epqslot returned by EvalPlanQual().
ExecUpdate() uses this re-fetched tuple to re-process the row, just
like it does when heap_update() returns HeapTupleUpdated.
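For reference, with such a fix one would expect session2 to re-apply its
changes on top of session1's committed tuple, so for the example above the
end state should be (an expectation sketched by hand, not output captured
from the patched server):

postgres=# SELECT * FROM pa_target ;
 key |                      val
-----+------------------------------------------------
   2 | initial1 updated by update1 updated by update2
(1 row)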
I need to write an isolation test for this fix.
[1]: /messages/by-id/CAJ3gD9d=wcbF-G00bYo4v5iWC69zm1UTSNxyRLpOgO9j4AtLWQ@mail.gmail.com
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
fix_concurrency_bug_WIP.patch
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index c32928d..b74781d 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -705,7 +705,10 @@ ExecInsert(ModifyTableState *mtstate,
* foreign table, tupleid is invalid; the FDW has to figure out
* which row to delete using data from the planSlot. oldtuple is
* passed to foreign table triggers; it is NULL when the foreign
- * table has no relevant triggers.
+ * table has no relevant triggers. When this DELETE is a part of
+ * an UPDATE of partition-key, then the slot returned by
+ * EvalPlanQual() is passed back using output parameter epqslot,
+ * so that the caller (ExecUpdate) can update this re-fetched row.
*
* Returns RETURNING result if any, otherwise NULL.
* ----------------------------------------------------------------
@@ -719,7 +722,8 @@ ExecDelete(ModifyTableState *mtstate,
EState *estate,
bool *tupleDeleted,
bool processReturning,
- bool canSetTag)
+ bool canSetTag,
+ TupleTableSlot **epqslot)
{
ResultRelInfo *resultRelInfo;
Relation resultRelationDesc;
@@ -858,19 +862,33 @@ ldelete:;
errmsg("could not serialize access due to concurrent update")));
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
- TupleTableSlot *epqslot;
+ TupleTableSlot *my_epqslot;
- epqslot = EvalPlanQual(estate,
+ my_epqslot = EvalPlanQual(estate,
epqstate,
resultRelationDesc,
resultRelInfo->ri_RangeTableIndex,
LockTupleExclusive,
&hufd.ctid,
hufd.xmax);
- if (!TupIsNull(epqslot))
+ if (!TupIsNull(my_epqslot))
{
*tupleid = hufd.ctid;
- goto ldelete;
+
+ /*
+ * If this is part of an UPDATE of partition-key, let
+ * the caller process this re-fetched row.
+ */
+ if (epqslot)
+ {
+ *epqslot = my_epqslot;
+ return NULL;
+ }
+ else
+ {
+ /* Normal DELETE: So delete the re-fetched row. */
+ goto ldelete;
+ }
}
}
/* tuple already deleted; nothing to do */
@@ -1141,6 +1159,7 @@ lreplace:;
{
bool tuple_deleted;
TupleTableSlot *ret_slot;
+ TupleTableSlot *epqslot = NULL;
PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
int map_index;
TupleConversionMap *tupconv_map;
@@ -1158,7 +1177,7 @@ lreplace:;
* processing. We want to return rows from INSERT.
*/
ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
- &tuple_deleted, false, false);
+ &tuple_deleted, false, false, &epqslot);
/*
* For some reason if DELETE didn't happen (e.g. trigger prevented
@@ -1181,7 +1200,23 @@ lreplace:;
* resurrect it.
*/
if (!tuple_deleted)
- return NULL;
+ {
+ /*
+ * epqslot will be typically NULL. But when ExecDelete() finds
+ * that another transaction has concurrently updated the same
+ * row, it re-fetches the row, skips the delete, and epqslot is
+ * set to the re-fetched tuple slot. In that case, we need to
+ * do all the checks again.
+ */
+ if (TupIsNull(epqslot))
+ return NULL;
+ else
+ {
+ slot = ExecFilterJunk(resultRelInfo->ri_junkFilter, epqslot);
+ tuple = ExecMaterializeSlot(slot);
+ goto lreplace;
+ }
+ }
/*
* Updates set the transition capture map only when a new subplan
@@ -2062,7 +2097,7 @@ ExecModifyTable(PlanState *pstate)
case CMD_DELETE:
slot = ExecDelete(node, tupleid, oldtuple, planSlot,
&node->mt_epqstate, estate,
- NULL, true, node->canSetTag);
+ NULL, true, node->canSetTag, NULL);
break;
default:
elog(ERROR, "unknown operation");
On Thu, Mar 8, 2018 at 4:55 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
(Mail subject changed; original thread : [1])
On 8 March 2018 at 11:57, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
Clearly, ExecUpdate() while moving rows between partitions is missing out on
re-constructing the to-be-updated tuple, based on the latest tuple in the
update chain. Instead, it's simply deleting the latest tuple and inserting a
new tuple in the new partition based on the old tuple. That's simply wrong.

You are right. This needs to be fixed. This is a different issue than
the particular one that is being worked upon in this thread, and both
these issues have different fixes.

As discussed above, there is a concurrency bug where during UPDATE of
a partition-key, another transaction T2 updates the same row.

We have one more behavior related to concurrency that would be
impacted due to row movement. Basically, now SELECT ... FOR KEY SHARE
will block an UPDATE that routes the tuple to a different partition
even if the update doesn't update the key (primary/unique key) column.
Example,
create table foo (a int2, b text) partition by list (a);
create table foo1 partition of foo for values IN (1);
create table foo2 partition of foo for values IN (2);
insert into foo values(1, 'Initial record');
Session-1
---------------
begin;
select * from foo where a=1 for key share;
Session-2
---------------
begin;
update foo set a=1, b= b|| '-> update 1' where a=1;
update foo set a=2, b= b|| '-> update 2' where a=1; --this will block
You can see that when the update moves the row to a different partition, it
will block, because internally it performs a delete. I think as we have
already documented that such an update is internally Delete+Insert,
this behavior is expected, but I think it is better if we update the
docs (Row-level locks section) as well.

In particular, I am talking about the below text in the Row-level locks
section [1]:
FOR KEY SHARE
Behaves similarly to FOR SHARE, except that the lock is weaker: SELECT
FOR UPDATE is blocked, but not SELECT FOR NO KEY UPDATE. A key-shared
lock blocks other transactions from performing DELETE or any UPDATE
that changes the key values, but not other UPDATE,
[1]: https://www.postgresql.org/docs/devel/static/explicit-locking.html#LOCKING-ROWS
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 8 March 2018 at 16:55, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
The attached WIP patch is how the fix is turning out to be.
ExecDelete() passes back the epqslot returned by EvalPlanQual().
ExecUpdate() uses this re-fetched tuple to re-process the row, just
like it does when heap_update() returns HeapTupleUpdated.

I need to write an isolation test for this fix.
Attached patch now has the isolation test included. The only
difference with the earlier WIP patch is about some cosmetic changes,
and some more comments.
[1] : /messages/by-id/CAJ3gD9d=wcbF-G00bYo4v5iWC69zm1UTSNxyRLpOgO9j4AtLWQ@mail.gmail.com
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
fix_concurrency_bug.patch
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index c32928d..75b7e10 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -705,7 +705,9 @@ ExecInsert(ModifyTableState *mtstate,
* foreign table, tupleid is invalid; the FDW has to figure out
* which row to delete using data from the planSlot. oldtuple is
* passed to foreign table triggers; it is NULL when the foreign
- * table has no relevant triggers.
+ * table has no relevant triggers. When this DELETE is a part of
+ * an UPDATE of partition-key, then the slot returned by
+ * EvalPlanQual() is passed back using output parameter epqslot.
*
* Returns RETURNING result if any, otherwise NULL.
* ----------------------------------------------------------------
@@ -718,6 +720,7 @@ ExecDelete(ModifyTableState *mtstate,
EPQState *epqstate,
EState *estate,
bool *tupleDeleted,
+ TupleTableSlot **epqslot,
bool processReturning,
bool canSetTag)
{
@@ -858,19 +861,37 @@ ldelete:;
errmsg("could not serialize access due to concurrent update")));
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
- TupleTableSlot *epqslot;
+ TupleTableSlot *my_epqslot;
- epqslot = EvalPlanQual(estate,
+ my_epqslot = EvalPlanQual(estate,
epqstate,
resultRelationDesc,
resultRelInfo->ri_RangeTableIndex,
LockTupleExclusive,
&hufd.ctid,
hufd.xmax);
- if (!TupIsNull(epqslot))
+ if (!TupIsNull(my_epqslot))
{
*tupleid = hufd.ctid;
- goto ldelete;
+
+ /*
+ * If this is part of an UPDATE of partition-key, the
+ * epq tuple will contain the changes from this
+ * transaction over and above the updates done by the
+ * other transaction. The caller should now use this
+ * tuple as its NEW tuple, rather than the earlier NEW
+ * tuple.
+ */
+ if (epqslot)
+ {
+ *epqslot = my_epqslot;
+ return NULL;
+ }
+ else
+ {
+ /* Normal DELETE: So delete the re-fetched row. */
+ goto ldelete;
+ }
}
}
/* tuple already deleted; nothing to do */
@@ -1141,6 +1162,7 @@ lreplace:;
{
bool tuple_deleted;
TupleTableSlot *ret_slot;
+ TupleTableSlot *epqslot = NULL;
PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
int map_index;
TupleConversionMap *tupconv_map;
@@ -1158,7 +1180,7 @@ lreplace:;
* processing. We want to return rows from INSERT.
*/
ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
- &tuple_deleted, false, false);
+ &tuple_deleted, &epqslot, false, false);
/*
* For some reason if DELETE didn't happen (e.g. trigger prevented
@@ -1181,7 +1203,23 @@ lreplace:;
* resurrect it.
*/
if (!tuple_deleted)
- return NULL;
+ {
+ /*
+ * epqslot will be typically NULL. But when ExecDelete() finds
+ * that another transaction has concurrently updated the same
+ * row, it re-fetches the row, skips the delete, and epqslot is
+ * set to the re-fetched tuple slot. In that case, we need to
+ * do all the checks again.
+ */
+ if (TupIsNull(epqslot))
+ return NULL;
+ else
+ {
+ slot = ExecFilterJunk(resultRelInfo->ri_junkFilter, epqslot);
+ tuple = ExecMaterializeSlot(slot);
+ goto lreplace;
+ }
+ }
/*
* Updates set the transition capture map only when a new subplan
@@ -2062,7 +2100,7 @@ ExecModifyTable(PlanState *pstate)
case CMD_DELETE:
slot = ExecDelete(node, tupleid, oldtuple, planSlot,
&node->mt_epqstate, estate,
- NULL, true, node->canSetTag);
+ NULL, NULL, true, node->canSetTag);
break;
default:
elog(ERROR, "unknown operation");
diff --git a/src/test/isolation/expected/partition-key-update-1.out b/src/test/isolation/expected/partition-key-update-1.out
new file mode 100644
index 0000000..79eb899
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-1.out
@@ -0,0 +1,12 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2u s1u s2c s1c s1s
+step s2u: UPDATE foo SET val = val || ' update2' WHERE key = 1;
+step s1u: UPDATE foo SET key = key + 1, val = val || ' update1' WHERE key=1; <waiting ...>
+step s2c: COMMIT;
+step s1u: <... completed>
+step s1c: COMMIT;
+step s1s: SELECT * FROM foo ORDER BY 1;
+key val
+
+2 initial update2 update1
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 74d7d59..fbe325c 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -66,3 +66,4 @@ test: async-notify
test: vacuum-reltuples
test: timeouts
test: vacuum-concurrent-drop
+test: partition-key-update-1
diff --git a/src/test/isolation/specs/partition-key-update-1.spec b/src/test/isolation/specs/partition-key-update-1.spec
new file mode 100644
index 0000000..da27a98
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-1.spec
@@ -0,0 +1,31 @@
+# Concurrency behaviour from ExecUpdate and ExecDelete.
+
+# Transaction A is moving a row into another partition, but is waiting for
+# another transaction B that is updating the original row. The row that ends up
+# in the new partition should contain the changes made by transaction B.
+
+setup
+{
+ CREATE TABLE foo (key integer, val text) PARTITION BY LIST (key);
+ CREATE TABLE foo_part1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo_part2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'initial');
+}
+
+teardown
+{
+ DROP TABLE foo;
+}
+
+session "s1"
+setup { BEGIN; }
+step "s1u" { UPDATE foo SET key = key + 1, val = val || ' update1' WHERE key=1; }
+step "s1c" { COMMIT; }
+step "s1s" { SELECT * FROM foo ORDER BY 1; }
+
+session "s2"
+setup { BEGIN; }
+step "s2u" { UPDATE foo SET val = val || ' update2' WHERE key = 1; }
+step "s2c" { COMMIT; }
+
+permutation "s2u" "s1u" "s2c" "s1c" "s1s"
Attached is a rebased patch version. Also included it in the upcoming
commitfest :
https://commitfest.postgresql.org/18/1660/
In the rebased version, the new test cases are added in the existing
isolation/specs/partition-key-update-1.spec test.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
fix_concurrency_bug_rebased.patch
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index c4c841c..63fd001 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -615,7 +615,9 @@ ExecInsert(ModifyTableState *mtstate,
* foreign table, tupleid is invalid; the FDW has to figure out
* which row to delete using data from the planSlot. oldtuple is
* passed to foreign table triggers; it is NULL when the foreign
- * table has no relevant triggers.
+ * table has no relevant triggers. When this DELETE is a part of
+ * an UPDATE of partition-key, then the slot returned by
+ * EvalPlanQual() is passed back using output parameter epqslot.
*
* Returns RETURNING result if any, otherwise NULL.
* ----------------------------------------------------------------
@@ -628,6 +630,7 @@ ExecDelete(ModifyTableState *mtstate,
EPQState *epqstate,
EState *estate,
bool *tupleDeleted,
+ TupleTableSlot **epqslot,
bool processReturning,
bool canSetTag,
bool changingPart)
@@ -775,19 +778,37 @@ ldelete:;
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
- TupleTableSlot *epqslot;
+ TupleTableSlot *my_epqslot;
- epqslot = EvalPlanQual(estate,
+ my_epqslot = EvalPlanQual(estate,
epqstate,
resultRelationDesc,
resultRelInfo->ri_RangeTableIndex,
LockTupleExclusive,
&hufd.ctid,
hufd.xmax);
- if (!TupIsNull(epqslot))
+ if (!TupIsNull(my_epqslot))
{
*tupleid = hufd.ctid;
- goto ldelete;
+
+ /*
+ * If this is part of an UPDATE of partition-key, the
+ * epq tuple will contain the changes from this
+ * transaction over and above the updates done by the
+ * other transaction. The caller should now use this
+ * tuple as its NEW tuple, rather than the earlier NEW
+ * tuple.
+ */
+ if (epqslot)
+ {
+ *epqslot = my_epqslot;
+ return NULL;
+ }
+ else
+ {
+ /* Normal DELETE: So delete the re-fetched row. */
+ goto ldelete;
+ }
}
}
/* tuple already deleted; nothing to do */
@@ -1058,6 +1079,7 @@ lreplace:;
{
bool tuple_deleted;
TupleTableSlot *ret_slot;
+ TupleTableSlot *epqslot = NULL;
PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
int map_index;
TupleConversionMap *tupconv_map;
@@ -1087,7 +1109,7 @@ lreplace:;
* processing. We want to return rows from INSERT.
*/
ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate,
- estate, &tuple_deleted, false,
+ estate, &tuple_deleted, &epqslot, false,
false /* canSetTag */ , true /* changingPart */ );
/*
@@ -1111,7 +1133,23 @@ lreplace:;
* resurrect it.
*/
if (!tuple_deleted)
- return NULL;
+ {
+ /*
+ * epqslot will be typically NULL. But when ExecDelete() finds
+ * that another transaction has concurrently updated the same
+ * row, it re-fetches the row, skips the delete, and epqslot is
+ * set to the re-fetched tuple slot. In that case, we need to
+ * do all the checks again.
+ */
+ if (TupIsNull(epqslot))
+ return NULL;
+ else
+ {
+ slot = ExecFilterJunk(resultRelInfo->ri_junkFilter, epqslot);
+ tuple = ExecMaterializeSlot(slot);
+ goto lreplace;
+ }
+ }
/*
* Updates set the transition capture map only when a new subplan
@@ -2145,7 +2183,7 @@ ExecModifyTable(PlanState *pstate)
case CMD_DELETE:
slot = ExecDelete(node, tupleid, oldtuple, planSlot,
&node->mt_epqstate, estate,
- NULL, true, node->canSetTag,
+ NULL, NULL, true, node->canSetTag,
false /* changingPart */ );
break;
default:
diff --git a/src/test/isolation/expected/partition-key-update-1.out b/src/test/isolation/expected/partition-key-update-1.out
index 37fe6a7..3c23c69 100644
--- a/src/test/isolation/expected/partition-key-update-1.out
+++ b/src/test/isolation/expected/partition-key-update-1.out
@@ -55,6 +55,19 @@ step s1u2: <... completed>
error in steps s2c s1u2: ERROR: tuple to be locked was already moved to another partition due to concurrent update
step s1c: COMMIT;
+starting permutation: s1b s2b s2u3 s1u3 s2c s1c s1s
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2u3: UPDATE foo SET b = b || ' update2' WHERE a = 1;
+step s1u3: UPDATE foo SET a = a + 1, b = b || ' update1' WHERE a=1; <waiting ...>
+step s2c: COMMIT;
+step s1u3: <... completed>
+step s1c: COMMIT;
+step s1s: SELECT * FROM foo ORDER BY 1;
+a b
+
+2 ABC update2 update1
+
starting permutation: s1b s2b s1u3pc s2i s1c s2c
step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
diff --git a/src/test/isolation/specs/partition-key-update-1.spec b/src/test/isolation/specs/partition-key-update-1.spec
index 8393f47..eee0c42 100644
--- a/src/test/isolation/specs/partition-key-update-1.spec
+++ b/src/test/isolation/specs/partition-key-update-1.spec
@@ -1,5 +1,9 @@
-# Test that an error if thrown if the target row has been moved to a
-# different partition by a concurrent session.
+# Test concurrency scenarios when a row is being moved to a different partition:
+#
+# 1. An error is thrown if the target row has been moved to a different
+# partition by a concurrent session.
+# 2. Row that ends up in a new partition should contain changes made by
+# a concurrent transaction.
setup
{
@@ -50,8 +54,10 @@ session "s1"
step "s1b" { BEGIN ISOLATION LEVEL READ COMMITTED; }
step "s1u" { UPDATE foo SET a=2 WHERE a=1; }
step "s1u2" { UPDATE footrg SET b='EFG' WHERE a=1; }
+step "s1u3" { UPDATE foo SET a = a + 1, b = b || ' update1' WHERE a=1; }
step "s1u3pc" { UPDATE foo_range_parted SET a=11 WHERE a=7; }
step "s1u3npc" { UPDATE foo_range_parted SET b='XYZ' WHERE a=7; }
+step "s1s" { SELECT * FROM foo ORDER BY 1; }
step "s1c" { COMMIT; }
step "s1r" { ROLLBACK; }
@@ -59,6 +65,7 @@ session "s2"
step "s2b" { BEGIN ISOLATION LEVEL READ COMMITTED; }
step "s2u" { UPDATE foo SET b='EFG' WHERE a=1; }
step "s2u2" { UPDATE footrg SET b='XYZ' WHERE a=1; }
+step "s2u3" { UPDATE foo SET b = b || ' update2' WHERE a = 1; }
step "s2i" { INSERT INTO bar VALUES(7); }
step "s2d" { DELETE FROM foo WHERE a=1; }
step "s2c" { COMMIT; }
@@ -73,6 +80,11 @@ permutation "s1b" "s2b" "s1u2" "s1c" "s2u2" "s2c"
permutation "s1b" "s2b" "s1u2" "s2u2" "s1c" "s2c"
permutation "s1b" "s2b" "s2u2" "s1u2" "s2c" "s1c"
+# Transaction A is moving a row into another partition, but is waiting for
+# another transaction B that is updating the original row. The row that ends up
+# in the new partition should contain the changes made by transaction B.
+permutation "s1b" "s2b" "s2u3" "s1u3" "s2c" "s1c" "s1s"
+
# Concurrency error from ExecLockRows
# test waiting for moved row itself
permutation "s1b" "s2b" "s1u3pc" "s2i" "s1c" "s2c"
On Tue, Jun 5, 2018 at 8:03 PM, Amit Khandekar <amitdkhan.pg@gmail.com>
wrote:
Attached is a rebased patch version. Also included it in the upcoming
commitfest :
https://commitfest.postgresql.org/18/1660/
Doesn't this belong to PostgreSQL 11 Open Items [1] or are you proposing it
as a feature enhancement for the next version?
In the rebased version, the new test cases are added in the existing
isolation/specs/partition-key-update-1.spec test.
if (!tuple_deleted)
- return NULL;
+ {
+ /*
+ * epqslot will be typically NULL. But when ExecDelete() finds
+ * that another transaction has concurrently updated the same
+ * row, it re-fetches the row, skips the delete, and epqslot is
+ * set to the re-fetched tuple slot. In that case, we need to
+ * do all the checks again.
+ */
+ if (TupIsNull(epqslot))
+ return NULL;
+ else
+ {
+ slot = ExecFilterJunk(resultRelInfo->ri_junkFilter, epqslot);
+ tuple = ExecMaterializeSlot(slot);
+ goto lreplace;
+ }
+ }
I think this will allow BEFORE ROW DELETE triggers to be fired more than
once. Normally, if the EvalPlanQual testing generates a new tuple, we
don't fire the triggers again. Now, one possibility could be that we don't
skip the delete after a concurrent update and still allow inserts to use a
new tuple generated by EvalPlanQual testing of delete. However, I think we
need to perform rechecks for the update to check if the modified tuple still
needs to be moved to a new partition, right? Or do you have some other
reason in mind?
[1]: https://wiki.postgresql.org/wiki/PostgreSQL_11_Open_Items
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 7 June 2018 at 11:44, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Jun 5, 2018 at 8:03 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Attached is a rebased patch version. Also included it in the upcoming
commitfest :
https://commitfest.postgresql.org/18/1660/

Doesn't this belong to PostgreSQL 11 Open Items [1] or are you proposing it
as a feature enhancement for the next version?
Yes, it needs to be in the Open items. I will have it added in that list.
In the rebased version, the new test cases are added in the existing
isolation/specs/partition-key-update-1.spec test.

if (!tuple_deleted)
- return NULL;
+ {
+ /*
+ * epqslot will be typically NULL. But when ExecDelete() finds
+ * that another transaction has concurrently updated the same
+ * row, it re-fetches the row, skips the delete, and epqslot is
+ * set to the re-fetched tuple slot. In that case, we need to
+ * do all the checks again.
+ */
+ if (TupIsNull(epqslot))
+ return NULL;
+ else
+ {
+ slot = ExecFilterJunk(resultRelInfo->ri_junkFilter, epqslot);
+ tuple = ExecMaterializeSlot(slot);
+ goto lreplace;
+ }
+ }

I think this will allow BEFORE ROW DELETE triggers to be fired more than
once. Normally, if the EvalPlanQual testing generates a new tuple, we don't
fire the triggers again.
If there are BR delete triggers, the tuple will be locked using
GetTupleForTrigger(). So the subsequent EvalPlanQual testing won't be
run, since the tuple is already locked due to triggers having run.
But that leads me to think: The same concurrency issue can occur in
GetTupleForTrigger() also. Say, a concurrent session has already
locked the tuple, and GetTupleForTrigger() would wait and then return
the updated tuple in its last parameter newSlot. In that case, we need
to pass this slot back through ExecBRDeleteTriggers(), and further
through the epqslot parameter of ExecDelete(). But yes, in this case, we
should avoid calling this trigger function the second time.
If you agree on the above, I will send an updated patch.
Now, one possibility could be that we don't skip
the delete after a concurrent update and still allow inserts to use a new
tuple generated by EvalPlanQual testing of delete. However, I think we need
to perform rechecks for the update to check if the modified tuple still
needs to be moved to a new partition, right? Or do you have some other
reason in mind?
Yes, that's the reason we need to start again from lreplace; i.e., to
redo not just the partition constraint checks but all the other checks
that were required earlier, because the tuple has been modified. It
may even end up moving to a different partition than the one chosen
earlier.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On Tue, Jun 5, 2018 at 8:03 PM, Amit Khandekar <amitdkhan.pg@gmail.com>
wrote:
Attached is a rebased patch version. Also included it in the upcoming
commitfest :
https://commitfest.postgresql.org/18/1660/

In the rebased version, the new test cases are added in the existing
isolation/specs/partition-key-update-1.spec test.
+ /*
+ * If this is part of an UPDATE of partition-key, the
+ * epq tuple will contain the changes from this
+ * transaction over and above the updates done by the
+ * other transaction. The caller should now use this
+ * tuple as its NEW tuple, rather than the earlier NEW
+ * tuple.
+ */
+ if (epqslot)
+ {
+ *epqslot = my_epqslot;
+ return NULL;
+ }
I think we need a similar fix if there is a BR DELETE trigger and
ExecDelete is blocked on heap_lock_tuple because the concurrent transaction
is updating the same row, because in such a case it would have already got
the final tuple, so heap_delete will return HeapTupleMayBeUpdated.
The below test can reproduce the issue.
CREATE TABLE pa_target (key integer, val text) PARTITION BY LIST (key);
CREATE TABLE part1 PARTITION OF pa_target FOR VALUES IN (1);
CREATE TABLE part2 PARTITION OF pa_target FOR VALUES IN (2);
CREATE TABLE deleted_row (count int);
CREATE OR REPLACE FUNCTION br_delete() RETURNS trigger AS
$$BEGIN
insert into deleted_row values(OLD.key);
RETURN OLD;
END;$$ LANGUAGE plpgsql;
CREATE TRIGGER test_br_trig BEFORE DELETE ON part1 FOR EACH ROW EXECUTE
PROCEDURE br_delete();
INSERT INTO pa_target VALUES (1, 'initial1');
session1:
postgres=# BEGIN;
BEGIN
postgres=# UPDATE pa_target SET val = val || ' updated by update1' WHERE
key = 1;
UPDATE 1
session2:
postgres=# UPDATE pa_target SET val = val || ' updated by update2', key =
key + 1 WHERE key =1;
<blocks>
session1:
postgres=# commit;
COMMIT
UPDATE 1
postgres=# select * from pa_target ;
key | val
-----+-----------------------------
2 | initial1 updated by update2 --> session1's update is overwritten.
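With the trigger-aware fix being discussed, one would instead expect both
updates to be preserved, along these lines (a hand-sketched expectation,
not captured output):

postgres=# select * from pa_target ;
 key |                      val
-----+------------------------------------------------
   2 | initial1 updated by update1 updated by update2
(1 row)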
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, Jun 8, 2018 at 2:23 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Tue, Jun 5, 2018 at 8:03 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Attached is a rebased patch version. Also included it in the upcoming
commitfest :
https://commitfest.postgresql.org/18/1660/

In the rebased version, the new test cases are added in the existing
isolation/specs/partition-key-update-1.spec test.

+ /*
+ * If this is part of an UPDATE of partition-key, the
+ * epq tuple will contain the changes from this
+ * transaction over and above the updates done by the
+ * other transaction. The caller should now use this
+ * tuple as its NEW tuple, rather than the earlier NEW
+ * tuple.
+ */
+ if (epqslot)
+ {
+ *epqslot = my_epqslot;
+ return NULL;
+ }

I think we need a similar fix if there is a BR DELETE trigger and
ExecDelete is blocked on heap_lock_tuple because the concurrent transaction
is updating the same row, because in such a case it would have already got
the final tuple, so heap_delete will return HeapTupleMayBeUpdated.
Amit Kapila had pointed out (offlist) that you are already trying to fix this
issue. I have one more question: Suppose the first time we come to
ExecDelete we fire the BR delete trigger and then realize that we need to
retry because of the concurrent update. Now, during the retry, we realize that
because of the latest value we don't need to route the tuple to a
different partition, so now we will call ExecUpdate and may fire the
BR UPDATE triggers as well?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On 8 June 2018 at 16:44, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Jun 8, 2018 at 2:23 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Tue, Jun 5, 2018 at 8:03 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Attached is a rebased patch version. Also included it in the upcoming
commitfest :
https://commitfest.postgresql.org/18/1660/

In the rebased version, the new test cases are added in the existing
isolation/specs/partition-key-update-1.spec test.

+ /*
+ * If this is part of an UPDATE of partition-key, the
+ * epq tuple will contain the changes from this
+ * transaction over and above the updates done by the
+ * other transaction. The caller should now use this
+ * tuple as its NEW tuple, rather than the earlier NEW
+ * tuple.
+ */
+ if (epqslot)
+ {
+ *epqslot = my_epqslot;
+ return NULL;
+ }

I think we need a similar fix if there is a BR DELETE trigger and
ExecDelete is blocked on heap_lock_tuple because the concurrent transaction
is updating the same row, because in such a case it would have already got
the final tuple, so heap_delete will return HeapTupleMayBeUpdated.

Amit Kapila had pointed out (offlist) that you are already trying to fix this
issue.
Yes, his comment about triggers made me realize the trigger issue
which you mentioned, about which I also explained in the earlier reply.
I have one more question: Suppose the first time we come to
ExecDelete we fire the BR delete trigger and then realize that we need to
retry because of the concurrent update. Now, during the retry, we realize that
because of the latest value we don't need to route the tuple to a
different partition, so now we will call ExecUpdate and may fire the
BR UPDATE triggers as well?
No, we don't fire the BR update trigger again. We go to the lreplace
label, and the BR trigger code is above that label. The reason we don't
need to run the trigger check again is that if we had already fired
it before, the row must have been locked, so the concurrency scenario
won't arise in the first place.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On Mon, Jun 11, 2018 at 10:19 AM, Amit Khandekar <amitdkhan.pg@gmail.com>
wrote:
On 8 June 2018 at 16:44, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Jun 8, 2018 at 2:23 PM, Dilip Kumar <dilipbalaut@gmail.com>
wrote:
On Tue, Jun 5, 2018 at 8:03 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Attached is a rebased patch version. Also included it in the upcoming
commitfest :
https://commitfest.postgresql.org/18/1660/

In the rebased version, the new test cases are added in the existing
isolation/specs/partition-key-update-1.spec test.

+ /*
+ * If this is part of an UPDATE of partition-key, the
+ * epq tuple will contain the changes from this
+ * transaction over and above the updates done by the
+ * other transaction. The caller should now use this
+ * tuple as its NEW tuple, rather than the earlier NEW
+ * tuple.
+ */
+ if (epqslot)
+ {
+ *epqslot = my_epqslot;
+ return NULL;
+ }

I think we need a similar fix if there is a BR DELETE trigger and
ExecDelete is blocked on heap_lock_tuple because the concurrent transaction
is updating the same row, because in such a case it would have already got
the final tuple, so heap_delete will return HeapTupleMayBeUpdated.

Amit Kapila had pointed out (offlist) that you are already trying to fix this
issue.

Yes, his comment about triggers made me realize the trigger issue
which you mentioned, about which I also explained in the earlier reply.

I have one more question: Suppose the first time we come to
ExecDelete we fire the BR delete trigger and then realize that we need to
retry because of the concurrent update. Now, during the retry, we realize
that because of the latest value we don't need to route the tuple to a
different partition, so now we will call ExecUpdate and may fire the
BR UPDATE triggers as well?

No, we don't fire the BR update trigger again. We go to the lreplace
label, and the BR trigger code is above that label. The reason we don't
need to run the trigger check again is that if we had already fired
it before, the row must have been locked, so the concurrency scenario
won't arise in the first place.

Ok, that makes sense.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jun 7, 2018 at 1:53 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On 7 June 2018 at 11:44, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Jun 5, 2018 at 8:03 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

I think this will allow BEFORE ROW DELETE triggers to be fired more than
once. Normally, if the EvalPlanQual testing generates a new tuple, we don't
fire the triggers again.

If there are BR delete triggers, the tuple will be locked using
GetTupleForTrigger(). So the subsequent EvalPlanQual testing won't be
run, since the tuple is already locked due to triggers having run.

But that leads me to think: The same concurrency issue can occur in
GetTupleForTrigger() also. Say, a concurrent session has already
locked the tuple, and GetTupleForTrigger() would wait and then return
the updated tuple in its last parameter newSlot. In that case, we need
to pass this slot back through ExecBRDeleteTriggers(), and further
through the epqslot parameter of ExecDelete(). But yes, in this case, we
should avoid calling this trigger function the second time.

If you agree on the above, I will send an updated patch.
Sounds reasonable to me.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Jun 11, 2018 at 3:02 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Jun 7, 2018 at 1:53 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On 7 June 2018 at 11:44, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Jun 5, 2018 at 8:03 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

I think this will allow BEFORE ROW DELETE triggers to be fired more than
once. Normally, if the EvalPlanQual testing generates a new tuple, we don't
fire the triggers again.

If there are BR delete triggers, the tuple will be locked using
GetTupleForTrigger(). So the subsequent EvalPlanQual testing won't be
run, since the tuple is already locked due to triggers having run.

But that leads me to think: The same concurrency issue can occur in
GetTupleForTrigger() also. Say, a concurrent session has already
locked the tuple, and GetTupleForTrigger() would wait and then return
the updated tuple in its last parameter newSlot. In that case, we need
to pass this slot back through ExecBRDeleteTriggers(), and further
through the epqslot parameter of ExecDelete(). But yes, in this case, we
should avoid calling this trigger function the second time.

If you agree on the above, I will send an updated patch.
Sounds reasonable to me.
Try to add a test case which covers the BR trigger code path where you are
planning to update.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 11 June 2018 at 15:29, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Jun 11, 2018 at 3:02 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Jun 7, 2018 at 1:53 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On 7 June 2018 at 11:44, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Jun 5, 2018 at 8:03 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

I think this will allow BEFORE ROW DELETE triggers to be fired more than
once. Normally, if the EvalPlanQual testing generates a new tuple, we don't
fire the triggers again.

If there are BR delete triggers, the tuple will be locked using
GetTupleForTrigger(). So the subsequent EvalPlanQual testing won't be
run, since the tuple is already locked due to triggers having run.

But that leads me to think: The same concurrency issue can occur in
GetTupleForTrigger() also. Say, a concurrent session has already
locked the tuple, and GetTupleForTrigger() would wait and then return
the updated tuple in its last parameter newSlot. In that case, we need
to pass this slot back through ExecBRDeleteTriggers(), and further
through the epqslot parameter of ExecDelete(). But yes, in this case, we
should avoid calling this trigger function the second time.

If you agree on the above, I will send an updated patch.
Sounds reasonable to me.
Attached is the v2 version of the patch. It contains a fix for the above
trigger-related issue.

The updated tuple is passed back using the existing newslot parameter
of GetTupleForTrigger(). When ExecBRDeleteTriggers() is called using a
new epqslot parameter, it means the caller wants to skip the trigger
execution, because the updated tuple needs to be checked again for
constraints. I have added comments describing this behaviour in the
ExecBRDeleteTriggers() function header.
Because the trigger execution is skipped the first time ExecDelete()
is called, I didn't have to explicitly add any changes for preventing
trigger execution the second time.
Try to add a test case which covers the BR trigger code path where you are
planning to update.
Added the test cases. Also added cases where the concurrent session
causes the EvalPlanQual() test to fail, so the partition key update
does not happen.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
fix_concurrency_bug_v2.patch
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 57519fe..bd61c9a 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2726,27 +2726,60 @@ ExecASDeleteTriggers(EState *estate, ResultRelInfo *relinfo,
false, NULL, NULL, NIL, NULL, transition_capture);
}
+/* ----------
+ * ExecBRDeleteTriggers()
+ *
+ * Called to execute BEFORE ROW DELETE triggers.
+ *
+ * Returns false in following scenarios :
+ * 1. Trigger function returned NULL.
+ * 2. The tuple was concurrently deleted OR it was concurrently updated and the
+ * new tuple didn't pass EvalPlanQual() test.
+ * 3. The tuple was concurrently updated and the new tuple passed the
+ * EvalPlanQual() test, BUT epqslot parameter is not NULL. In such case, this
+ * function skips the trigger function execution, because the caller has
+ * indicated that it wants to further process the updated tuple. The updated
+ * tuple slot is passed back through epqslot.
+ *
+ * In all other cases, returns true.
+ * ----------
+ */
bool
ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
ResultRelInfo *relinfo,
ItemPointer tupleid,
- HeapTuple fdw_trigtuple)
+ HeapTuple fdw_trigtuple,
+ TupleTableSlot **epqslot)
{
TriggerDesc *trigdesc = relinfo->ri_TrigDesc;
bool result = true;
TriggerData LocTriggerData;
HeapTuple trigtuple;
HeapTuple newtuple;
- TupleTableSlot *newSlot;
int i;
Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
if (fdw_trigtuple == NULL)
{
+ TupleTableSlot *newSlot;
+
trigtuple = GetTupleForTrigger(estate, epqstate, relinfo, tupleid,
LockTupleExclusive, &newSlot);
+
+ /* If requested, pass back the concurrently updated tuple, if any. */
+ if (epqslot != NULL)
+ *epqslot = newSlot;
+
if (trigtuple == NULL)
return false;
+
+ /*
+ * If the tuple was concurrently updated and the caller of this
+ * function requested for the updated tuple, skip the trigger
+ * execution.
+ */
+ if (newSlot != NULL && epqslot != NULL)
+ return false;
}
else
trigtuple = fdw_trigtuple;
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 41e857e..5006baa 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -531,7 +531,7 @@ ExecSimpleRelationDelete(EState *estate, EPQState *epqstate,
{
skip_tuple = !ExecBRDeleteTriggers(estate, epqstate, resultRelInfo,
&searchslot->tts_tuple->t_self,
- NULL);
+ NULL, NULL);
}
if (!skip_tuple)
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 7e0b867..b12a41f 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -609,7 +609,9 @@ ExecInsert(ModifyTableState *mtstate,
* foreign table, tupleid is invalid; the FDW has to figure out
* which row to delete using data from the planSlot. oldtuple is
* passed to foreign table triggers; it is NULL when the foreign
- * table has no relevant triggers.
+ * table has no relevant triggers. When this DELETE is a part of
+ * an UPDATE of partition-key, then the slot returned by
+ * EvalPlanQual() is passed back using output parameter epqslot.
*
* Returns RETURNING result if any, otherwise NULL.
* ----------------------------------------------------------------
@@ -622,6 +624,7 @@ ExecDelete(ModifyTableState *mtstate,
EPQState *epqstate,
EState *estate,
bool *tupleDeleted,
+ TupleTableSlot **epqslot,
bool processReturning,
bool canSetTag,
bool changingPart)
@@ -649,7 +652,7 @@ ExecDelete(ModifyTableState *mtstate,
bool dodelete;
dodelete = ExecBRDeleteTriggers(estate, epqstate, resultRelInfo,
- tupleid, oldtuple);
+ tupleid, oldtuple, epqslot);
if (!dodelete) /* "do nothing" */
return NULL;
@@ -769,19 +772,37 @@ ldelete:;
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
- TupleTableSlot *epqslot;
+ TupleTableSlot *my_epqslot;
- epqslot = EvalPlanQual(estate,
+ my_epqslot = EvalPlanQual(estate,
epqstate,
resultRelationDesc,
resultRelInfo->ri_RangeTableIndex,
LockTupleExclusive,
&hufd.ctid,
hufd.xmax);
- if (!TupIsNull(epqslot))
+ if (!TupIsNull(my_epqslot))
{
*tupleid = hufd.ctid;
- goto ldelete;
+
+ /*
+ * If this is part of an UPDATE of partition-key, the
+ * epq tuple will contain the changes from this
+ * transaction over and above the updates done by the
+ * other transaction. The caller should now use this
+ * tuple as its NEW tuple, rather than the earlier NEW
+ * tuple.
+ */
+ if (epqslot)
+ {
+ *epqslot = my_epqslot;
+ return NULL;
+ }
+ else
+ {
+ /* Normal DELETE: So delete the re-fetched row. */
+ goto ldelete;
+ }
}
}
/* tuple already deleted; nothing to do */
@@ -1052,6 +1073,7 @@ lreplace:;
{
bool tuple_deleted;
TupleTableSlot *ret_slot;
+ TupleTableSlot *epqslot = NULL;
PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
int map_index;
TupleConversionMap *tupconv_map;
@@ -1081,7 +1103,7 @@ lreplace:;
* processing. We want to return rows from INSERT.
*/
ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate,
- estate, &tuple_deleted, false,
+ estate, &tuple_deleted, &epqslot, false,
false /* canSetTag */ , true /* changingPart */ );
/*
@@ -1105,7 +1127,23 @@ lreplace:;
* resurrect it.
*/
if (!tuple_deleted)
- return NULL;
+ {
+ /*
+ * epqslot will be typically NULL. But when ExecDelete() finds
+ * that another transaction has concurrently updated the same
+ * row, it re-fetches the row, skips the delete, and epqslot is
+ * set to the re-fetched tuple slot. In that case, we need to
+ * do all the checks again.
+ */
+ if (TupIsNull(epqslot))
+ return NULL;
+ else
+ {
+ slot = ExecFilterJunk(resultRelInfo->ri_junkFilter, epqslot);
+ tuple = ExecMaterializeSlot(slot);
+ goto lreplace;
+ }
+ }
/*
* Updates set the transition capture map only when a new subplan
@@ -2136,7 +2174,7 @@ ExecModifyTable(PlanState *pstate)
case CMD_DELETE:
slot = ExecDelete(node, tupleid, oldtuple, planSlot,
&node->mt_epqstate, estate,
- NULL, true, node->canSetTag,
+ NULL, NULL, true, node->canSetTag,
false /* changingPart */ );
break;
default:
diff --git a/src/include/commands/trigger.h b/src/include/commands/trigger.h
index a5b8610..1031448 100644
--- a/src/include/commands/trigger.h
+++ b/src/include/commands/trigger.h
@@ -206,7 +206,8 @@ extern bool ExecBRDeleteTriggers(EState *estate,
EPQState *epqstate,
ResultRelInfo *relinfo,
ItemPointer tupleid,
- HeapTuple fdw_trigtuple);
+ HeapTuple fdw_trigtuple,
+ TupleTableSlot **epqslot);
extern void ExecARDeleteTriggers(EState *estate,
ResultRelInfo *relinfo,
ItemPointer tupleid,
diff --git a/src/test/isolation/expected/partition-key-update-4.out b/src/test/isolation/expected/partition-key-update-4.out
new file mode 100644
index 0000000..ab05a9d
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-4.out
@@ -0,0 +1,53 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1b s2b s2u1 s1u s2c s1c s1s
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2u1: UPDATE foo SET b = b || ' update2' WHERE a = 1;
+step s1u: UPDATE foo SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; <waiting ...>
+step s2c: COMMIT;
+step s1u: <... completed>
+step s1c: COMMIT;
+step s1s: SELECT tableoid::regclass, * FROM foo ORDER BY a;
+tableoid a b
+
+foo2 2 ABC update2 update1
+
+starting permutation: s1b s2b s2ut1 s1ut s2c s1c s1st
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2ut1: UPDATE footrg SET b = b || ' update2' WHERE a = 1;
+step s1ut: UPDATE footrg SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; <waiting ...>
+step s2c: COMMIT;
+step s1ut: <... completed>
+step s1c: COMMIT;
+step s1st: SELECT tableoid::regclass, * FROM footrg ORDER BY a;
+tableoid a b
+
+footrg2 2 ABC update2 update1
+
+starting permutation: s1b s2b s2u2 s1u s2c s1c s1s
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2u2: UPDATE foo SET b = 'EFG' WHERE a = 1;
+step s1u: UPDATE foo SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; <waiting ...>
+step s2c: COMMIT;
+step s1u: <... completed>
+step s1c: COMMIT;
+step s1s: SELECT tableoid::regclass, * FROM foo ORDER BY a;
+tableoid a b
+
+foo1 1 EFG
+
+starting permutation: s1b s2b s2ut2 s1ut s2c s1c s1st
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2ut2: UPDATE footrg SET b = 'EFG' WHERE a = 1;
+step s1ut: UPDATE footrg SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; <waiting ...>
+step s2c: COMMIT;
+step s1ut: <... completed>
+step s1c: COMMIT;
+step s1st: SELECT tableoid::regclass, * FROM footrg ORDER BY a;
+tableoid a b
+
+footrg1 1 EFG
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 0e99721..d5594e8 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -74,4 +74,5 @@ test: predicate-gin-nomatch
test: partition-key-update-1
test: partition-key-update-2
test: partition-key-update-3
+test: partition-key-update-4
test: plpgsql-toast
diff --git a/src/test/isolation/specs/partition-key-update-4.spec b/src/test/isolation/specs/partition-key-update-4.spec
new file mode 100644
index 0000000..98b4aba
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-4.spec
@@ -0,0 +1,66 @@
+# Test that a row that ends up in a new partition contains changes made by
+# a concurrent transaction.
+
+setup
+{
+ --
+ -- Setup to test concurrent handling of ExecDelete().
+ --
+ CREATE TABLE foo (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'ABC');
+
+ --
+ -- Setup to test concurrent handling of GetTupleForTrigger().
+ --
+ CREATE TABLE footrg (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE footrg1 PARTITION OF footrg FOR VALUES IN (1);
+ CREATE TABLE footrg2 PARTITION OF footrg FOR VALUES IN (2);
+ INSERT INTO footrg VALUES (1, 'ABC');
+ CREATE FUNCTION func_footrg() RETURNS TRIGGER AS $$
+ BEGIN
+ RETURN OLD;
+ END $$ LANGUAGE PLPGSQL;
+ CREATE TRIGGER footrg_ondel BEFORE DELETE ON footrg1
+ FOR EACH ROW EXECUTE PROCEDURE func_footrg();
+
+}
+
+teardown
+{
+ DROP TABLE foo;
+ DROP TRIGGER footrg_ondel ON footrg1;
+ DROP FUNCTION func_footrg();
+ DROP TABLE footrg;
+}
+
+session "s1"
+step "s1b" { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1u" { UPDATE foo SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; }
+step "s1ut" { UPDATE footrg SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; }
+step "s1s" { SELECT tableoid::regclass, * FROM foo ORDER BY a; }
+step "s1st" { SELECT tableoid::regclass, * FROM footrg ORDER BY a; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2b" { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s2u1" { UPDATE foo SET b = b || ' update2' WHERE a = 1; }
+step "s2u2" { UPDATE foo SET b = 'EFG' WHERE a = 1; }
+step "s2ut1" { UPDATE footrg SET b = b || ' update2' WHERE a = 1; }
+step "s2ut2" { UPDATE footrg SET b = 'EFG' WHERE a = 1; }
+step "s2c" { COMMIT; }
+
+
+# Session A is moving a row into another partition, but is waiting for
+# another session B that is updating the original row. The row that ends up
+# in the new partition should contain the changes made by session B.
+permutation "s1b" "s2b" "s2u1" "s1u" "s2c" "s1c" "s1s"
+
+# Same as above, except session A is waiting in GetTupleForTrigger().
+permutation "s1b" "s2b" "s2ut1" "s1ut" "s2c" "s1c" "s1st"
+
+# Below two cases are similar to the above two; except that the session A
+# fails EvalPlanQual() test, so partition key update does not happen.
+permutation "s1b" "s2b" "s2u2" "s1u" "s2c" "s1c" "s1s"
+permutation "s1b" "s2b" "s2ut2" "s1ut" "s2c" "s1c" "s1st"
On Mon, Jun 18, 2018 at 10:21 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
Attached is the v2 version of the patch. It contains a fix for the above
trigger-related issue.

The updated tuple is passed back using the existing newslot parameter
of GetTupleForTrigger(). When ExecBRDeleteTriggers() is called using a
new epqslot parameter, it means the caller wants to skip the trigger
execution, because the updated tuple needs to be checked again for
constraints. I have added comments describing this behaviour in the
ExecBRDeleteTriggers() function header.
Thanks for the updated patch. I have verified the BR trigger
behaviour, it's working fine with the patch.
+ CREATE FUNCTION func_footrg() RETURNS TRIGGER AS $$
+ BEGIN
+ RETURN OLD;
+ END $$ LANGUAGE PLPGSQL;
+ CREATE TRIGGER footrg_ondel BEFORE DELETE ON footrg1
+ FOR EACH ROW EXECUTE PROCEDURE func_footrg();
Should we also create a test case where we can verify that some
unnecessary or duplicate triggers are not executed? For example, in
the above trigger function, we can maintain a count in some table (e.g.,
how many times the delete trigger got executed) and after the test is over
we can verify the same.
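A minimal sketch of that idea (the trigcount side table and the counting
logic are hypothetical, not part of the patch):

CREATE TABLE trigcount (n int);
CREATE OR REPLACE FUNCTION func_footrg() RETURNS TRIGGER AS $$
BEGIN
    -- record one row per trigger execution
    INSERT INTO trigcount VALUES (1);
    RETURN OLD;
END $$ LANGUAGE plpgsql;

-- After the permutation completes, this should return 1 if the
-- BR DELETE trigger fired exactly once:
SELECT count(*) FROM trigcount;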
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Mon, Jun 18, 2018 at 11:28 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Mon, Jun 18, 2018 at 10:21 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
Attached is the v2 version of the patch. It contains a fix for the above
trigger-related issue.

The updated tuple is passed back using the existing newslot parameter
of GetTupleForTrigger(). When ExecBRDeleteTriggers() is called using a
new epqslot parameter, it means the caller wants to skip the trigger
execution, because the updated tuple needs to be checked again for
constraints. I have added comments describing this behaviour in the
ExecBRDeleteTriggers() function header.

Thanks for the updated patch. I have verified the BR trigger
behaviour, it's working fine with the patch.

+ CREATE FUNCTION func_footrg() RETURNS TRIGGER AS $$
+ BEGIN
+ RETURN OLD;
+ END $$ LANGUAGE PLPGSQL;
+ CREATE TRIGGER footrg_ondel BEFORE DELETE ON footrg1
+ FOR EACH ROW EXECUTE PROCEDURE func_footrg();

Should we also create a test case where we can verify that some
unnecessary or duplicate triggers are not executed?
I am not sure how much value we will add by having such a test. In
general, it is good to have tests that cover various aspects of
functionality, but OTOH, we have to be careful to not overdo it.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 18 June 2018 at 17:56, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Jun 18, 2018 at 11:28 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
Should we also create a test case where we can verify that some
unnecessary or duplicate triggers are not executed?

I am not sure how much value we will add by having such a test. In
general, it is good to have tests that cover various aspects of
functionality, but OTOH, we have to be careful to not overdo it.
Actually I am thinking, it's not a big deal adding a RAISE statement
in the trigger function in the existing test cases. It will clearly show how
many times the trigger has executed. So I will go ahead and do that.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On Mon, Jun 18, 2018 at 6:19 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On 18 June 2018 at 17:56, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Jun 18, 2018 at 11:28 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
Should we also create a test case where we can verify that some
unnecessary or duplicate triggers are not executed?

I am not sure how much value we will add by having such a test. In
general, it is good to have tests that cover various aspects of
functionality, but OTOH, we have to be careful to not overdo it.

Actually I am thinking, it's not a big deal adding a RAISE statement
in the trigger function in the existing test cases. It will clearly show how
many times the trigger has executed. So I will go ahead and do that.

Ok, that makes sense to me.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On 19 June 2018 at 13:06, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Mon, Jun 18, 2018 at 6:19 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On 18 June 2018 at 17:56, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Jun 18, 2018 at 11:28 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
Should we also create a test case where we can verify that some
unnecessary or duplicate triggers are not executed?

I am not sure how much value we will add by having such a test. In
general, it is good to have tests that cover various aspects of
functionality, but OTOH, we have to be careful to not overdo it.

Actually I am thinking, it's not a big deal adding a RAISE statement
in the trigger function in the existing test cases. It will clearly show how
many times the trigger has executed. So I will go ahead and do that.

Ok, that makes sense to me.
Could not add the RAISE statement, because the isolation tester does not
seem to print those statements in the output. Instead, I have dumped
some rows into a new table through the trigger function, and did a
select from that table. Attached is the v3 patch.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
fix_concurrency_bug_v3.patch
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 57519fe..bd61c9a 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2726,27 +2726,60 @@ ExecASDeleteTriggers(EState *estate, ResultRelInfo *relinfo,
false, NULL, NULL, NIL, NULL, transition_capture);
}
+/* ----------
+ * ExecBRDeleteTriggers()
+ *
+ * Called to execute BEFORE ROW DELETE triggers.
+ *
+ * Returns false in following scenarios :
+ * 1. Trigger function returned NULL.
+ * 2. The tuple was concurrently deleted OR it was concurrently updated and the
+ * new tuple didn't pass EvalPlanQual() test.
+ * 3. The tuple was concurrently updated and the new tuple passed the
+ * EvalPlanQual() test, BUT epqslot parameter is not NULL. In such case, this
+ * function skips the trigger function execution, because the caller has
+ * indicated that it wants to further process the updated tuple. The updated
+ * tuple slot is passed back through epqslot.
+ *
+ * In all other cases, returns true.
+ * ----------
+ */
bool
ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
ResultRelInfo *relinfo,
ItemPointer tupleid,
- HeapTuple fdw_trigtuple)
+ HeapTuple fdw_trigtuple,
+ TupleTableSlot **epqslot)
{
TriggerDesc *trigdesc = relinfo->ri_TrigDesc;
bool result = true;
TriggerData LocTriggerData;
HeapTuple trigtuple;
HeapTuple newtuple;
- TupleTableSlot *newSlot;
int i;
Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
if (fdw_trigtuple == NULL)
{
+ TupleTableSlot *newSlot;
+
trigtuple = GetTupleForTrigger(estate, epqstate, relinfo, tupleid,
LockTupleExclusive, &newSlot);
+
+ /* If requested, pass back the concurrently updated tuple, if any. */
+ if (epqslot != NULL)
+ *epqslot = newSlot;
+
if (trigtuple == NULL)
return false;
+
+ /*
+ * If the tuple was concurrently updated and the caller of this
+ * function requested for the updated tuple, skip the trigger
+ * execution.
+ */
+ if (newSlot != NULL && epqslot != NULL)
+ return false;
}
else
trigtuple = fdw_trigtuple;
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 41e857e..5006baa 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -531,7 +531,7 @@ ExecSimpleRelationDelete(EState *estate, EPQState *epqstate,
{
skip_tuple = !ExecBRDeleteTriggers(estate, epqstate, resultRelInfo,
&searchslot->tts_tuple->t_self,
- NULL);
+ NULL, NULL);
}
if (!skip_tuple)
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 7e0b867..b12a41f 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -609,7 +609,9 @@ ExecInsert(ModifyTableState *mtstate,
* foreign table, tupleid is invalid; the FDW has to figure out
* which row to delete using data from the planSlot. oldtuple is
* passed to foreign table triggers; it is NULL when the foreign
- * table has no relevant triggers.
+ * table has no relevant triggers. When this DELETE is a part of
+ * an UPDATE of partition-key, then the slot returned by
+ * EvalPlanQual() is passed back using output parameter epqslot.
*
* Returns RETURNING result if any, otherwise NULL.
* ----------------------------------------------------------------
@@ -622,6 +624,7 @@ ExecDelete(ModifyTableState *mtstate,
EPQState *epqstate,
EState *estate,
bool *tupleDeleted,
+ TupleTableSlot **epqslot,
bool processReturning,
bool canSetTag,
bool changingPart)
@@ -649,7 +652,7 @@ ExecDelete(ModifyTableState *mtstate,
bool dodelete;
dodelete = ExecBRDeleteTriggers(estate, epqstate, resultRelInfo,
- tupleid, oldtuple);
+ tupleid, oldtuple, epqslot);
if (!dodelete) /* "do nothing" */
return NULL;
@@ -769,19 +772,37 @@ ldelete:;
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
- TupleTableSlot *epqslot;
+ TupleTableSlot *my_epqslot;
- epqslot = EvalPlanQual(estate,
+ my_epqslot = EvalPlanQual(estate,
epqstate,
resultRelationDesc,
resultRelInfo->ri_RangeTableIndex,
LockTupleExclusive,
&hufd.ctid,
hufd.xmax);
- if (!TupIsNull(epqslot))
+ if (!TupIsNull(my_epqslot))
{
*tupleid = hufd.ctid;
- goto ldelete;
+
+ /*
+ * If this is part of an UPDATE of partition-key, the
+ * epq tuple will contain the changes from this
+ * transaction over and above the updates done by the
+ * other transaction. The caller should now use this
+ * tuple as its NEW tuple, rather than the earlier NEW
+ * tuple.
+ */
+ if (epqslot)
+ {
+ *epqslot = my_epqslot;
+ return NULL;
+ }
+ else
+ {
+ /* Normal DELETE: So delete the re-fetched row. */
+ goto ldelete;
+ }
}
}
/* tuple already deleted; nothing to do */
@@ -1052,6 +1073,7 @@ lreplace:;
{
bool tuple_deleted;
TupleTableSlot *ret_slot;
+ TupleTableSlot *epqslot = NULL;
PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
int map_index;
TupleConversionMap *tupconv_map;
@@ -1081,7 +1103,7 @@ lreplace:;
* processing. We want to return rows from INSERT.
*/
ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate,
- estate, &tuple_deleted, false,
+ estate, &tuple_deleted, &epqslot, false,
false /* canSetTag */ , true /* changingPart */ );
/*
@@ -1105,7 +1127,23 @@ lreplace:;
* resurrect it.
*/
if (!tuple_deleted)
- return NULL;
+ {
+ /*
+ * epqslot will be typically NULL. But when ExecDelete() finds
+ * that another transaction has concurrently updated the same
+ * row, it re-fetches the row, skips the delete, and epqslot is
+ * set to the re-fetched tuple slot. In that case, we need to
+ * do all the checks again.
+ */
+ if (TupIsNull(epqslot))
+ return NULL;
+ else
+ {
+ slot = ExecFilterJunk(resultRelInfo->ri_junkFilter, epqslot);
+ tuple = ExecMaterializeSlot(slot);
+ goto lreplace;
+ }
+ }
/*
* Updates set the transition capture map only when a new subplan
@@ -2136,7 +2174,7 @@ ExecModifyTable(PlanState *pstate)
case CMD_DELETE:
slot = ExecDelete(node, tupleid, oldtuple, planSlot,
&node->mt_epqstate, estate,
- NULL, true, node->canSetTag,
+ NULL, NULL, true, node->canSetTag,
false /* changingPart */ );
break;
default:
diff --git a/src/include/commands/trigger.h b/src/include/commands/trigger.h
index a5b8610..1031448 100644
--- a/src/include/commands/trigger.h
+++ b/src/include/commands/trigger.h
@@ -206,7 +206,8 @@ extern bool ExecBRDeleteTriggers(EState *estate,
EPQState *epqstate,
ResultRelInfo *relinfo,
ItemPointer tupleid,
- HeapTuple fdw_trigtuple);
+ HeapTuple fdw_trigtuple,
+ TupleTableSlot **epqslot);
extern void ExecARDeleteTriggers(EState *estate,
ResultRelInfo *relinfo,
ItemPointer tupleid,
diff --git a/src/test/isolation/expected/partition-key-update-4.out b/src/test/isolation/expected/partition-key-update-4.out
new file mode 100644
index 0000000..774a7fa
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-4.out
@@ -0,0 +1,60 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1b s2b s2u1 s1u s2c s1c s1s
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2u1: UPDATE foo SET b = b || ' update2' WHERE a = 1;
+step s1u: UPDATE foo SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; <waiting ...>
+step s2c: COMMIT;
+step s1u: <... completed>
+step s1c: COMMIT;
+step s1s: SELECT tableoid::regclass, * FROM foo ORDER BY a;
+tableoid a b
+
+foo2 2 ABC update2 update1
+
+starting permutation: s1b s2b s2ut1 s1ut s2c s1c s1st s1stl
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2ut1: UPDATE footrg SET b = b || ' update2' WHERE a = 1;
+step s1ut: UPDATE footrg SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; <waiting ...>
+step s2c: COMMIT;
+step s1ut: <... completed>
+step s1c: COMMIT;
+step s1st: SELECT tableoid::regclass, * FROM footrg ORDER BY a;
+tableoid a b
+
+footrg2 2 ABC update2 update1
+step s1stl: SELECT * FROM triglog ORDER BY a;
+a b
+
+1 ABC update2 trigger
+
+starting permutation: s1b s2b s2u2 s1u s2c s1c s1s
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2u2: UPDATE foo SET b = 'EFG' WHERE a = 1;
+step s1u: UPDATE foo SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; <waiting ...>
+step s2c: COMMIT;
+step s1u: <... completed>
+step s1c: COMMIT;
+step s1s: SELECT tableoid::regclass, * FROM foo ORDER BY a;
+tableoid a b
+
+foo1 1 EFG
+
+starting permutation: s1b s2b s2ut2 s1ut s2c s1c s1st s1stl
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2ut2: UPDATE footrg SET b = 'EFG' WHERE a = 1;
+step s1ut: UPDATE footrg SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; <waiting ...>
+step s2c: COMMIT;
+step s1ut: <... completed>
+step s1c: COMMIT;
+step s1st: SELECT tableoid::regclass, * FROM footrg ORDER BY a;
+tableoid a b
+
+footrg1 1 EFG
+step s1stl: SELECT * FROM triglog ORDER BY a;
+a b
+
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 0e99721..d5594e8 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -74,4 +74,5 @@ test: predicate-gin-nomatch
test: partition-key-update-1
test: partition-key-update-2
test: partition-key-update-3
+test: partition-key-update-4
test: plpgsql-toast
diff --git a/src/test/isolation/specs/partition-key-update-4.spec b/src/test/isolation/specs/partition-key-update-4.spec
new file mode 100644
index 0000000..9336645
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-4.spec
@@ -0,0 +1,76 @@
+# Test that a row that ends up in a new partition contains changes made by
+# a concurrent transaction.
+
+setup
+{
+ --
+ -- Setup to test concurrent handling of ExecDelete().
+ --
+ CREATE TABLE foo (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'ABC');
+
+ --
+ -- Setup to test concurrent handling of GetTupleForTrigger().
+ --
+ CREATE TABLE footrg (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE triglog as select * from footrg;
+ CREATE TABLE footrg1 PARTITION OF footrg FOR VALUES IN (1);
+ CREATE TABLE footrg2 PARTITION OF footrg FOR VALUES IN (2);
+ INSERT INTO footrg VALUES (1, 'ABC');
+ CREATE FUNCTION func_footrg() RETURNS TRIGGER AS $$
+ BEGIN
+ OLD.b = OLD.b || ' trigger';
+
+ -- This will verify that the trigger is not run *before* the row is
+ -- refetched by EvalPlanQual. The OLD row should contain the changes made
+ -- by the concurrent session.
+ INSERT INTO triglog select OLD.*;
+
+ RETURN OLD;
+ END $$ LANGUAGE PLPGSQL;
+ CREATE TRIGGER footrg_ondel BEFORE DELETE ON footrg1
+ FOR EACH ROW EXECUTE PROCEDURE func_footrg();
+
+}
+
+teardown
+{
+ DROP TABLE foo;
+ DROP TRIGGER footrg_ondel ON footrg1;
+ DROP FUNCTION func_footrg();
+ DROP TABLE footrg;
+ DROP TABLE triglog;
+}
+
+session "s1"
+step "s1b" { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1u" { UPDATE foo SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; }
+step "s1ut" { UPDATE footrg SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; }
+step "s1s" { SELECT tableoid::regclass, * FROM foo ORDER BY a; }
+step "s1st" { SELECT tableoid::regclass, * FROM footrg ORDER BY a; }
+step "s1stl" { SELECT * FROM triglog ORDER BY a; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2b" { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s2u1" { UPDATE foo SET b = b || ' update2' WHERE a = 1; }
+step "s2u2" { UPDATE foo SET b = 'EFG' WHERE a = 1; }
+step "s2ut1" { UPDATE footrg SET b = b || ' update2' WHERE a = 1; }
+step "s2ut2" { UPDATE footrg SET b = 'EFG' WHERE a = 1; }
+step "s2c" { COMMIT; }
+
+
+# Session A is moving a row into another partition, but is waiting for
+# another session B that is updating the original row. The row that ends up
+# in the new partition should contain the changes made by session B.
+permutation "s1b" "s2b" "s2u1" "s1u" "s2c" "s1c" "s1s"
+
+# Same as above, except session A is waiting in GetTupleForTrigger().
+permutation "s1b" "s2b" "s2ut1" "s1ut" "s2c" "s1c" "s1st" "s1stl"
+
+# Below two cases are similar to the above two; except that the session A
+# fails EvalPlanQual() test, so partition key update does not happen.
+permutation "s1b" "s2b" "s2u2" "s1u" "s2c" "s1c" "s1s"
+permutation "s1b" "s2b" "s2ut2" "s1ut" "s2c" "s1c" "s1st" "s1stl"
On Tue, Jun 19, 2018 at 9:33 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
Could not add a RAISE statement, because the isolation tester does not
seem to print those statements in the output. Instead, I have dumped
some rows into a new table through the trigger function, and did a
SELECT from that table. Attached is the v3 patch.
Comments
-----------------
1.
+ /*
+ * If the tuple was concurrently updated and the caller of this
+ * function requested for the updated tuple, skip the trigger
+ * execution.
+ */
+ if (newSlot != NULL && epqslot != NULL)
+ return false;
There is a possibility of a memory leak due to the above change.
Basically, GetTupleForTrigger returns a newly allocated tuple. We
should free the triggertuple by calling heap_freetuple(trigtuple).
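For illustration, the fixed guard would look something like this (a
sketch only; it matches the shape of what the v4 patch later in the
thread actually does):

    if (newSlot != NULL && epqslot != NULL)
    {
        /* Free the tuple copy that GetTupleForTrigger allocated,
         * since both the trigger and the delete are skipped here. */
        heap_freetuple(trigtuple);
        return false;
    }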
2.
ExecBRDeleteTriggers(..)
{
..
+ /* If requested, pass back the concurrently updated tuple, if any. */
+ if (epqslot != NULL)
+ *epqslot = newSlot;
+
if (trigtuple == NULL)
return false;
+
+ /*
+ * If the tuple was concurrently updated and the caller of this
+ * function requested for the updated tuple, skip the trigger
+ * execution.
+ */
+ if (newSlot != NULL && epqslot != NULL)
+ return false;
..
}
Can't we merge the first change into the second, something like:
if (newSlot != NULL && epqslot != NULL)
{
*epqslot = newSlot;
return false;
}
3.
ExecBRDeleteTriggers(..)
{
- TupleTableSlot *newSlot;
int i;
Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
if (fdw_trigtuple == NULL)
{
+ TupleTableSlot *newSlot;
+
..
}
This change is okay on its own, but I think it has nothing to do with
this patch. If you don't mind, can we remove it?
4.
+/* ----------
+ * ExecBRDeleteTriggers()
+ *
+ * Called to execute BEFORE ROW DELETE triggers.
+ *
+ * Returns false in following scenarios :
+ * 1. Trigger function returned NULL.
+ * 2. The tuple was concurrently deleted OR it was concurrently updated and the
+ * new tuple didn't pass EvalPlanQual() test.
+ * 3. The tuple was concurrently updated and the new tuple passed the
+ * EvalPlanQual() test, BUT epqslot parameter is not NULL. In such case, this
+ * function skips the trigger function execution, because the caller has
+ * indicated that it wants to further process the updated tuple. The updated
+ * tuple slot is passed back through epqslot.
+ *
+ * In all other cases, returns true.
+ * ----------
+ */
bool
ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
The above comments appear unrelated to this function; for example, this
function is not at all aware of concurrent updates. I think if you
want to add comments to this function, we can add them as a separate
patch, or if you want to add them as part of this patch, at least make
them succinct.
5.
+ /*
+ * If this is part of an UPDATE of partition-key, the
+ * epq tuple will contain the changes from this
+ * transaction over and above the updates done by the
+ * other transaction. The caller should now use this
+ * tuple as its NEW tuple, rather than the earlier NEW
+ * tuple.
+ */
I think we can simplify this comment as "If requested, pass back the
updated tuple if any.", something similar to what you have at some
other place in the patch.
6.
+# Session A is moving a row into another partition, but is waiting for
+# another session B that is updating the original row. The row that ends up
+# in the new partition should contain the changes made by session B.
You have named the sessions s1 and s2, but the description uses A and
B; it would be better to use s1 and s2 respectively.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 20 June 2018 at 18:54, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Jun 19, 2018 at 9:33 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
Could not add a RAISE statement, because the isolation tester does not
seem to print those statements in the output. Instead, I have dumped
some rows into a new table through the trigger function, and did a
SELECT from that table. Attached is the v3 patch.
Comments
-----------------
1.
+ /*
+ * If the tuple was concurrently updated and the caller of this
+ * function requested for the updated tuple, skip the trigger
+ * execution.
+ */
+ if (newSlot != NULL && epqslot != NULL)
+ return false;
There is a possibility of a memory leak due to the above change. Basically,
GetTupleForTrigger returns a newly allocated tuple. We should free
the triggertuple by calling heap_freetuple(trigtuple).
Right, done.
2.
ExecBRDeleteTriggers(..)
{
..
+ /* If requested, pass back the concurrently updated tuple, if any. */
+ if (epqslot != NULL)
+ *epqslot = newSlot;
+
if (trigtuple == NULL)
return false;
+
+ /*
+ * If the tuple was concurrently updated and the caller of this
+ * function requested for the updated tuple, skip the trigger
+ * execution.
+ */
+ if (newSlot != NULL && epqslot != NULL)
+ return false;
..
}
Can't we merge the first change into the second, something like:
if (newSlot != NULL && epqslot != NULL)
{
*epqslot = newSlot;
return false;
}
We want to update *epqslot with whatever the value of newSlot is, so
if newSlot is NULL, *epqslot should be NULL. Hence we can't do the
assignment inside the "if (newSlot != NULL && epqslot != NULL)" branch.
3.
ExecBRDeleteTriggers(..)
{
- TupleTableSlot *newSlot;
int i;
Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
if (fdw_trigtuple == NULL)
{
+ TupleTableSlot *newSlot;
+
..
}
This change is okay on its own, but I think it has nothing to do with
this patch. If you don't mind, can we remove it?
Done.
4.
+/* ----------
+ * ExecBRDeleteTriggers()
+ *
+ * Called to execute BEFORE ROW DELETE triggers.
+ *
+ * Returns false in following scenarios :
+ * 1. Trigger function returned NULL.
+ * 2. The tuple was concurrently deleted OR it was concurrently updated and the
+ * new tuple didn't pass EvalPlanQual() test.
+ * 3. The tuple was concurrently updated and the new tuple passed the
+ * EvalPlanQual() test, BUT epqslot parameter is not NULL. In such case, this
+ * function skips the trigger function execution, because the caller has
+ * indicated that it wants to further process the updated tuple. The updated
+ * tuple slot is passed back through epqslot.
+ *
+ * In all other cases, returns true.
+ * ----------
+ */
bool
ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
The above comments appear unrelated to this function; for example, this
function is not at all aware of concurrent updates.
But with the patch, this function does become aware of concurrent
updates, because it makes use of the epqslot returned by
GetTupleForTrigger. This is similar to ExecBRUpdateTriggers() being
aware of concurrent updates.
The reason I made the return-value-related comment is that earlier
it was simple: if the trigger function returned NULL, this function
returned false. But now that has changed, so it's better if the change
is noted in the function header. And on HEAD, there is no function
header, so we need to explain when exactly this function returns true
or false.
I think if you want to add comments to this function, we can add them as a separate
patch, or if you want to add them as part of this patch, at least make
them succinct.
If it looks complicated, could you please suggest something to make it
a bit simpler?
5.
+ /*
+ * If this is part of an UPDATE of partition-key, the
+ * epq tuple will contain the changes from this
+ * transaction over and above the updates done by the
+ * other transaction. The caller should now use this
+ * tuple as its NEW tuple, rather than the earlier NEW
+ * tuple.
+ */
I think we can simplify this comment as "If requested, pass back the
updated tuple if any.", something similar to what you have at some
other place in the patch.
OK, done. Also changed the subsequent comment to:
/* Normal DELETE: So delete the updated row. */
6.
+# Session A is moving a row into another partition, but is waiting for
+# another session B that is updating the original row. The row that ends up
+# in the new partition should contain the changes made by session B.
You have named the sessions s1 and s2, but the description uses A and
B; it would be better to use s1 and s2 respectively.
Done.
Attached is the v4 version of the patch. Thanks!
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
fix_concurrency_bug_v4.patch (application/octet-stream)
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 57519fe..3969366 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2726,11 +2726,30 @@ ExecASDeleteTriggers(EState *estate, ResultRelInfo *relinfo,
false, NULL, NULL, NIL, NULL, transition_capture);
}
+/* ----------
+ * ExecBRDeleteTriggers()
+ *
+ * Called to execute BEFORE ROW DELETE triggers.
+ *
+ * Returns false in following scenarios :
+ * 1. Trigger function returned NULL.
+ * 2. The tuple was concurrently deleted OR it was concurrently updated and the
+ * new tuple didn't pass EvalPlanQual() test.
+ * 3. The tuple was concurrently updated and the new tuple passed the
+ * EvalPlanQual() test, BUT epqslot parameter is not NULL. In such case, this
+ * function skips the trigger function execution, because the caller has
+ * indicated that it wants to further process the updated tuple. The updated
+ * tuple slot is passed back through epqslot.
+ *
+ * In all other cases, returns true.
+ * ----------
+ */
bool
ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
ResultRelInfo *relinfo,
ItemPointer tupleid,
- HeapTuple fdw_trigtuple)
+ HeapTuple fdw_trigtuple,
+ TupleTableSlot **epqslot)
{
TriggerDesc *trigdesc = relinfo->ri_TrigDesc;
bool result = true;
@@ -2745,8 +2764,24 @@ ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
{
trigtuple = GetTupleForTrigger(estate, epqstate, relinfo, tupleid,
LockTupleExclusive, &newSlot);
+
+ /* If requested, pass back the concurrently updated tuple, if any. */
+ if (epqslot != NULL)
+ *epqslot = newSlot;
+
if (trigtuple == NULL)
return false;
+
+ /*
+ * If the tuple was concurrently updated and the caller of this
+ * function requested for the updated tuple, skip the trigger
+ * execution.
+ */
+ if (newSlot != NULL && epqslot != NULL)
+ {
+ heap_freetuple(trigtuple);
+ return false;
+ }
}
else
trigtuple = fdw_trigtuple;
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 41e857e..5006baa 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -531,7 +531,7 @@ ExecSimpleRelationDelete(EState *estate, EPQState *epqstate,
{
skip_tuple = !ExecBRDeleteTriggers(estate, epqstate, resultRelInfo,
&searchslot->tts_tuple->t_self,
- NULL);
+ NULL, NULL);
}
if (!skip_tuple)
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 7e0b867..1a1c6ee 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -609,7 +609,9 @@ ExecInsert(ModifyTableState *mtstate,
* foreign table, tupleid is invalid; the FDW has to figure out
* which row to delete using data from the planSlot. oldtuple is
* passed to foreign table triggers; it is NULL when the foreign
- * table has no relevant triggers.
+ * table has no relevant triggers. When this DELETE is a part of
+ * an UPDATE of partition-key, then the slot returned by
+ * EvalPlanQual() is passed back using output parameter epqslot.
*
* Returns RETURNING result if any, otherwise NULL.
* ----------------------------------------------------------------
@@ -622,6 +624,7 @@ ExecDelete(ModifyTableState *mtstate,
EPQState *epqstate,
EState *estate,
bool *tupleDeleted,
+ TupleTableSlot **epqslot,
bool processReturning,
bool canSetTag,
bool changingPart)
@@ -649,7 +652,7 @@ ExecDelete(ModifyTableState *mtstate,
bool dodelete;
dodelete = ExecBRDeleteTriggers(estate, epqstate, resultRelInfo,
- tupleid, oldtuple);
+ tupleid, oldtuple, epqslot);
if (!dodelete) /* "do nothing" */
return NULL;
@@ -769,19 +772,30 @@ ldelete:;
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
- TupleTableSlot *epqslot;
+ TupleTableSlot *my_epqslot;
- epqslot = EvalPlanQual(estate,
+ my_epqslot = EvalPlanQual(estate,
epqstate,
resultRelationDesc,
resultRelInfo->ri_RangeTableIndex,
LockTupleExclusive,
&hufd.ctid,
hufd.xmax);
- if (!TupIsNull(epqslot))
+ if (!TupIsNull(my_epqslot))
{
*tupleid = hufd.ctid;
- goto ldelete;
+
+ /* If requested, pass back the updated row. */
+ if (epqslot)
+ {
+ *epqslot = my_epqslot;
+ return NULL;
+ }
+ else
+ {
+ /* Normal DELETE: So delete the updated row. */
+ goto ldelete;
+ }
}
}
/* tuple already deleted; nothing to do */
@@ -1052,6 +1066,7 @@ lreplace:;
{
bool tuple_deleted;
TupleTableSlot *ret_slot;
+ TupleTableSlot *epqslot = NULL;
PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
int map_index;
TupleConversionMap *tupconv_map;
@@ -1081,7 +1096,7 @@ lreplace:;
* processing. We want to return rows from INSERT.
*/
ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate,
- estate, &tuple_deleted, false,
+ estate, &tuple_deleted, &epqslot, false,
false /* canSetTag */ , true /* changingPart */ );
/*
@@ -1105,7 +1120,23 @@ lreplace:;
* resurrect it.
*/
if (!tuple_deleted)
- return NULL;
+ {
+ /*
+ * epqslot will be typically NULL. But when ExecDelete() finds
+ * that another transaction has concurrently updated the same
+ * row, it re-fetches the row, skips the delete, and epqslot is
+ * set to the re-fetched tuple slot. In that case, we need to
+ * do all the checks again.
+ */
+ if (TupIsNull(epqslot))
+ return NULL;
+ else
+ {
+ slot = ExecFilterJunk(resultRelInfo->ri_junkFilter, epqslot);
+ tuple = ExecMaterializeSlot(slot);
+ goto lreplace;
+ }
+ }
/*
* Updates set the transition capture map only when a new subplan
@@ -2136,7 +2167,7 @@ ExecModifyTable(PlanState *pstate)
case CMD_DELETE:
slot = ExecDelete(node, tupleid, oldtuple, planSlot,
&node->mt_epqstate, estate,
- NULL, true, node->canSetTag,
+ NULL, NULL, true, node->canSetTag,
false /* changingPart */ );
break;
default:
diff --git a/src/include/commands/trigger.h b/src/include/commands/trigger.h
index a5b8610..1031448 100644
--- a/src/include/commands/trigger.h
+++ b/src/include/commands/trigger.h
@@ -206,7 +206,8 @@ extern bool ExecBRDeleteTriggers(EState *estate,
EPQState *epqstate,
ResultRelInfo *relinfo,
ItemPointer tupleid,
- HeapTuple fdw_trigtuple);
+ HeapTuple fdw_trigtuple,
+ TupleTableSlot **epqslot);
extern void ExecARDeleteTriggers(EState *estate,
ResultRelInfo *relinfo,
ItemPointer tupleid,
diff --git a/src/test/isolation/expected/partition-key-update-4.out b/src/test/isolation/expected/partition-key-update-4.out
new file mode 100644
index 0000000..774a7fa
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-4.out
@@ -0,0 +1,60 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1b s2b s2u1 s1u s2c s1c s1s
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2u1: UPDATE foo SET b = b || ' update2' WHERE a = 1;
+step s1u: UPDATE foo SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; <waiting ...>
+step s2c: COMMIT;
+step s1u: <... completed>
+step s1c: COMMIT;
+step s1s: SELECT tableoid::regclass, * FROM foo ORDER BY a;
+tableoid a b
+
+foo2 2 ABC update2 update1
+
+starting permutation: s1b s2b s2ut1 s1ut s2c s1c s1st s1stl
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2ut1: UPDATE footrg SET b = b || ' update2' WHERE a = 1;
+step s1ut: UPDATE footrg SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; <waiting ...>
+step s2c: COMMIT;
+step s1ut: <... completed>
+step s1c: COMMIT;
+step s1st: SELECT tableoid::regclass, * FROM footrg ORDER BY a;
+tableoid a b
+
+footrg2 2 ABC update2 update1
+step s1stl: SELECT * FROM triglog ORDER BY a;
+a b
+
+1 ABC update2 trigger
+
+starting permutation: s1b s2b s2u2 s1u s2c s1c s1s
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2u2: UPDATE foo SET b = 'EFG' WHERE a = 1;
+step s1u: UPDATE foo SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; <waiting ...>
+step s2c: COMMIT;
+step s1u: <... completed>
+step s1c: COMMIT;
+step s1s: SELECT tableoid::regclass, * FROM foo ORDER BY a;
+tableoid a b
+
+foo1 1 EFG
+
+starting permutation: s1b s2b s2ut2 s1ut s2c s1c s1st s1stl
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2ut2: UPDATE footrg SET b = 'EFG' WHERE a = 1;
+step s1ut: UPDATE footrg SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; <waiting ...>
+step s2c: COMMIT;
+step s1ut: <... completed>
+step s1c: COMMIT;
+step s1st: SELECT tableoid::regclass, * FROM footrg ORDER BY a;
+tableoid a b
+
+footrg1 1 EFG
+step s1stl: SELECT * FROM triglog ORDER BY a;
+a b
+
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 0e99721..d5594e8 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -74,4 +74,5 @@ test: predicate-gin-nomatch
test: partition-key-update-1
test: partition-key-update-2
test: partition-key-update-3
+test: partition-key-update-4
test: plpgsql-toast
diff --git a/src/test/isolation/specs/partition-key-update-4.spec b/src/test/isolation/specs/partition-key-update-4.spec
new file mode 100644
index 0000000..1d53a7d
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-4.spec
@@ -0,0 +1,76 @@
+# Test that a row that ends up in a new partition contains changes made by
+# a concurrent transaction.
+
+setup
+{
+ --
+ -- Setup to test concurrent handling of ExecDelete().
+ --
+ CREATE TABLE foo (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'ABC');
+
+ --
+ -- Setup to test concurrent handling of GetTupleForTrigger().
+ --
+ CREATE TABLE footrg (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE triglog as select * from footrg;
+ CREATE TABLE footrg1 PARTITION OF footrg FOR VALUES IN (1);
+ CREATE TABLE footrg2 PARTITION OF footrg FOR VALUES IN (2);
+ INSERT INTO footrg VALUES (1, 'ABC');
+ CREATE FUNCTION func_footrg() RETURNS TRIGGER AS $$
+ BEGIN
+ OLD.b = OLD.b || ' trigger';
+
+ -- This will verify that the trigger is not run *before* the row is
+ -- refetched by EvalPlanQual. The OLD row should contain the changes made
+ -- by the concurrent session.
+ INSERT INTO triglog select OLD.*;
+
+ RETURN OLD;
+ END $$ LANGUAGE PLPGSQL;
+ CREATE TRIGGER footrg_ondel BEFORE DELETE ON footrg1
+ FOR EACH ROW EXECUTE PROCEDURE func_footrg();
+
+}
+
+teardown
+{
+ DROP TABLE foo;
+ DROP TRIGGER footrg_ondel ON footrg1;
+ DROP FUNCTION func_footrg();
+ DROP TABLE footrg;
+ DROP TABLE triglog;
+}
+
+session "s1"
+step "s1b" { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1u" { UPDATE foo SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; }
+step "s1ut" { UPDATE footrg SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; }
+step "s1s" { SELECT tableoid::regclass, * FROM foo ORDER BY a; }
+step "s1st" { SELECT tableoid::regclass, * FROM footrg ORDER BY a; }
+step "s1stl" { SELECT * FROM triglog ORDER BY a; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2b" { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s2u1" { UPDATE foo SET b = b || ' update2' WHERE a = 1; }
+step "s2u2" { UPDATE foo SET b = 'EFG' WHERE a = 1; }
+step "s2ut1" { UPDATE footrg SET b = b || ' update2' WHERE a = 1; }
+step "s2ut2" { UPDATE footrg SET b = 'EFG' WHERE a = 1; }
+step "s2c" { COMMIT; }
+
+
+# Session s1 is moving a row into another partition, but is waiting for
+# another session s2 that is updating the original row. The row that ends up
+# in the new partition should contain the changes made by session s2.
+permutation "s1b" "s2b" "s2u1" "s1u" "s2c" "s1c" "s1s"
+
+# Same as above, except session s1 is waiting in GetTupleForTrigger().
+permutation "s1b" "s2b" "s2ut1" "s1ut" "s2c" "s1c" "s1st" "s1stl"
+
+# Below two cases are similar to the above two; except that the session s1
+# fails EvalPlanQual() test, so partition key update does not happen.
+permutation "s1b" "s2b" "s2u2" "s1u" "s2c" "s1c" "s1s"
+permutation "s1b" "s2b" "s2ut2" "s1ut" "s2c" "s1c" "s1st" "s1stl"
On Fri, Jun 22, 2018 at 10:25 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On 20 June 2018 at 18:54, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Jun 19, 2018 at 9:33 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
2.
ExecBRDeleteTriggers(..)
{
..
+ /* If requested, pass back the concurrently updated tuple, if any. */
+ if (epqslot != NULL)
+ *epqslot = newSlot;
+
if (trigtuple == NULL)
return false;
+
+ /*
+ * If the tuple was concurrently updated and the caller of this
+ * function requested for the updated tuple, skip the trigger
+ * execution.
+ */
+ if (newSlot != NULL && epqslot != NULL)
+ return false;
..
}
Can't we merge the first change into the second, something like:
if (newSlot != NULL && epqslot != NULL)
{
*epqslot = newSlot;
return false;
}
We want to update *epqslot with whatever the value of newSlot is, so
if newSlot is NULL, *epqslot should be NULL. Hence we can't do the
assignment inside the "if (newSlot != NULL && epqslot != NULL)" branch.
Why do you need to update it when newSlot is NULL? *epqslot is already
initialized to NULL in the caller (ExecUpdate). I think we only want
to update it when trigtuple is not null. Apart from looking a bit
awkward, I think it is error-prone as well, because the code of
GetTupleForTrigger is such that it can return NULL with a valid value
of newSlot, in which case the behavior will become undefined. The case
I am worried about is when EvalPlanQual first returns some valid
value of epqslot, but the next execution after heap_lock_tuple
returns NULL. In practice, it won't happen because EvalPlanQual locks
the tuple, so after that heap_lock_tuple shouldn't fail again, but it
seems prudent to not rely on that unless we need to.
For now, I have removed it in the attached patch, but if you have any
valid reason, then we can add it back.
4.
+/* ----------
+ * ExecBRDeleteTriggers()
+ *
+ * Called to execute BEFORE ROW DELETE triggers.
+ *
+ * Returns false in following scenarios :
+ * 1. Trigger function returned NULL.
+ * 2. The tuple was concurrently deleted OR it was concurrently updated and the
+ * new tuple didn't pass EvalPlanQual() test.
+ * 3. The tuple was concurrently updated and the new tuple passed the
+ * EvalPlanQual() test, BUT epqslot parameter is not NULL. In such case, this
+ * function skips the trigger function execution, because the caller has
+ * indicated that it wants to further process the updated tuple. The updated
+ * tuple slot is passed back through epqslot.
+ *
+ * In all other cases, returns true.
+ * ----------
+ */
bool
ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
..
If it looks complicated, could you please suggest something to make it
a bit simpler?
See attached.
Apart from this, I have changed a few comments and fixed indentation
issues. Let me know what you think of the attached.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
fix_concurrency_bug_v5.patch (application/octet-stream)
From 58a1e12f1740680ae810e5a42c8de59d4e348549 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 23 Jun 2018 15:32:53 +0530
Subject: [PATCH] Allow using the updated tuple while moving it to a different
partition.
An update that causes the tuple to be moved to a different partition was
missing out on re-constructing the to-be-updated tuple, based on the latest
tuple in the update chain. Instead, it's simply deleting the latest tuple
and inserting a new tuple in the new partition based on the old tuple.
Commit 2f17844104 didn't consider this case, so some of the updates were
getting lost.
Reported-by: Pavan Deolasee
Author: Amit Khandekar, with minor changes by me
Reviewed-by: Dilip Kumar and Amit Kapila
Discussion: https://postgr.es/m/CAJ3gD9fRbEzDqdeDq1jxqZUb47kJn+tQ7=Bcgjc8quqKsDViKQ@mail.gmail.com
---
src/backend/commands/trigger.c | 22 ++++++-
src/backend/executor/execReplication.c | 2 +-
src/backend/executor/nodeModifyTable.c | 63 +++++++++++++-----
src/include/commands/trigger.h | 3 +-
.../isolation/expected/partition-key-update-4.out | 60 +++++++++++++++++
src/test/isolation/isolation_schedule | 1 +
.../isolation/specs/partition-key-update-4.spec | 76 ++++++++++++++++++++++
7 files changed, 208 insertions(+), 19 deletions(-)
create mode 100644 src/test/isolation/expected/partition-key-update-4.out
create mode 100644 src/test/isolation/specs/partition-key-update-4.spec
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 57519fe..2436692 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2726,11 +2726,19 @@ ExecASDeleteTriggers(EState *estate, ResultRelInfo *relinfo,
false, NULL, NULL, NIL, NULL, transition_capture);
}
+/*
+ * Execute BEFORE ROW DELETE triggers.
+ *
+ * True indicates caller can proceed with the delete. False indicates caller
+ * need to suppress the delete and additionally if requested, we need to pass
+ * back the concurrently updated tuple if any.
+ */
bool
ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
ResultRelInfo *relinfo,
ItemPointer tupleid,
- HeapTuple fdw_trigtuple)
+ HeapTuple fdw_trigtuple,
+ TupleTableSlot **epqslot)
{
TriggerDesc *trigdesc = relinfo->ri_TrigDesc;
bool result = true;
@@ -2747,6 +2755,18 @@ ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
LockTupleExclusive, &newSlot);
if (trigtuple == NULL)
return false;
+
+ /*
+ * If the tuple was concurrently updated and the caller of this
+ * function requested for the updated tuple, skip the trigger
+ * execution.
+ */
+ if (newSlot != NULL && epqslot != NULL)
+ {
+ *epqslot = newSlot;
+ heap_freetuple(trigtuple);
+ return false;
+ }
}
else
trigtuple = fdw_trigtuple;
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 41e857e..5006baa 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -531,7 +531,7 @@ ExecSimpleRelationDelete(EState *estate, EPQState *epqstate,
{
skip_tuple = !ExecBRDeleteTriggers(estate, epqstate, resultRelInfo,
&searchslot->tts_tuple->t_self,
- NULL);
+ NULL, NULL);
}
if (!skip_tuple)
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 7e0b867..58bd0f1 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -609,7 +609,9 @@ ExecInsert(ModifyTableState *mtstate,
* foreign table, tupleid is invalid; the FDW has to figure out
* which row to delete using data from the planSlot. oldtuple is
* passed to foreign table triggers; it is NULL when the foreign
- * table has no relevant triggers.
+ * table has no relevant triggers. When this DELETE is a part of
+ * an UPDATE of partition-key, then the slot returned by
+ * EvalPlanQual() is passed back using output parameter epqslot.
*
* Returns RETURNING result if any, otherwise NULL.
* ----------------------------------------------------------------
@@ -622,6 +624,7 @@ ExecDelete(ModifyTableState *mtstate,
EPQState *epqstate,
EState *estate,
bool *tupleDeleted,
+ TupleTableSlot **epqslot,
bool processReturning,
bool canSetTag,
bool changingPart)
@@ -649,7 +652,7 @@ ExecDelete(ModifyTableState *mtstate,
bool dodelete;
dodelete = ExecBRDeleteTriggers(estate, epqstate, resultRelInfo,
- tupleid, oldtuple);
+ tupleid, oldtuple, epqslot);
if (!dodelete) /* "do nothing" */
return NULL;
@@ -769,19 +772,30 @@ ldelete:;
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
- TupleTableSlot *epqslot;
-
- epqslot = EvalPlanQual(estate,
- epqstate,
- resultRelationDesc,
- resultRelInfo->ri_RangeTableIndex,
- LockTupleExclusive,
- &hufd.ctid,
- hufd.xmax);
- if (!TupIsNull(epqslot))
+ TupleTableSlot *my_epqslot;
+
+ my_epqslot = EvalPlanQual(estate,
+ epqstate,
+ resultRelationDesc,
+ resultRelInfo->ri_RangeTableIndex,
+ LockTupleExclusive,
+ &hufd.ctid,
+ hufd.xmax);
+ if (!TupIsNull(my_epqslot))
{
*tupleid = hufd.ctid;
- goto ldelete;
+
+ /*
+ * If requested, skip delete and pass back the updated
+ * row.
+ */
+ if (epqslot)
+ {
+ *epqslot = my_epqslot;
+ return NULL;
+ }
+ else
+ goto ldelete;
}
}
/* tuple already deleted; nothing to do */
@@ -1052,6 +1066,7 @@ lreplace:;
{
bool tuple_deleted;
TupleTableSlot *ret_slot;
+ TupleTableSlot *epqslot = NULL;
PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
int map_index;
TupleConversionMap *tupconv_map;
@@ -1081,7 +1096,7 @@ lreplace:;
* processing. We want to return rows from INSERT.
*/
ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate,
- estate, &tuple_deleted, false,
+ estate, &tuple_deleted, &epqslot, false,
false /* canSetTag */ , true /* changingPart */ );
/*
@@ -1105,7 +1120,23 @@ lreplace:;
* resurrect it.
*/
if (!tuple_deleted)
- return NULL;
+ {
+ /*
+ * epqslot will be typically NULL. But when ExecDelete()
+ * finds that another transaction has concurrently updated the
+ * same row, it re-fetches the row, skips the delete, and
+ * epqslot is set to the re-fetched tuple slot. In that case,
+ * we need to do all the checks again.
+ */
+ if (TupIsNull(epqslot))
+ return NULL;
+ else
+ {
+ slot = ExecFilterJunk(resultRelInfo->ri_junkFilter, epqslot);
+ tuple = ExecMaterializeSlot(slot);
+ goto lreplace;
+ }
+ }
/*
* Updates set the transition capture map only when a new subplan
@@ -2136,7 +2167,7 @@ ExecModifyTable(PlanState *pstate)
case CMD_DELETE:
slot = ExecDelete(node, tupleid, oldtuple, planSlot,
&node->mt_epqstate, estate,
- NULL, true, node->canSetTag,
+ NULL, NULL, true, node->canSetTag,
false /* changingPart */ );
break;
default:
diff --git a/src/include/commands/trigger.h b/src/include/commands/trigger.h
index a5b8610..1031448 100644
--- a/src/include/commands/trigger.h
+++ b/src/include/commands/trigger.h
@@ -206,7 +206,8 @@ extern bool ExecBRDeleteTriggers(EState *estate,
EPQState *epqstate,
ResultRelInfo *relinfo,
ItemPointer tupleid,
- HeapTuple fdw_trigtuple);
+ HeapTuple fdw_trigtuple,
+ TupleTableSlot **epqslot);
extern void ExecARDeleteTriggers(EState *estate,
ResultRelInfo *relinfo,
ItemPointer tupleid,
diff --git a/src/test/isolation/expected/partition-key-update-4.out b/src/test/isolation/expected/partition-key-update-4.out
new file mode 100644
index 0000000..774a7fa
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-4.out
@@ -0,0 +1,60 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1b s2b s2u1 s1u s2c s1c s1s
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2u1: UPDATE foo SET b = b || ' update2' WHERE a = 1;
+step s1u: UPDATE foo SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; <waiting ...>
+step s2c: COMMIT;
+step s1u: <... completed>
+step s1c: COMMIT;
+step s1s: SELECT tableoid::regclass, * FROM foo ORDER BY a;
+tableoid a b
+
+foo2 2 ABC update2 update1
+
+starting permutation: s1b s2b s2ut1 s1ut s2c s1c s1st s1stl
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2ut1: UPDATE footrg SET b = b || ' update2' WHERE a = 1;
+step s1ut: UPDATE footrg SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; <waiting ...>
+step s2c: COMMIT;
+step s1ut: <... completed>
+step s1c: COMMIT;
+step s1st: SELECT tableoid::regclass, * FROM footrg ORDER BY a;
+tableoid a b
+
+footrg2 2 ABC update2 update1
+step s1stl: SELECT * FROM triglog ORDER BY a;
+a b
+
+1 ABC update2 trigger
+
+starting permutation: s1b s2b s2u2 s1u s2c s1c s1s
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2u2: UPDATE foo SET b = 'EFG' WHERE a = 1;
+step s1u: UPDATE foo SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; <waiting ...>
+step s2c: COMMIT;
+step s1u: <... completed>
+step s1c: COMMIT;
+step s1s: SELECT tableoid::regclass, * FROM foo ORDER BY a;
+tableoid a b
+
+foo1 1 EFG
+
+starting permutation: s1b s2b s2ut2 s1ut s2c s1c s1st s1stl
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2ut2: UPDATE footrg SET b = 'EFG' WHERE a = 1;
+step s1ut: UPDATE footrg SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; <waiting ...>
+step s2c: COMMIT;
+step s1ut: <... completed>
+step s1c: COMMIT;
+step s1st: SELECT tableoid::regclass, * FROM footrg ORDER BY a;
+tableoid a b
+
+footrg1 1 EFG
+step s1stl: SELECT * FROM triglog ORDER BY a;
+a b
+
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 0e99721..d5594e8 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -74,4 +74,5 @@ test: predicate-gin-nomatch
test: partition-key-update-1
test: partition-key-update-2
test: partition-key-update-3
+test: partition-key-update-4
test: plpgsql-toast
diff --git a/src/test/isolation/specs/partition-key-update-4.spec b/src/test/isolation/specs/partition-key-update-4.spec
new file mode 100644
index 0000000..1d53a7d
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-4.spec
@@ -0,0 +1,76 @@
+# Test that a row that ends up in a new partition contains changes made by
+# a concurrent transaction.
+
+setup
+{
+ --
+ -- Setup to test concurrent handling of ExecDelete().
+ --
+ CREATE TABLE foo (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'ABC');
+
+ --
+ -- Setup to test concurrent handling of GetTupleForTrigger().
+ --
+ CREATE TABLE footrg (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE triglog as select * from footrg;
+ CREATE TABLE footrg1 PARTITION OF footrg FOR VALUES IN (1);
+ CREATE TABLE footrg2 PARTITION OF footrg FOR VALUES IN (2);
+ INSERT INTO footrg VALUES (1, 'ABC');
+ CREATE FUNCTION func_footrg() RETURNS TRIGGER AS $$
+ BEGIN
+ OLD.b = OLD.b || ' trigger';
+
+ -- This will verify that the trigger is not run *before* the row is
+ -- refetched by EvalPlanQual. The OLD row should contain the changes made
+ -- by the concurrent session.
+ INSERT INTO triglog select OLD.*;
+
+ RETURN OLD;
+ END $$ LANGUAGE PLPGSQL;
+ CREATE TRIGGER footrg_ondel BEFORE DELETE ON footrg1
+ FOR EACH ROW EXECUTE PROCEDURE func_footrg();
+
+}
+
+teardown
+{
+ DROP TABLE foo;
+ DROP TRIGGER footrg_ondel ON footrg1;
+ DROP FUNCTION func_footrg();
+ DROP TABLE footrg;
+ DROP TABLE triglog;
+}
+
+session "s1"
+step "s1b" { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1u" { UPDATE foo SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; }
+step "s1ut" { UPDATE footrg SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; }
+step "s1s" { SELECT tableoid::regclass, * FROM foo ORDER BY a; }
+step "s1st" { SELECT tableoid::regclass, * FROM footrg ORDER BY a; }
+step "s1stl" { SELECT * FROM triglog ORDER BY a; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2b" { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s2u1" { UPDATE foo SET b = b || ' update2' WHERE a = 1; }
+step "s2u2" { UPDATE foo SET b = 'EFG' WHERE a = 1; }
+step "s2ut1" { UPDATE footrg SET b = b || ' update2' WHERE a = 1; }
+step "s2ut2" { UPDATE footrg SET b = 'EFG' WHERE a = 1; }
+step "s2c" { COMMIT; }
+
+
+# Session s1 is moving a row into another partition, but is waiting for
+# another session s2 that is updating the original row. The row that ends up
+# in the new partition should contain the changes made by session s2.
+permutation "s1b" "s2b" "s2u1" "s1u" "s2c" "s1c" "s1s"
+
+# Same as above, except session s1 is waiting in GetTupleForTrigger().
+permutation "s1b" "s2b" "s2ut1" "s1ut" "s2c" "s1c" "s1st" "s1stl"
+
+# Below two cases are similar to the above two; except that the session s1
+# fails EvalPlanQual() test, so partition key update does not happen.
+permutation "s1b" "s2b" "s2u2" "s1u" "s2c" "s1c" "s1s"
+permutation "s1b" "s2b" "s2ut2" "s1ut" "s2c" "s1c" "s1st" "s1stl"
--
1.8.3.1
Would you wait a little bit before pushing this? I'd like to give this
a read too.
Thanks
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Jun 23, 2018 at 8:25 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Would you wait a little bit before pushing this?
Sure.
I'd like to give this
a read too.
That would be great.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 23 June 2018 at 15:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Jun 22, 2018 at 10:25 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On 20 June 2018 at 18:54, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Jun 19, 2018 at 9:33 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
2.
ExecBRDeleteTriggers(..)
{
..
+ /* If requested, pass back the concurrently updated tuple, if any. */
+ if (epqslot != NULL)
+ *epqslot = newSlot;
+
if (trigtuple == NULL)
return false;
+
+ /*
+ * If the tuple was concurrently updated and the caller of this
+ * function requested for the updated tuple, skip the trigger
+ * execution.
+ */
+ if (newSlot != NULL && epqslot != NULL)
+ return false;
..
}
Can't we merge the first change into the second, something like:
if (newSlot != NULL && epqslot != NULL)
{
*epqslot = newSlot;
return false;
}
We want to update *epqslot with whatever the value of newSlot is, so
if newSlot is NULL, *epqslot should be NULL. Hence we can't do the
assignment inside the "if (newSlot != NULL && epqslot != NULL)" branch.
Why do you need to update it when newSlot is NULL? *epqslot is already
initialized to NULL in the caller (ExecUpdate). I think we only want
to update it when trigtuple is not null.
But GetTupleForTrigger() updates the epqslot to NULL even when it
returns NULL. So to be consistent with it, we want to do the same
thing for ExecBRDeleteTriggers(): update the epqslot even when
GetTupleForTrigger() returns NULL.
I think the reason we are doing "*newSlot = NULL" as the very first
thing in the GetTupleForTrigger() code is so that callers don't have
to initialize newSlot to NULL before calling GetTupleForTrigger. And
so ExecBRUpdateTriggers() does not initialize newSlot with NULL. We
can follow the same approach while calling ExecDelete().
I understand that before calling ExecDelete() epqslot is initialized
to NULL, so it is again not required to do the same inside
GetTupleForTrigger(), but as per my above explanation, it is actually
not necessary to initialize to NULL before calling ExecDelete(). So I
can even remove that initialization.
Apart from looking a bit
awkward, I think it is error-prone as well, because the code of
GetTupleForTrigger is such that it can return NULL with a valid value
of newSlot, in which case the behavior will become undefined. The case
I am worried about is when EvalPlanQual first returns some valid
value of epqslot, but the next execution after heap_lock_tuple
returns NULL. In practice, it won't happen because EvalPlanQual locks
the tuple, so after that heap_lock_tuple shouldn't fail again, but it
seems prudent to not rely on that unless we need to.
Yes, I had given thought to exactly this. I think the intention of
the code is to pass back a NULL epqslot when there is no concurrently
updated tuple. I think the code in GetTupleForTrigger() is well aware
that EvalPlanQual() will never be called twice. The only reason it
goes back to ltrmark is to re-fetch the locked tuple. The comment also
says so:
/*
* EvalPlanQual already locked the tuple, but we
* re-call heap_lock_tuple anyway as an easy way of
* re-fetching the correct tuple. Speed is hardly a
* criterion in this path anyhow.
*/
So newSlot is supposed to be updated only once.
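For reference, a condensed sketch of the control flow in question
(simplified from GetTupleForTrigger() in trigger.c; arguments and
error cases elided):

    *newSlot = NULL;
ltrmark:
    test = heap_lock_tuple(relation, &tuple, ...);
    switch (test)
    {
        case HeapTupleUpdated:
            if (!ItemPointerEquals(&hufd.ctid, &tuple.t_self))
            {
                epqslot = EvalPlanQual(...);   /* locks the new tuple */
                if (!TupIsNull(epqslot))
                {
                    *tid = hufd.ctid;
                    *newSlot = epqslot;        /* set at most once */
                    goto ltrmark;              /* re-fetch; the lock now
                                                * succeeds, so this path
                                                * is not taken again */
                }
            }
            return NULL;                       /* deleted, or EPQ failed */

        case HeapTupleMayBeUpdated:
            break;                             /* got the lock */
        ...
    }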
For now, I have removed it in the attached patch, but if you have any
valid reason, then we can add it back.
4.
+/* ----------
+ * ExecBRDeleteTriggers()
+ *
+ * Called to execute BEFORE ROW DELETE triggers.
+ *
+ * Returns false in following scenarios :
+ * 1. Trigger function returned NULL.
+ * 2. The tuple was concurrently deleted OR it was concurrently updated and the
+ * new tuple didn't pass EvalPlanQual() test.
+ * 3. The tuple was concurrently updated and the new tuple passed the
+ * EvalPlanQual() test, BUT epqslot parameter is not NULL. In such case, this
+ * function skips the trigger function execution, because the caller has
+ * indicated that it wants to further process the updated tuple. The updated
+ * tuple slot is passed back through epqslot.
+ *
+ * In all other cases, returns true.
+ * ----------
+ */
bool
ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
..
If it looks complicated, could you please suggest something to make it
a bit simpler?
See attached.
Apart from this, I have changed a few comments and fixed indentation
issues. Let me know what you think of the attached.
Yes, the changes look good, except for the one point above on which we
haven't concluded yet.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On Mon, Jun 25, 2018 at 3:06 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On 23 June 2018 at 15:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
Why do you need to update it when newSlot is NULL? *epqslot is already
initialized to NULL in the caller (ExecUpdate). I think we only want
to update it when trigtuple is not null.
But GetTupleForTrigger() updates the epqslot to NULL even when it
returns NULL. So to be consistent with it, we want to do the same
thing for ExecBRDeleteTriggers(): update the epqslot even when
GetTupleForTrigger() returns NULL.
I think the reason we are doing "*newSlot = NULL" as the very first
thing in the GetTupleForTrigger() code is so that callers don't have
to initialize newSlot to NULL before calling GetTupleForTrigger. And
so ExecBRUpdateTriggers() does not initialize newSlot with NULL. We
can follow the same approach while calling ExecDelete().
Yeah, we can do that if it is required. I see your point, but I feel
that it makes the code a bit less readable.
I understand that before calling ExecDelete() epqslot is initialized
to NULL, so it is again not required to do the same inside
GetTupleForTrigger(), but as per my above explanation, it is actually
not necessary to initialize to NULL before calling ExecDelete(). So I
can even remove that initialization.
If you remove that initialization, then won't you need to do something
in ExecDelete() as well, because there the patch assigns a value to
epqslot only if EvalPlanQual returns a non-null value?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 25 June 2018 at 17:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Jun 25, 2018 at 3:06 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On 23 June 2018 at 15:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
Why do you need to update it when newSlot is NULL? *epqslot is already
initialized to NULL in the caller (ExecUpdate). I think we only want
to update it when trigtuple is not null.
But GetTupleForTrigger() updates the epqslot to NULL even when it
returns NULL. So to be consistent with it, we want to do the same
thing for ExecBRDeleteTriggers(): update the epqslot even when
GetTupleForTrigger() returns NULL.
I think the reason we are doing "*newSlot = NULL" as the very first
thing in the GetTupleForTrigger() code is so that callers don't have
to initialize newSlot to NULL before calling GetTupleForTrigger. And
so ExecBRUpdateTriggers() does not initialize newSlot with NULL. We
can follow the same approach while calling ExecDelete().
Yeah, we can do that if it is required.
It is not required as such, and there is no correctness issue.
I see your point, but I feel that it makes the code a bit less readable.
I did it that way only to be consistent with the existing trigger.c
code, namely ExecBRUpdateTriggers(), which sets newSlot to NULL
immediately. If you find some appropriate comments to make that
snippet more readable, that can work.
Or else, we can go ahead with the way you did it in your patch, and
additionally mention in the ExecBRDeleteTriggers() header comments
that the epqslot value is undefined if there was no concurrently
updated tuple. To me, explaining this last part seems confusing.
I understand that before calling ExecDelete() epqslot is initialized
to NULL, so it is again not required to do the same inside
GetTupleForTrigger(), but as per my above explanation, it is actually
not necessary to initialize to NULL before calling ExecDelete(). So I
can even remove that initialization.
If you remove that initialization, then won't you need to do something
in ExecDelete() as well, because there the patch assigns a value to
epqslot only if EvalPlanQual returns a non-null value?
Ah, right. So skipping the initialization before calling ExecDelete()
is not a good idea then.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On Tue, Jun 26, 2018 at 11:40 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On 25 June 2018 at 17:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Jun 25, 2018 at 3:06 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On 23 June 2018 at 15:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
Why do you need to update when newslot is NULL? *epqslot is already
initialized as NULL in the caller (ExecUpdate). I think we only want
to update it when trigtuple is not null.
But GetTupleForTrigger() updates the epqslot to NULL even when it
returns NULL. So to be consistent with it, we want to do the same
thing for ExecBRDeleteTriggers(): update the epqslot even when
GetTupleForTrigger() returns NULL.
I think the reason we are doing "*newSlot = NULL" as the very first
thing in the GetTupleForTrigger() code, is so that callers don't have
to initialise newSlot to NULL before calling GetTupleForTrigger. And
so ExecBRUpdateTriggers() does not initialize newSlot with NULL. We
can follow the same approach while calling ExecDelete().
Yeah, we can do that if it is required.
It is not required as such, and there is no correctness issue.
I see your point, but I feel that makes the code a bit less readable.
I did it that way only to be consistent with the existing trigger.c
code, namely ExecBRUpdateTriggers() where it sets newSlot to NULL
immediately. If you find some appropriate comments to make that
snippet more readable, that can work.
Or else, we can go ahead with the way you did it in your patch, and
additionally mention in the ExecBRDeleteTriggers() header comments
that the epqslot value is undefined if there was no concurrently
updated tuple. To me, explaining this last part seems confusing.
Yeah, so let's leave it as it is in the patch. I think now we should
wait and see what Alvaro has to say about the overall patch.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 26 June 2018 at 17:53, Amit Kapila <amit.kapila16@gmail.com> wrote:
Yeah, so let's leave it as it is in the patch.
Ok.
I think now we should wait and see what Alvaro has to say about the overall patch.
Yeah, it's very good that Alvaro is also having a look at this.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
I was a bit surprised by the new epqslot output argument being added,
and now I think I know why: we already have es_trig_tuple_slot, so
shouldn't we be using that here instead? Seems like it'd all be simpler ...
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jun 29, 2018 at 11:22 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
I was a bit surprised by the new epqslot output argument being added,
and now I think I know why: we already have es_trig_tuple_slot, so
shouldn't we be using that here instead? Seems like it'd all be simpler ...
Hmm, maybe, but I am not sure it will be simpler. The point is that we
don't need to return the epqslot in every case; it is only returned for
the special case, so you might need an additional boolean variable to
indicate when to fill the epqslot, or some other way to indicate the
same via es_trig_tuple_slot. Even if we somehow do that, we would need
to do something additional, like taking the tuple from epqslot and
storing it in es_trig_tuple_slot, as I am not sure we can directly
assign the slot returned by EvalPlanQual() to es_trig_tuple_slot.
OTOH, the approach used by Amit Khandekar seems somewhat better, as
you can directly return the slot returned by EvalPlanQual() in the
output parameter. IIUC, the same technique is already used by
GetTupleForTrigger(), where it returns the epqslot in an additional
parameter.
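A rough sketch of the contrast (stub types only; the es_epq_tuple_valid
flag and copy_into_slot() are hypothetical stand-ins for the extra
machinery the second option would need):

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct TupleTableSlot TupleTableSlot;   /* stub */
    typedef struct EState
    {
        TupleTableSlot *es_trig_tuple_slot; /* pre-allocated at init time */
        bool            es_epq_tuple_valid; /* hypothetical extra flag */
    } EState;

    /* Option 1 (the patch): hand the EvalPlanQual() slot straight back. */
    static void
    via_out_param(TupleTableSlot **epqslot, TupleTableSlot *my_epqslot)
    {
        *epqslot = my_epqslot;      /* no copy, no extra executor state */
    }

    /*
     * Option 2 (reusing es_trig_tuple_slot): needs a validity flag plus
     * a tuple copy into the pre-allocated slot, since the EPQ slot
     * cannot simply replace the one allocated in ExecInitModifyTable().
     */
    static void
    via_trig_slot(EState *estate, TupleTableSlot *my_epqslot)
    {
        /* copy_into_slot(estate->es_trig_tuple_slot, my_epqslot); */
        (void) my_epqslot;
        estate->es_epq_tuple_valid = true;
    }

    int
    main(void)
    {
        EState          estate = {NULL, false};
        TupleTableSlot *epqslot = NULL;

        via_out_param(&epqslot, NULL);
        via_trig_slot(&estate, NULL);
        return estate.es_epq_tuple_valid ? 0 : 1;
    }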
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 30 June 2018 at 19:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Jun 29, 2018 at 11:22 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
I was a bit surprised by the new epqslot output argument being added,
and now I think I know why: we already have es_trig_tuple_slot, so
shouldn't we be using that here instead? Seems like it'd all be simpler ...
es_trig_tuple_slot is already allocated in ExecInitModifyTable(), and
the slot returned by EvalPlanQual() is a separately allocated tuple
slot. I didn't get how we can assign the epqslot to
estate->es_trig_tuple_slot; that would mean throwing away the already
allocated es_trig_tuple_slot. I believe es_trig_tuple_slot is not
meant to have already-allocated slots assigned to it.
Hmm, maybe, but I am not sure it will be simpler. The point is that we
don't need to return the epqslot in every case; it is only returned for
the special case, so you might need an additional boolean variable to
indicate when to fill the epqslot, or some other way to indicate the
same via es_trig_tuple_slot. Even if we somehow do that, we would need
to do something additional, like taking the tuple from epqslot and
storing it in es_trig_tuple_slot, as I am not sure we can directly
assign the slot returned by EvalPlanQual() to es_trig_tuple_slot.
Right, I think making use of es_trig_tuple_slot will cause redundancy
in our case, because epqslot is a separately allocated slot; so it
makes sense to pass it back separately.
OTOH, the approach used by Amit Khandekar seems somewhat better, as
you can directly return the slot returned by EvalPlanQual() in the
output parameter. IIUC, the same technique is already used by
GetTupleForTrigger(), where it returns the epqslot in an additional
parameter.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On Mon, Jul 2, 2018 at 5:06 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On 30 June 2018 at 19:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Jun 29, 2018 at 11:22 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
I was a bit surprised by the new epqslot output argument being added,
and now I think I know why: we already have es_trig_tuple_slot, so
shouldn't we be using that here instead? Seems like it'd all be simpler ...
es_trig_tuple_slot is already allocated in ExecInitModifyTable(), and
the slot returned by EvalPlanQual() is a separately allocated tuple
slot. I didn't get how we can assign the epqslot to
estate->es_trig_tuple_slot; that would mean throwing away the already
allocated es_trig_tuple_slot. I believe es_trig_tuple_slot is not
meant to have already-allocated slots assigned to it.
Hmm, maybe, but I am not sure it will be simpler. The point is that we
don't need to return the epqslot in every case; it is only returned for
the special case, so you might need an additional boolean variable to
indicate when to fill the epqslot, or some other way to indicate the
same via es_trig_tuple_slot. Even if we somehow do that, we would need
to do something additional, like taking the tuple from epqslot and
storing it in es_trig_tuple_slot, as I am not sure we can directly
assign the slot returned by EvalPlanQual() to es_trig_tuple_slot.
Right, I think making use of es_trig_tuple_slot will cause redundancy
in our case, because epqslot is a separately allocated slot; so it
makes sense to pass it back separately.
Alvaro,
Can you please comment on whether this addresses your concern?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2018-Jul-09, Amit Kapila wrote:
Alvaro,
Can you please comment on whether this addresses your concern?
I was thinking that it would be a matter of passing the tuple slot to
EvalPlanQual for it to fill, rather than requiring it to fill some other
slot obtained from who-knows-where, but I realize now that that's nigh
impossible. Thanks for the explanation and patience.
What bothers me about this whole business is how ExecBRDeleteTriggers
and ExecDelete are now completely different from their sibling routines,
but maybe there's no helping that.
Please move the output arguments to the end of the argument lists;
also, it would be great if you added commentary about ExecDelete's
other undocumented arguments (tupleDeleted in particular) while you're
in the vicinity.
Thanks
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jul 11, 2018 at 5:59 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
On 2018-Jul-09, Amit Kapila wrote:
Alvaro,
Can you please comment on whether this addresses your concern?
I was thinking that it would be a matter of passing the tuple slot to
EvalPlanQual for it to fill, rather than requiring it to fill some other
slot obtained from who-knows-where, but I realize now that that's nigh
impossible.
Right, giving EvalPlanQual() the responsibility to fill a
caller-supplied output slot could easily turn into a lot of work
without much benefit.
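For context, here is roughly the interface as it stands (parameter
list per the call site touched by the patch; a sketch, not a quote of
the header): EvalPlanQual() builds and returns a slot of its own, with
no caller-supplied output slot to fill.

    extern TupleTableSlot *EvalPlanQual(EState *estate,
                                        EPQState *epqstate,
                                        Relation relation,
                                        Index rti,
                                        int lockmode,   /* a LockTupleMode,
                                                         * e.g. LockTupleExclusive */
                                        ItemPointer tid,
                                        TransactionId priorXmax);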
Thanks for the explanation and patience.
What bothers me about this whole business is how ExecBRDeleteTriggers
and ExecDelete are now completely different from their sibling routines,
but maybe there's no helping that.
Yeah, I think the differences started appearing when we decided to
overload ExecDelete() for the partition-key-update case; however, the
alternative would have been to write a new equivalent function, which
would create a lot of code duplication.
Please move the output arguments at the end of argument lists;
Makes sense.
also, it
would be great if you add commentary about ExecDelete other undocumented
arguments (tupleDeleted in particular) while you're in the vicinity.
We already have some commentary in the caller of ExecDelete ("For some
reason if DELETE didn't happen ..."), but I think it will be clearer
if we add some comments atop the ExecDelete function. I will send the
updated patch shortly.
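For the record, after that reordering the signature ends up roughly as
below, output parameters last (this matches the updated patch attached
downthread):

    static TupleTableSlot *ExecDelete(ModifyTableState *mtstate,
                                      ItemPointer tupleid,
                                      HeapTuple oldtuple,
                                      TupleTableSlot *planSlot,
                                      EPQState *epqstate,
                                      EState *estate,
                                      bool processReturning,
                                      bool canSetTag,
                                      bool changingPart,
                                      bool *tupleDeleted,        /* out: row deleted? */
                                      TupleTableSlot **epqslot); /* out: EPQ-refetched
                                                                  * row, if any */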
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jul 11, 2018 at 8:56 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Jul 11, 2018 at 5:59 AM, Alvaro Herrera
Please move the output arguments at the end of argument lists;
Makes sense.
also, it
would be great if you added commentary about ExecDelete's other undocumented
arguments (tupleDeleted in particular) while you're in the vicinity.
We already have some commentary in the caller of ExecDelete ("For some
reason if DELETE didn't happen ..."), but I think it will be clearer
if we add some comments atop the ExecDelete function. I will send the
updated patch shortly.
Attached, please find an updated patch based on comments by Alvaro.
See if this looks okay to you guys.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
fix_concurrency_bug_v6.patch (application/octet-stream)
From 969de94ba5b515a50e62ecf11e2bba522768079d Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Wed, 11 Jul 2018 09:25:31 +0530
Subject: [PATCH] Allow using the updated tuple while moving it to a different
partition.
An update that causes the tuple to be moved to a different partition was
missing out on re-constructing the to-be-updated tuple, based on the latest
tuple in the update chain. Instead, it's simply deleting the latest tuple
and inserting a new tuple in the new partition based on the old tuple.
Commit 2f17844104 didn't consider this case, so some of the updates were
getting lost.
In passing, change the argument order for output parameter in ExecDelete
and add some commentary about it.
Reported-by: Pavan Deolasee
Author: Amit Khandekar, with minor changes by me
Reviewed-by: Dilip Kumar, Amit Kapila and Alvaro Herrera
Backpatch-through: 11
Discussion: https://postgr.es/m/CAJ3gD9fRbEzDqdeDq1jxqZUb47kJn+tQ7=Bcgjc8quqKsDViKQ@mail.gmail.com
---
src/backend/commands/trigger.c | 22 ++++++-
src/backend/executor/execReplication.c | 2 +-
src/backend/executor/nodeModifyTable.c | 73 +++++++++++++++------
src/include/commands/trigger.h | 3 +-
.../isolation/expected/partition-key-update-4.out | 60 +++++++++++++++++
src/test/isolation/isolation_schedule | 1 +
.../isolation/specs/partition-key-update-4.spec | 76 ++++++++++++++++++++++
7 files changed, 214 insertions(+), 23 deletions(-)
create mode 100644 src/test/isolation/expected/partition-key-update-4.out
create mode 100644 src/test/isolation/specs/partition-key-update-4.spec
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 57519fe..2436692 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2726,11 +2726,19 @@ ExecASDeleteTriggers(EState *estate, ResultRelInfo *relinfo,
false, NULL, NULL, NIL, NULL, transition_capture);
}
+/*
+ * Execute BEFORE ROW DELETE triggers.
+ *
+ * True indicates caller can proceed with the delete. False indicates caller
+ * need to suppress the delete and additionally if requested, we need to pass
+ * back the concurrently updated tuple if any.
+ */
bool
ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
ResultRelInfo *relinfo,
ItemPointer tupleid,
- HeapTuple fdw_trigtuple)
+ HeapTuple fdw_trigtuple,
+ TupleTableSlot **epqslot)
{
TriggerDesc *trigdesc = relinfo->ri_TrigDesc;
bool result = true;
@@ -2747,6 +2755,18 @@ ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
LockTupleExclusive, &newSlot);
if (trigtuple == NULL)
return false;
+
+ /*
+ * If the tuple was concurrently updated and the caller of this
+ * function requested for the updated tuple, skip the trigger
+ * execution.
+ */
+ if (newSlot != NULL && epqslot != NULL)
+ {
+ *epqslot = newSlot;
+ heap_freetuple(trigtuple);
+ return false;
+ }
}
else
trigtuple = fdw_trigtuple;
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index fad6df0..9a7dedf 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -531,7 +531,7 @@ ExecSimpleRelationDelete(EState *estate, EPQState *epqstate,
{
skip_tuple = !ExecBRDeleteTriggers(estate, epqstate, resultRelInfo,
&searchslot->tts_tuple->t_self,
- NULL);
+ NULL, NULL);
}
if (!skip_tuple)
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 7e0b867..779b3d4 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -609,7 +609,11 @@ ExecInsert(ModifyTableState *mtstate,
* foreign table, tupleid is invalid; the FDW has to figure out
* which row to delete using data from the planSlot. oldtuple is
* passed to foreign table triggers; it is NULL when the foreign
- * table has no relevant triggers.
+ * table has no relevant triggers. We use tupleDeleted to indicate
+ * whether the tuple is actually deleted, callers can use it to
+ * decide whether to continue the operation. When this DELETE is a
+ * part of an UPDATE of partition-key, then the slot returned by
+ * EvalPlanQual() is passed back using output parameter epqslot.
*
* Returns RETURNING result if any, otherwise NULL.
* ----------------------------------------------------------------
@@ -621,10 +625,11 @@ ExecDelete(ModifyTableState *mtstate,
TupleTableSlot *planSlot,
EPQState *epqstate,
EState *estate,
- bool *tupleDeleted,
bool processReturning,
bool canSetTag,
- bool changingPart)
+ bool changingPart,
+ bool *tupleDeleted,
+ TupleTableSlot **epqslot)
{
ResultRelInfo *resultRelInfo;
Relation resultRelationDesc;
@@ -649,7 +654,7 @@ ExecDelete(ModifyTableState *mtstate,
bool dodelete;
dodelete = ExecBRDeleteTriggers(estate, epqstate, resultRelInfo,
- tupleid, oldtuple);
+ tupleid, oldtuple, epqslot);
if (!dodelete) /* "do nothing" */
return NULL;
@@ -769,19 +774,30 @@ ldelete:;
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
- TupleTableSlot *epqslot;
-
- epqslot = EvalPlanQual(estate,
- epqstate,
- resultRelationDesc,
- resultRelInfo->ri_RangeTableIndex,
- LockTupleExclusive,
- &hufd.ctid,
- hufd.xmax);
- if (!TupIsNull(epqslot))
+ TupleTableSlot *my_epqslot;
+
+ my_epqslot = EvalPlanQual(estate,
+ epqstate,
+ resultRelationDesc,
+ resultRelInfo->ri_RangeTableIndex,
+ LockTupleExclusive,
+ &hufd.ctid,
+ hufd.xmax);
+ if (!TupIsNull(my_epqslot))
{
*tupleid = hufd.ctid;
- goto ldelete;
+
+ /*
+ * If requested, skip delete and pass back the updated
+ * row.
+ */
+ if (epqslot)
+ {
+ *epqslot = my_epqslot;
+ return NULL;
+ }
+ else
+ goto ldelete;
}
}
/* tuple already deleted; nothing to do */
@@ -1052,6 +1068,7 @@ lreplace:;
{
bool tuple_deleted;
TupleTableSlot *ret_slot;
+ TupleTableSlot *epqslot = NULL;
PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
int map_index;
TupleConversionMap *tupconv_map;
@@ -1081,8 +1098,8 @@ lreplace:;
* processing. We want to return rows from INSERT.
*/
ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate,
- estate, &tuple_deleted, false,
- false /* canSetTag */ , true /* changingPart */ );
+ estate, false, false /* canSetTag */ ,
+ true /* changingPart */ , &tuple_deleted, &epqslot);
/*
* For some reason if DELETE didn't happen (e.g. trigger prevented
@@ -1105,7 +1122,23 @@ lreplace:;
* resurrect it.
*/
if (!tuple_deleted)
- return NULL;
+ {
+ /*
+ * epqslot will be typically NULL. But when ExecDelete()
+ * finds that another transaction has concurrently updated the
+ * same row, it re-fetches the row, skips the delete, and
+ * epqslot is set to the re-fetched tuple slot. In that case,
+ * we need to do all the checks again.
+ */
+ if (TupIsNull(epqslot))
+ return NULL;
+ else
+ {
+ slot = ExecFilterJunk(resultRelInfo->ri_junkFilter, epqslot);
+ tuple = ExecMaterializeSlot(slot);
+ goto lreplace;
+ }
+ }
/*
* Updates set the transition capture map only when a new subplan
@@ -2136,8 +2169,8 @@ ExecModifyTable(PlanState *pstate)
case CMD_DELETE:
slot = ExecDelete(node, tupleid, oldtuple, planSlot,
&node->mt_epqstate, estate,
- NULL, true, node->canSetTag,
- false /* changingPart */ );
+ true, node->canSetTag,
+ false /* changingPart */ , NULL, NULL);
break;
default:
elog(ERROR, "unknown operation");
diff --git a/src/include/commands/trigger.h b/src/include/commands/trigger.h
index a5b8610..1031448 100644
--- a/src/include/commands/trigger.h
+++ b/src/include/commands/trigger.h
@@ -206,7 +206,8 @@ extern bool ExecBRDeleteTriggers(EState *estate,
EPQState *epqstate,
ResultRelInfo *relinfo,
ItemPointer tupleid,
- HeapTuple fdw_trigtuple);
+ HeapTuple fdw_trigtuple,
+ TupleTableSlot **epqslot);
extern void ExecARDeleteTriggers(EState *estate,
ResultRelInfo *relinfo,
ItemPointer tupleid,
diff --git a/src/test/isolation/expected/partition-key-update-4.out b/src/test/isolation/expected/partition-key-update-4.out
new file mode 100644
index 0000000..774a7fa
--- /dev/null
+++ b/src/test/isolation/expected/partition-key-update-4.out
@@ -0,0 +1,60 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1b s2b s2u1 s1u s2c s1c s1s
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2u1: UPDATE foo SET b = b || ' update2' WHERE a = 1;
+step s1u: UPDATE foo SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; <waiting ...>
+step s2c: COMMIT;
+step s1u: <... completed>
+step s1c: COMMIT;
+step s1s: SELECT tableoid::regclass, * FROM foo ORDER BY a;
+tableoid a b
+
+foo2 2 ABC update2 update1
+
+starting permutation: s1b s2b s2ut1 s1ut s2c s1c s1st s1stl
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2ut1: UPDATE footrg SET b = b || ' update2' WHERE a = 1;
+step s1ut: UPDATE footrg SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; <waiting ...>
+step s2c: COMMIT;
+step s1ut: <... completed>
+step s1c: COMMIT;
+step s1st: SELECT tableoid::regclass, * FROM footrg ORDER BY a;
+tableoid a b
+
+footrg2 2 ABC update2 update1
+step s1stl: SELECT * FROM triglog ORDER BY a;
+a b
+
+1 ABC update2 trigger
+
+starting permutation: s1b s2b s2u2 s1u s2c s1c s1s
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2u2: UPDATE foo SET b = 'EFG' WHERE a = 1;
+step s1u: UPDATE foo SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; <waiting ...>
+step s2c: COMMIT;
+step s1u: <... completed>
+step s1c: COMMIT;
+step s1s: SELECT tableoid::regclass, * FROM foo ORDER BY a;
+tableoid a b
+
+foo1 1 EFG
+
+starting permutation: s1b s2b s2ut2 s1ut s2c s1c s1st s1stl
+step s1b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2b: BEGIN ISOLATION LEVEL READ COMMITTED;
+step s2ut2: UPDATE footrg SET b = 'EFG' WHERE a = 1;
+step s1ut: UPDATE footrg SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; <waiting ...>
+step s2c: COMMIT;
+step s1ut: <... completed>
+step s1c: COMMIT;
+step s1st: SELECT tableoid::regclass, * FROM footrg ORDER BY a;
+tableoid a b
+
+footrg1 1 EFG
+step s1stl: SELECT * FROM triglog ORDER BY a;
+a b
+
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 0e99721..d5594e8 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -74,4 +74,5 @@ test: predicate-gin-nomatch
test: partition-key-update-1
test: partition-key-update-2
test: partition-key-update-3
+test: partition-key-update-4
test: plpgsql-toast
diff --git a/src/test/isolation/specs/partition-key-update-4.spec b/src/test/isolation/specs/partition-key-update-4.spec
new file mode 100644
index 0000000..1d53a7d
--- /dev/null
+++ b/src/test/isolation/specs/partition-key-update-4.spec
@@ -0,0 +1,76 @@
+# Test that a row that ends up in a new partition contains changes made by
+# a concurrent transaction.
+
+setup
+{
+ --
+ -- Setup to test concurrent handling of ExecDelete().
+ --
+ CREATE TABLE foo (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
+ CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
+ INSERT INTO foo VALUES (1, 'ABC');
+
+ --
+ -- Setup to test concurrent handling of GetTupleForTrigger().
+ --
+ CREATE TABLE footrg (a int, b text) PARTITION BY LIST(a);
+ CREATE TABLE triglog as select * from footrg;
+ CREATE TABLE footrg1 PARTITION OF footrg FOR VALUES IN (1);
+ CREATE TABLE footrg2 PARTITION OF footrg FOR VALUES IN (2);
+ INSERT INTO footrg VALUES (1, 'ABC');
+ CREATE FUNCTION func_footrg() RETURNS TRIGGER AS $$
+ BEGIN
+ OLD.b = OLD.b || ' trigger';
+
+ -- This will verify that the trigger is not run *before* the row is
+ -- refetched by EvalPlanQual. The OLD row should contain the changes made
+ -- by the concurrent session.
+ INSERT INTO triglog select OLD.*;
+
+ RETURN OLD;
+ END $$ LANGUAGE PLPGSQL;
+ CREATE TRIGGER footrg_ondel BEFORE DELETE ON footrg1
+ FOR EACH ROW EXECUTE PROCEDURE func_footrg();
+
+}
+
+teardown
+{
+ DROP TABLE foo;
+ DROP TRIGGER footrg_ondel ON footrg1;
+ DROP FUNCTION func_footrg();
+ DROP TABLE footrg;
+ DROP TABLE triglog;
+}
+
+session "s1"
+step "s1b" { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s1u" { UPDATE foo SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; }
+step "s1ut" { UPDATE footrg SET a = a + 1, b = b || ' update1' WHERE b like '%ABC%'; }
+step "s1s" { SELECT tableoid::regclass, * FROM foo ORDER BY a; }
+step "s1st" { SELECT tableoid::regclass, * FROM footrg ORDER BY a; }
+step "s1stl" { SELECT * FROM triglog ORDER BY a; }
+step "s1c" { COMMIT; }
+
+session "s2"
+step "s2b" { BEGIN ISOLATION LEVEL READ COMMITTED; }
+step "s2u1" { UPDATE foo SET b = b || ' update2' WHERE a = 1; }
+step "s2u2" { UPDATE foo SET b = 'EFG' WHERE a = 1; }
+step "s2ut1" { UPDATE footrg SET b = b || ' update2' WHERE a = 1; }
+step "s2ut2" { UPDATE footrg SET b = 'EFG' WHERE a = 1; }
+step "s2c" { COMMIT; }
+
+
+# Session s1 is moving a row into another partition, but is waiting for
+# another session s2 that is updating the original row. The row that ends up
+# in the new partition should contain the changes made by session s2.
+permutation "s1b" "s2b" "s2u1" "s1u" "s2c" "s1c" "s1s"
+
+# Same as above, except, session s1 is waiting in GetTupleTrigger().
+permutation "s1b" "s2b" "s2ut1" "s1ut" "s2c" "s1c" "s1st" "s1stl"
+
+# Below two cases are similar to the above two; except that the session s1
+# fails EvalPlanQual() test, so partition key update does not happen.
+permutation "s1b" "s2b" "s2u2" "s1u" "s2c" "s1c" "s1s"
+permutation "s1b" "s2b" "s2ut2" "s1ut" "s2c" "s1c" "s1st" "s1stl"
--
1.8.3.1
On 11 July 2018 at 09:48, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Jul 11, 2018 at 8:56 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Jul 11, 2018 at 5:59 AM, Alvaro Herrera
Please move the output arguments at the end of argument lists;
Makes sense.
also, it
would be great if you added commentary about ExecDelete's other undocumented
arguments (tupleDeleted in particular) while you're in the vicinity.
We already have some commentary in the caller of ExecDelete ("For some
reason if DELETE didn't happen ..."), but I think it will be clearer
if we add some comments atop the ExecDelete function. I will send the
updated patch shortly.
Attached, please find an updated patch based on comments by Alvaro.
See if this looks okay to you guys.
Thanks for the patch. It looks good to me.
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On 2018-Jul-11, Amit Kapila wrote:
Attached, please find an updated patch based on comments by Alvaro.
See if this looks okay to you guys.
LGTM as far as my previous comments are concerned.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-07-11 09:16:33 -0400, Alvaro Herrera wrote:
On 2018-Jul-11, Amit Kapila wrote:
Attached, please find an updated patch based on comments by Alvaro.
See if this looks okay to you guys.
LGTM as far as my previous comments are concerned.
I see Amit pushed a patch here yesterday:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=40ca70ebcc9d0bc1c02937b27c44b2766617e790
Is there a need to keep the open item open, or is this the complete fix?
Greetings,
Andres Freund
On Thu, Jul 12, 2018 at 10:14 PM, Andres Freund <andres@anarazel.de> wrote:
On 2018-07-11 09:16:33 -0400, Alvaro Herrera wrote:
On 2018-Jul-11, Amit Kapila wrote:
Attached, please find an updated patch based on comments by Alvaro.
See if this looks okay to you guys.
LGTM as far as my previous comments are concerned.
I see Amit pushed a patch here yesterday:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=40ca70ebcc9d0bc1c02937b27c44b2766617e790
Is there a need to keep the open item open,
No, I have moved the item off the open issues list. I was waiting for
the buildfarm; yesterday it showed a failure after this commit, but
that turned out to be an unrelated random failure.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com