default partition and concurrent attach partition

Started by Amit Langote over 5 years ago · 16 messages
#1 Amit Langote
amitlangote09@gmail.com
1 attachment(s)

Hi,

Starting a new thread to discuss a bug related to $subject that Hao Wu
reported on thread titled "ALTER TABLE .. DETACH PARTITION
CONCURRENTLY" [1]/messages/by-id/DM5PR0501MB3910E97A9EDFB4C775CF3D75A42F0@DM5PR0501MB3910.namprd05.prod.outlook.com. I have been able to reproduce the bug using steps
that Hao gave in that email:

create table tpart (i int, j int) partition by range(i);
create table tpart_1 (like tpart);
create table tpart_2 (like tpart);
create table tpart_default (like tpart);
alter table tpart attach partition tpart_1 for values from (0) to (100);
alter table tpart attach partition tpart_default default;
insert into tpart_2 values (110,110), (120,120), (150,150);

Session 1:

begin;
alter table tpart attach partition tpart_2 for values from (100) to (200);

Session 2:

begin;
insert into tpart values (110,110), (120,120), (150,150);
<blocks waiting for the concurrent attach to finish>

Session 1:

end;

Session 2:

select tableoid::regclass, * from tpart;
end;

The select will show that the rows inserted by session 2 ended up in
tpart_default, even though, once tpart_2 has been successfully
attached, they do not actually belong there.
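
For illustration, the select in session 2 comes back looking roughly
like this (a sketch of the output, not a verbatim capture):

 tableoid      |  i  |  j
---------------+-----+-----
 tpart_default | 110 | 110
 tpart_default | 120 | 120
 tpart_default | 150 | 150
(3 rows)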

The problem is that when session 2 inserts those rows into tpart, it
only knows about two partitions, tpart_1 and tpart_default, of which
it selects tpart_default to insert the rows into. When tpart_default
is locked to perform the insert, session 2 waits for session 1 to
release the lock taken on tpart_default during the attach command.
Once unblocked, it proceeds to finish the insert without rechecking
the partition constraint, which has been updated as a result of a new
partition having been added to the parent.

Note that we don't normally check the partition constraint when
inserting a row into a partition if the insert occurs via tuple
routing, which makes sense for non-default partitions, whose partition
constraint cannot change due to concurrent activity. But this test
case shows that the assumption is not safe for a default partition,
whose constraint is a function of the other partitions that exist at
the time of the insert.
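
To make that concrete, the default partition's implicit constraint can
be inspected before and after the attach. The exact expression text
varies by version, but it looks roughly like this:

select pg_get_partition_constraintdef('tpart_default'::regclass);

-- with only tpart_1 attached, roughly:
--   NOT ((i IS NOT NULL) AND (i >= 0) AND (i < 100))
-- after tpart_2 is also attached, roughly:
--   NOT ((i IS NOT NULL) AND (((i >= 0) AND (i < 100)) OR
--                             ((i >= 100) AND (i < 200))))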

By the way, if you reverse the order of operations between sessions 1
and 2, such that the insert by session 2 occurs first and then the
attach by session 1, you will correctly get this error from the
attach command:

ERROR: updated partition constraint for default partition
"tpart_default" would be violated by some row

Attached is a patch to fix things on the insert side.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

[1]: /messages/by-id/DM5PR0501MB3910E97A9EDFB4C775CF3D75A42F0@DM5PR0501MB3910.namprd05.prod.outlook.com

Attachments:

v1-0001-Check-default-partition-s-constraint-even-after-t.patch (application/octet-stream)
From 34e5992d122998b756b631ebf689267b03032504 Mon Sep 17 00:00:00 2001
From: amitlan <amitlangote09@gmail.com>
Date: Thu, 3 Sep 2020 15:53:04 +0900
Subject: [PATCH v1] Check default partition's constraint even after tuple
 routing

A partition's constraint is not checked if a row has been inserted
into it via tuple routing, because the routing process ensures that
only the tuples satisfying the constraint make it to a given
partition.  For a default partition however, whose constraint can
change on-the-fly due to concurrent addition of partitions, this
assumption does not always hold.  So, check the constraint even
when inserting into a default partition via tuple routing.

This also adds an isolation test based on the reproduction steps
described by Hao Wu.

Reported by: Hao Wu <hawu@vmware.com>
---
 src/backend/executor/execPartition.c               | 62 ++++++++++++++++++++--
 .../expected/partition-concurrent-attach.out       | 21 ++++++++
 src/test/isolation/isolation_schedule              |  1 +
 .../specs/partition-concurrent-attach.spec         | 34 ++++++++++++
 4 files changed, 113 insertions(+), 5 deletions(-)
 create mode 100644 src/test/isolation/expected/partition-concurrent-attach.out
 create mode 100644 src/test/isolation/specs/partition-concurrent-attach.spec

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 79fcbd6..e231b32 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -51,6 +51,13 @@
  *		PartitionDispatchData->indexes for details on how this array is
  *		indexed.
  *
+ * nonleaf_partitions
+ * 		Array of 'max_dispatch' elements containing pointers to
+ * 		ResultRelInfos of nonleaf partitions touched by tuple routing.  These
+ * 		are currently only used for checking the partition's constraint if
+ * 		the nonleaf partition happens to be a default partition of its
+ * 		parent
+ *
  * num_dispatch
  *		The current number of items stored in the 'partition_dispatch_info'
  *		array.  Also serves as the index of the next free array element for
@@ -89,6 +96,7 @@ struct PartitionTupleRouting
 {
 	Relation	partition_root;
 	PartitionDispatch *partition_dispatch_info;
+	ResultRelInfo **nonleaf_partitions;
 	int			num_dispatch;
 	int			max_dispatch;
 	ResultRelInfo **partitions;
@@ -283,6 +291,7 @@ ExecFindPartition(ModifyTableState *mtstate,
 	TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
 	TupleTableSlot *myslot = NULL;
 	MemoryContext oldcxt;
+	ResultRelInfo *rri = NULL;
 
 	/* use per-tuple context here to avoid leaking memory */
 	oldcxt = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
@@ -352,8 +361,6 @@ ExecFindPartition(ModifyTableState *mtstate,
 
 		if (partdesc->is_leaf[partidx])
 		{
-			ResultRelInfo *rri;
-
 			/*
 			 * Look to see if we've already got a ResultRelInfo for this
 			 * partition.
@@ -405,9 +412,8 @@ ExecFindPartition(ModifyTableState *mtstate,
 			if (slot == myslot)
 				ExecClearTuple(myslot);
 
-			MemoryContextSwitchTo(oldcxt);
-			ecxt->ecxt_scantuple = ecxt_scantuple_old;
-			return rri;
+			/* We've reached the leaf. */
+			dispatch = NULL;
 		}
 		else
 		{
@@ -419,6 +425,8 @@ ExecFindPartition(ModifyTableState *mtstate,
 				/* Already built. */
 				Assert(dispatch->indexes[partidx] < proute->num_dispatch);
 
+				rri = proute->nonleaf_partitions[dispatch->indexes[partidx]];
+
 				/*
 				 * Move down to the next partition level and search again
 				 * until we find a leaf partition that matches this tuple
@@ -440,10 +448,37 @@ ExecFindPartition(ModifyTableState *mtstate,
 															dispatch, partidx);
 				Assert(dispatch->indexes[partidx] >= 0 &&
 					   dispatch->indexes[partidx] < proute->num_dispatch);
+
+				rri = proute->nonleaf_partitions[dispatch->indexes[partidx]];
 				dispatch = subdispatch;
 			}
 		}
+
+		/*
+		 * If this partition is the default one, we must check its partition
+		 * constraint, because it may have changed due to partitions being
+		 * added to the parent concurrently.  We do the check here instead of
+		 * in ExecInsert(), because we would not want to miss checking the
+		 * constraint of any nonleaf partitions as they never make it to
+		 * ExecInsert().
+		 */
+		if (partidx == partdesc->boundinfo->default_index)
+		{
+			Assert(rri != NULL);
+			ExecPartitionCheck(rri, slot, estate, true);
+		}
+
+		/* Break out and return if we've found the leaf partition. */
+		if (dispatch == NULL)
+			break;
 	}
+
+	MemoryContextSwitchTo(oldcxt);
+	ecxt->ecxt_scantuple = ecxt_scantuple_old;
+
+	Assert(rri != NULL);
+
+	return rri;
 }
 
 /*
@@ -992,6 +1027,7 @@ ExecInitPartitionDispatchInfo(EState *estate,
 	Relation	rel;
 	PartitionDesc partdesc;
 	PartitionDispatch pd;
+	ResultRelInfo *rri = NULL;
 	int			dispatchidx;
 	MemoryContext oldcxt;
 
@@ -1060,6 +1096,8 @@ ExecInitPartitionDispatchInfo(EState *estate,
 			proute->max_dispatch = 4;
 			proute->partition_dispatch_info = (PartitionDispatch *)
 				palloc(sizeof(PartitionDispatch) * proute->max_dispatch);
+			proute->nonleaf_partitions = (ResultRelInfo **)
+				palloc(sizeof(ResultRelInfo *) * proute->max_dispatch);
 		}
 		else
 		{
@@ -1067,6 +1105,9 @@ ExecInitPartitionDispatchInfo(EState *estate,
 			proute->partition_dispatch_info = (PartitionDispatch *)
 				repalloc(proute->partition_dispatch_info,
 						 sizeof(PartitionDispatch) * proute->max_dispatch);
+			proute->nonleaf_partitions = (ResultRelInfo **)
+				repalloc(proute->nonleaf_partitions,
+						 sizeof(ResultRelInfo *) * proute->max_dispatch);
 		}
 	}
 	proute->partition_dispatch_info[dispatchidx] = pd;
@@ -1081,6 +1122,17 @@ ExecInitPartitionDispatchInfo(EState *estate,
 		parent_pd->indexes[partidx] = dispatchidx;
 	}
 
+	/*
+	 * Also create a minimally valid ResultRelInfo to check the partition's
+	 * partition's constraint.
+	 */
+	if (rel->rd_rel->relispartition)
+	{
+		rri = makeNode(ResultRelInfo);
+		InitResultRelInfo(rri, rel, 1, proute->partition_root, 0);
+	}
+	proute->nonleaf_partitions[dispatchidx] = rri;
+
 	MemoryContextSwitchTo(oldcxt);
 
 	return pd;
diff --git a/src/test/isolation/expected/partition-concurrent-attach.out b/src/test/isolation/expected/partition-concurrent-attach.out
new file mode 100644
index 0000000..553bb44
--- /dev/null
+++ b/src/test/isolation/expected/partition-concurrent-attach.out
@@ -0,0 +1,21 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1b s1a s2b s2i s1c s2c
+step s1b: begin;
+step s1a: alter table tpart attach partition tpart_2 for values from (100) to (200);
+step s2b: begin;
+step s2i: insert into tpart values (110,110),(120,120),(150,150); <waiting ...>
+step s1c: commit;
+step s2i: <... completed>
+error in steps s1c s2i: ERROR:  new row for relation "tpart_default" violates partition constraint
+step s2c: commit;
+
+starting permutation: s1b s2b s2i s1a s2c s1c
+step s1b: begin;
+step s2b: begin;
+step s2i: insert into tpart values (110,110),(120,120),(150,150);
+step s1a: alter table tpart attach partition tpart_2 for values from (100) to (200); <waiting ...>
+step s2c: commit;
+step s1a: <... completed>
+error in steps s2c s1a: ERROR:  updated partition constraint for default partition "tpart_default" would be violated by some row
+step s1c: commit;
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 218c87b..56c08fd 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -88,3 +88,4 @@ test: plpgsql-toast
 test: truncate-conflict
 test: serializable-parallel
 test: serializable-parallel-2
+test: partition-concurrent-attach
diff --git a/src/test/isolation/specs/partition-concurrent-attach.spec b/src/test/isolation/specs/partition-concurrent-attach.spec
new file mode 100644
index 0000000..5290e8a
--- /dev/null
+++ b/src/test/isolation/specs/partition-concurrent-attach.spec
@@ -0,0 +1,34 @@
+# Verify that default partition constraint is enforced correctly
+# in light of partitions being added concurrently to its parent
+setup {
+drop table if exists tpart;
+  create table tpart(i int, j int) partition by range(i);
+  create table tpart_1(like tpart);
+  create table tpart_2(like tpart);
+  create table tpart_default(like tpart);
+  alter table tpart attach partition tpart_1 for values from(0) to (100);
+  alter table tpart attach partition tpart_default default;
+  insert into tpart_2 values(110,110),(120,120),(150,150);
+}
+
+session "s1"
+step "s1b"	{	begin; }
+step "s1a"	{	alter table tpart attach partition tpart_2 for values from (100) to (200); }
+step "s1c"	{	commit; }
+
+session "s2"
+step "s2b"	{	begin; }
+step "s2i"	{	insert into tpart values (110,110),(120,120),(150,150); }
+step "s2c"	{	commit; }
+
+teardown	{	drop table tpart; }
+
+# insert into tpart by s2 which routes to tpart_default due to not seeing
+# concurrently added tpart_2 should fail, because the partition constraint
+# of tpart_default would have changed due to tpart_2 having been added
+permutation "s1b" "s1a" "s2b" "s2i" "s1c" "s2c"
+
+# reverse: now the insert into tpart_default by s2 occurs first followed by
+# attach in s1, which should fail when it scans the default partitions to
+# find the violating rows
+permutation "s1b" "s2b" "s2i" "s1a" "s2c" "s1c"
-- 
1.8.3.1

#2 Amit Langote
amitlangote09@gmail.com
In reply to: Amit Langote (#1)
Re: default partition and concurrent attach partition

On Thu, Sep 3, 2020 at 6:50 PM Amit Langote <amitlangote09@gmail.com> wrote:

[... full text of #1, with the repro steps and analysis, quoted ...]

Forgot to mention that the problem exists as of v12 (commit: 898e5e329).

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

#3 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Langote (#1)
1 attachment(s)
Re: default partition and concurrent attach partition

Thanks for this fix! Looking into it now.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

fixes (text/plain)
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 4d34734a45..fe42670e0a 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -52,11 +52,9 @@
  *		indexed.
  *
  * nonleaf_partitions
- * 		Array of 'max_dispatch' elements containing pointers to
- * 		ResultRelInfos of nonleaf partitions touched by tuple routing.  These
- * 		are currently only used for checking the partition's constraint if
- * 		the nonleaf partition happens to be a default partition of its
- * 		parent
+ *		Array of 'max_dispatch' elements containing pointers to fake
+ *		ResultRelInfo objects for nonleaf partitions, useful for checking
+ *		the partition constraint.
  *
  * num_dispatch
  *		The current number of items stored in the 'partition_dispatch_info'
@@ -305,7 +303,7 @@ ExecFindPartition(ModifyTableState *mtstate,
 
 	/* start with the root partitioned table */
 	dispatch = pd[0];
-	while (true)
+	do
 	{
 		AttrNumber *map = dispatch->tupmap;
 		int			partidx = -1;
@@ -455,22 +453,12 @@ ExecFindPartition(ModifyTableState *mtstate,
 
 		/*
 		 * If this partition is the default one, we must check its partition
-		 * constraint, because it may have changed due to partitions being
-		 * added to the parent concurrently.  We do the check here instead of
-		 * in ExecInsert(), because we would not want to miss checking the
-		 * constraint of any nonleaf partitions as they never make it to
-		 * ExecInsert().
+		 * constraint now, which may have changed due to partitions being
+		 * added to the parent concurrently.
 		 */
 		if (partidx == partdesc->boundinfo->default_index)
-		{
-			Assert(rri != NULL);
 			ExecPartitionCheck(rri, slot, estate, true);
-		}
-
-		/* Break out and return if we've found the leaf partition. */
-		if (dispatch == NULL)
-			break;
-	}
+	} while (dispatch != NULL);
 
 	MemoryContextSwitchTo(oldcxt);
 	ecxt->ecxt_scantuple = ecxt_scantuple_old;
@@ -1037,7 +1025,6 @@ ExecInitPartitionDispatchInfo(EState *estate,
 	Relation	rel;
 	PartitionDesc partdesc;
 	PartitionDispatch pd;
-	ResultRelInfo *rri = NULL;
 	int			dispatchidx;
 	MemoryContext oldcxt;
 
@@ -1123,6 +1110,19 @@ ExecInitPartitionDispatchInfo(EState *estate,
 	}
 	proute->partition_dispatch_info[dispatchidx] = pd;
 
+	/*
+	 * If setting up a PartitionDispatch for a sub-partitioned table, we may
+	 * also need a fake ResultRelInfo for checking the partition constraint
+	 * later; set that up now.
+	 */
+	if (parent_pd)
+	{
+		ResultRelInfo *rri = makeNode(ResultRelInfo);
+
+		InitResultRelInfo(rri, rel, 1, proute->partition_root, 0);
+		proute->nonleaf_partitions[dispatchidx] = rri;
+	}
+
 	/*
 	 * Finally, if setting up a PartitionDispatch for a sub-partitioned table,
 	 * install a downlink in the parent to allow quick descent.
@@ -1133,17 +1133,6 @@ ExecInitPartitionDispatchInfo(EState *estate,
 		parent_pd->indexes[partidx] = dispatchidx;
 	}
 
-	/*
-	 * Also create a minimally valid ResultRelInfo to check the partition's
-	 * partition's constraint.
-	 */
-	if (rel->rd_rel->relispartition)
-	{
-		rri = makeNode(ResultRelInfo);
-		InitResultRelInfo(rri, rel, 1, proute->partition_root, 0);
-	}
-	proute->nonleaf_partitions[dispatchidx] = rri;
-
 	MemoryContextSwitchTo(oldcxt);
 
 	return pd;
#4 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#3)
Re: default partition and concurrent attach partition

On 2020-Sep-03, Alvaro Herrera wrote:

+	/*
+	 * If setting up a PartitionDispatch for a sub-partitioned table, we may
+	 * also need a fake ResultRelInfo for checking the partition constraint
+	 * later; set that up now.
+	 */
+	if (parent_pd)
+	{
+		ResultRelInfo *rri = makeNode(ResultRelInfo);
+
+		InitResultRelInfo(rri, rel, 1, proute->partition_root, 0);
+		proute->nonleaf_partitions[dispatchidx] = rri;
+	}
+

Heh, I had intended to remove the attachment before sending, because I
thought I was seeing a problem with this proposed coding of mine. But
since I sent it by mistake, I'll explain: I think this will fail if we
have a partitioned default partition and we insert into it directly
while attaching a partition to its parent. I think that kind of
scenario deserves its own test case.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#5 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Langote (#1)
Re: default partition and concurrent attach partition

Also, I should have pointed out that ExecInsert doesn't actually check
the partition constraint except in very specific cases; what it does is
expect that the partition routing code got it right. So the comment
you're adding about that is wrong, and it misled me into changing
your code in a way that actually caused a bug -- hence my proposed
rewording.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#6 Amit Langote
amitlangote09@gmail.com
In reply to: Alvaro Herrera (#5)
Re: default partition and concurrent attach partition

Hi Alvaro,

On Fri, Sep 4, 2020 at 6:28 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Also, I should have pointed out that ExecInsert doesn't actually check
the partition constraint except in very specific cases; what it does is
expect that the partition routing code got it right. So the comment
you're adding about that is wrong, and it misled me into changing
your code in a way that actually caused a bug -- hence my proposed
rewording.

Thanks for taking a look.

        /*
         * If this partition is the default one, we must check its partition
-        * constraint, because it may have changed due to partitions being
-        * added to the parent concurrently.  We do the check here instead of
-        * in ExecInsert(), because we would not want to miss checking the
-        * constraint of any nonleaf partitions as they never make it to
-        * ExecInsert().
+        * constraint now, which may have changed due to partitions being
+        * added to the parent concurrently.

I am fine with your rewording but let me explain how I ended up
writing what I wrote:

At first I had pondered tweaking the following code in ExecInsert() to
fix this bug:

    /*
     * Also check the tuple against the partition constraint, if there is
     * one; except that if we got here via tuple-routing, we don't need to
     * if there's no BR trigger defined on the partition.
     */
    if (resultRelInfo->ri_PartitionCheck &&
        (resultRelInfo->ri_PartitionRoot == NULL ||
         (resultRelInfo->ri_TrigDesc &&
          resultRelInfo->ri_TrigDesc->trig_insert_before_row)))
        ExecPartitionCheck(resultRelInfo, slot, estate, true);

I gave up because I concluded that there isn't enough information at
this place in code to determine if the partition is a default
partition, so I moved my attention to ExecFindPartition() where we
have access to the parent's PartitionDesc which is enough to do so.
In the process of modifying ExecFindPartition() to fix the bug I
realized that default partitions can be partitioned and
sub-partitioned partitions never reach ExecInsert(). So, beside the
earlier mentioned inconvenience of fixing this bug in ExecInsert(),
there is also the problem that such a fix would have missed addressing
sub-partitioned default partitions. I thought someone beside me would
also wonder why ExecInsert() is not touched in this fix, hence the
comment.

FWIW, I still prefer "minimally valid ResultRelInfo" to "fake
ResultRelInfo", because those aren't really fake, are they? They are
as valid as any other ResultRelInfo as far I can see. I said
"minimally valid" because a fully-valid partition ResultRelInfo is one
made by ExecInitPartitionInfo(), which I avoided to use here thinking
that would be overkill.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

#7 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Langote (#6)
1 attachment(s)
Re: default partition and concurrent attach partition

On 2020-Sep-04, Amit Langote wrote:

Hello

FWIW, I still prefer "minimally valid ResultRelInfo" to "fake
ResultRelInfo", because those aren't really fake, are they? They are
as valid as any other ResultRelInfo as far I can see. I said
"minimally valid" because a fully-valid partition ResultRelInfo is one
made by ExecInitPartitionInfo(), which I avoided to use here thinking
that would be overkill.

Well, they are fake in that the ri_RangeTableIndex they carry is bogus,
which means that ExecBuildSlotValueDescription will misbehave if the
partitioned default partition has a different column order than its
parent. That can be evidenced by changing the setup block in the
isolation test as in the attached; and you'll get an undesirable error
like this:

2020-09-07 19:00:49.513 -03 [12981] ERROR: attribute 2 of type record has wrong type
2020-09-07 19:00:49.513 -03 [12981] DETAIL: Table has type text, but query expects integer.
2020-09-07 19:00:49.513 -03 [12981] STATEMENT: insert into attach_tab values (110, 'eleven and five twenties');
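
(The ff.spec attachment is not shown here. A setup along the lines of
the sketch below, modeled on the one that later appears in the v2
patch, reproduces this: a dropped column gives the sub-partitioned
default partition a different column order than its parent. The
attach_tab names come from the error messages above; everything else
is illustrative.)

create table attach_tab (i int, j text) partition by range (i);
create table attach_tab_1 (like attach_tab);
create table attach_tab_2 (like attach_tab);
-- sub-partitioned default partition; dropping the extra columns leaves
-- its attribute numbers differing from the parent's
create table attach_tab_default (a int, j text, i int) partition by list (j);
create table attach_tab_default_default (a int, i int, b int, j text);
alter table attach_tab_default_default drop b;
alter table attach_tab_default attach partition attach_tab_default_default default;
alter table attach_tab_default drop a;
alter table attach_tab attach partition attach_tab_default default;
alter table attach_tab attach partition attach_tab_1 for values from (0) to (100);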

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

ff.spec (text/plain)
#8 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#7)
Re: default partition and concurrent attach partition

On 2020-Sep-07, Alvaro Herrera wrote:

Well, they are fake in that the ri_RangeTableIndex they carry is bogus,
which means that ExecBuildSlotValueDescription will misbehave if the
partitioned default partition has a different column order than its
parent. That can be evidenced by changing the setup block in the
isolation test as in the attached; and you'll get an undesirable error
like this:

2020-09-07 19:00:49.513 -03 [12981] ERROR: attribute 2 of type record has wrong type
2020-09-07 19:00:49.513 -03 [12981] DETAIL: Table has type text, but query expects integer.
2020-09-07 19:00:49.513 -03 [12981] STATEMENT: insert into attach_tab values (110, 'eleven and five twenties');

... and I sent before completing. I'm not sure what a good fix for this
is. We could try to initialize the resultRelInfo honestly, or we could
set ri_RangeTableIndex to some invalid value that will ... eh ... do
something else.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#9 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#8)
1 attachment(s)
Re: default partition and concurrent attach partition

Ah, it looks like we can get away with initializing the RRI to 0, and
then explicitly handle that case in ExecPartitionCheckEmitError, as in
the attached (which means reindenting, but I left it alone to make it
easy to read). It kinda sucks because we don't report the tuple that
causes the error, but

a) it's a very unlikely case anyway
b) it's better than the bogus error message
c) it's better than some hypothetical crash
d) nobody uses partitioned default partitions anyway
e) nobody uses differing column order anyway

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Avoid-invalid-RRI.patch (text/x-diff)
From 7289a0706d95aa621b9d6f626a836ac381fd4f61 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Mon, 7 Sep 2020 19:26:37 -0300
Subject: [PATCH] Avoid invalid RRI

---
 src/backend/executor/execMain.c      | 7 +++++++
 src/backend/executor/execPartition.c | 2 +-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 4fdffad6f3..a0c3e56a03 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1840,6 +1840,12 @@ ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
 	 * back to the root table's rowtype so that val_desc in the error message
 	 * matches the input tuple.
 	 */
+	if (resultRelInfo->ri_RangeTableIndex == 0)
+	{
+		val_desc = NULL;
+	}
+	else
+	{
 	if (resultRelInfo->ri_PartitionRoot)
 	{
 		TupleDesc	old_tupdesc;
@@ -1874,6 +1880,7 @@ ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
 											 tupdesc,
 											 modifiedCols,
 											 64);
+	}
 	ereport(ERROR,
 			(errcode(ERRCODE_CHECK_VIOLATION),
 			 errmsg("new row for relation \"%s\" violates partition constraint",
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 0ca1d34dfa..c1ef34d771 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1111,7 +1111,7 @@ ExecInitPartitionDispatchInfo(EState *estate,
 	{
 		ResultRelInfo *rri = makeNode(ResultRelInfo);
 
-		InitResultRelInfo(rri, rel, 1, proute->partition_root, 0);
+		InitResultRelInfo(rri, rel, 0, proute->partition_root, 0);
 		proute->nonleaf_partitions[dispatchidx] = rri;
 	}
 	else
-- 
2.20.1

#10 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#9)
Re: default partition and concurrent attach partition

On 2020-Sep-07, Alvaro Herrera wrote:

Ah, it looks like we can get away with initializing the RRI to 0, and
then explicitly handle that case in ExecPartitionCheckEmitError, as in
the attached (which means reindenting, but I left it alone to make it
easy to read).

Well, that was silly -- this seems fixable by mapping the tuple columns
prior to ExecPartitionCheck, which is achievable with something like

    if (partidx == partdesc->boundinfo->default_index)
    {
        TupleTableSlot *tempslot;

        if (dispatch->tupmap != NULL)
        {
            tempslot = MakeSingleTupleTableSlot(RelationGetDescr(rri->ri_RelationDesc),
                                                &TTSOpsHeapTuple);
            tempslot = execute_attr_map_slot(dispatch->tupmap, slot, tempslot);
        }
        else
            tempslot = slot;
        ExecPartitionCheck(rri, tempslot, estate, true);
        if (dispatch->tupmap != NULL)
            ExecDropSingleTupleTableSlot(tempslot);
    }

though this exact incantation, apart from being pretty horrible, doesn't
actually work (other than to fix this specific test case -- it crashes
elsewhere).

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#11 Amit Langote
amitlangote09@gmail.com
In reply to: Alvaro Herrera (#10)
2 attachment(s)
Re: default partition and concurrent attach partition

Hi Alvaro,

On Tue, Sep 8, 2020 at 8:44 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

On 2020-Sep-07, Alvaro Herrera wrote:

Ah, it looks like we can get away with initializing the RRI to 0, and
then explicitly handle that case in ExecPartitionCheckEmitError, as in
the attached (which means reindenting, but I left it alone to make it
easy to read).

At this point, I think it may be clear that ri_RangeTableIndex being
set to a dummy value isn't problematic. I checked the code paths you
mentioned upthread, but don't see anything that would be broken by
ri_RangeTableIndex being set to a dummy value of 1.

Moving on from that...

Well, that was silly -- this seems fixable by mapping the tuple columns
prior to ExecPartitionCheck, which is achievable with something like

    if (partidx == partdesc->boundinfo->default_index)
    {
        TupleTableSlot *tempslot;

        if (dispatch->tupmap != NULL)
        {
            tempslot = MakeSingleTupleTableSlot(RelationGetDescr(rri->ri_RelationDesc),
                                                &TTSOpsHeapTuple);
            tempslot = execute_attr_map_slot(dispatch->tupmap, slot, tempslot);
        }
        else
            tempslot = slot;
        ExecPartitionCheck(rri, tempslot, estate, true);
        if (dispatch->tupmap != NULL)
            ExecDropSingleTupleTableSlot(tempslot);
    }

though this exact incantation, apart from being pretty horrible, doesn't
actually work (other than to fix this specific test case -- it crashes
elsewhere).

Yeah, we need to make sure that ExecPartitionCheck gets a slot whose
TupleDesc matches the partition's. Actually we do have such dedicated
slots for partitions around (for both sub-partitioned and leaf
partitions), so we can simply use them instead of creating one from
scratch for every use. It did take some code rearrangement to
actually make that work though.

Attached is the latest patch including those changes. Also, I have
merged your earlier "fixes" patch and updated the isolation test to
exercise a case with a sub-partitioned default partition, as well as
varying attribute numbers. The patch applies as-is to both the HEAD
and v13 branches, but needs slight changes to apply to the v12 branch,
so I am also attaching one for that branch.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v2-0001-Check-default-partition-s-constraint-even-after-t.patch (application/octet-stream)
From e8ccff62337f109a0df98df6f2b03e97951f5fe2 Mon Sep 17 00:00:00 2001
From: amitlan <amitlangote09@gmail.com>
Date: Thu, 3 Sep 2020 15:53:04 +0900
Subject: [PATCH v2] Check default partition's constraint even after tuple
 routing

A partition's constraint is not checked if a row has been inserted
into it via tuple routing, because the routing process ensures that
only the tuples satisfying the constraint make it to a given
partition.  For a default partition however, whose constraint can
change on-the-fly due to concurrent addition of partitions, this
assumption does not always hold.  So, check the constraint even
when inserting into a default partition via tuple routing.

This also adds an isolation test based on the reproduction steps
described by Hao Wu.

Reported by: Hao Wu <hawu@vmware.com>
---
 src/backend/executor/execPartition.c               | 125 +++++++++++++++++----
 .../expected/partition-concurrent-attach.out       |  49 ++++++++
 src/test/isolation/isolation_schedule              |   1 +
 .../specs/partition-concurrent-attach.spec         |  43 +++++++
 4 files changed, 195 insertions(+), 23 deletions(-)
 create mode 100644 src/test/isolation/expected/partition-concurrent-attach.out
 create mode 100644 src/test/isolation/specs/partition-concurrent-attach.spec

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 79fcbd6..5910d0e 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -51,6 +51,11 @@
  *		PartitionDispatchData->indexes for details on how this array is
  *		indexed.
  *
+ * nonleaf_partitions
+ *		Array of 'max_dispatch' elements containing pointers to fake
+ *		ResultRelInfo objects for nonleaf partitions, useful for checking
+ *		the partition constraint.
+ *
  * num_dispatch
  *		The current number of items stored in the 'partition_dispatch_info'
  *		array.  Also serves as the index of the next free array element for
@@ -89,6 +94,7 @@ struct PartitionTupleRouting
 {
 	Relation	partition_root;
 	PartitionDispatch *partition_dispatch_info;
+	ResultRelInfo **nonleaf_partitions;
 	int			num_dispatch;
 	int			max_dispatch;
 	ResultRelInfo **partitions;
@@ -281,8 +287,10 @@ ExecFindPartition(ModifyTableState *mtstate,
 	PartitionDesc partdesc;
 	ExprContext *ecxt = GetPerTupleExprContext(estate);
 	TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
+	TupleTableSlot *rootslot = slot;
 	TupleTableSlot *myslot = NULL;
 	MemoryContext oldcxt;
+	ResultRelInfo *rri = NULL;
 
 	/* use per-tuple context here to avoid leaking memory */
 	oldcxt = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
@@ -296,9 +304,8 @@ ExecFindPartition(ModifyTableState *mtstate,
 
 	/* start with the root partitioned table */
 	dispatch = pd[0];
-	while (true)
+	do
 	{
-		AttrMap    *map = dispatch->tupmap;
 		int			partidx = -1;
 
 		CHECK_FOR_INTERRUPTS();
@@ -307,17 +314,6 @@ ExecFindPartition(ModifyTableState *mtstate,
 		partdesc = dispatch->partdesc;
 
 		/*
-		 * Convert the tuple to this parent's layout, if different from the
-		 * current relation.
-		 */
-		myslot = dispatch->tupslot;
-		if (myslot != NULL)
-		{
-			Assert(map != NULL);
-			slot = execute_attr_map_slot(map, slot, myslot);
-		}
-
-		/*
 		 * Extract partition key from tuple. Expression evaluation machinery
 		 * that FormPartitionKeyDatum() invokes expects ecxt_scantuple to
 		 * point to the correct tuple slot.  The slot might have changed from
@@ -352,8 +348,6 @@ ExecFindPartition(ModifyTableState *mtstate,
 
 		if (partdesc->is_leaf[partidx])
 		{
-			ResultRelInfo *rri;
-
 			/*
 			 * Look to see if we've already got a ResultRelInfo for this
 			 * partition.
@@ -401,13 +395,8 @@ ExecFindPartition(ModifyTableState *mtstate,
 												rootResultRelInfo, partidx);
 			}
 
-			/* Release the tuple in the lowest parent's dedicated slot. */
-			if (slot == myslot)
-				ExecClearTuple(myslot);
-
-			MemoryContextSwitchTo(oldcxt);
-			ecxt->ecxt_scantuple = ecxt_scantuple_old;
-			return rri;
+			/* We've reached the leaf. */
+			dispatch = NULL;
 		}
 		else
 		{
@@ -419,6 +408,8 @@ ExecFindPartition(ModifyTableState *mtstate,
 				/* Already built. */
 				Assert(dispatch->indexes[partidx] < proute->num_dispatch);
 
+				rri = proute->nonleaf_partitions[dispatch->indexes[partidx]];
+
 				/*
 				 * Move down to the next partition level and search again
 				 * until we find a leaf partition that matches this tuple
@@ -440,10 +431,80 @@ ExecFindPartition(ModifyTableState *mtstate,
 															dispatch, partidx);
 				Assert(dispatch->indexes[partidx] >= 0 &&
 					   dispatch->indexes[partidx] < proute->num_dispatch);
+
+				rri = proute->nonleaf_partitions[dispatch->indexes[partidx]];
 				dispatch = subdispatch;
 			}
+
+			/*
+			 * Convert the tuple to the new parent's layout, if different
+			 * from the previous parent.
+			 */
+			if (dispatch->tupslot)
+			{
+				AttrMap    *map = dispatch->tupmap;
+				TupleTableSlot *prev_myslot = myslot;
+
+				myslot = dispatch->tupslot;
+				Assert(map != NULL);
+				slot = execute_attr_map_slot(map, slot, myslot);
+
+				/* Clear previous parent's tuple if any. */
+				if (prev_myslot != NULL)
+					ExecClearTuple(prev_myslot);
+			}
 		}
-	}
+
+		/*
+		 * If this partition is the default one, we must check its partition
+		 * constraint, which, unlike a non-default partition's constraint, can
+		 * change due to partitions being added to the parent concurrently.
+		 *
+		 * The tuple must match the partition's layout for the constraint
+		 * expression to be evaluated successfully.  If the partition is
+		 * sub-partitioned, that would already be the case due to the code
+		 * above, but for a leaf partition the tuple still matches the
+		 * parent's layout.
+		 */
+		if (partidx == partdesc->boundinfo->default_index)
+		{
+			PartitionRoutingInfo *partrouteinfo = rri->ri_PartitionInfo;
+
+			if (partrouteinfo)
+			{
+				TupleConversionMap *map = partrouteinfo->pi_RootToPartitionMap;
+				TupleTableSlot *pslot = partrouteinfo->pi_PartitionTupleSlot;
+
+				/*
+				 * As the name says, we can only convert from root table's
+				 * layout, so use the slot received from the caller, rootslot.
+				 */
+				if (map)
+					slot = execute_attr_map_slot(map->attrMap, rootslot,
+												 pslot);
+				else
+					/*
+					 * Or use root layout slot as-is.  Note we don't use
+					 * the parent's slot, because we don't know if it's same
+					 * layout or different from the leaf partition, but we
+					 * know for sure that root and leaf are the same.
+					 */
+					slot = rootslot;
+			}
+
+			ExecPartitionCheck(rri, slot, estate, true);
+		}
+	} while (dispatch != NULL);
+
+	/* Release the tuple in the lowest parent's dedicated slot. */
+	if (myslot != NULL)
+		ExecClearTuple(myslot);
+	MemoryContextSwitchTo(oldcxt);
+	ecxt->ecxt_scantuple = ecxt_scantuple_old;
+
+	Assert(rri != NULL);
+
+	return rri;
 }
 
 /*
@@ -1060,6 +1121,8 @@ ExecInitPartitionDispatchInfo(EState *estate,
 			proute->max_dispatch = 4;
 			proute->partition_dispatch_info = (PartitionDispatch *)
 				palloc(sizeof(PartitionDispatch) * proute->max_dispatch);
+			proute->nonleaf_partitions = (ResultRelInfo **)
+				palloc(sizeof(ResultRelInfo *) * proute->max_dispatch);
 		}
 		else
 		{
@@ -1067,11 +1130,27 @@ ExecInitPartitionDispatchInfo(EState *estate,
 			proute->partition_dispatch_info = (PartitionDispatch *)
 				repalloc(proute->partition_dispatch_info,
 						 sizeof(PartitionDispatch) * proute->max_dispatch);
+			proute->nonleaf_partitions = (ResultRelInfo **)
+				repalloc(proute->nonleaf_partitions,
+						 sizeof(ResultRelInfo *) * proute->max_dispatch);
 		}
 	}
 	proute->partition_dispatch_info[dispatchidx] = pd;
 
 	/*
+	 * If setting up a PartitionDispatch for a sub-partitioned table, we may
+	 * also need a fake ResultRelInfo for checking the partition constraint
+	 * later; set that up now.
+	 */
+	if (parent_pd)
+	{
+		ResultRelInfo *rri = makeNode(ResultRelInfo);
+
+		InitResultRelInfo(rri, rel, 1, proute->partition_root, 0);
+		proute->nonleaf_partitions[dispatchidx] = rri;
+	}
+
+	/*
 	 * Finally, if setting up a PartitionDispatch for a sub-partitioned table,
 	 * install a downlink in the parent to allow quick descent.
 	 */
diff --git a/src/test/isolation/expected/partition-concurrent-attach.out b/src/test/isolation/expected/partition-concurrent-attach.out
new file mode 100644
index 0000000..17fac39
--- /dev/null
+++ b/src/test/isolation/expected/partition-concurrent-attach.out
@@ -0,0 +1,49 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1b s1a s2b s2i s1c s2c s2s
+step s1b: begin;
+step s1a: alter table tpart attach partition tpart_2 for values from (100) to (200);
+step s2b: begin;
+step s2i: insert into tpart values (110,'xxx'), (120, 'yyy'), (150, 'zzz'); <waiting ...>
+step s1c: commit;
+step s2i: <... completed>
+error in steps s1c s2i: ERROR:  new row for relation "tpart_default" violates partition constraint
+step s2c: commit;
+step s2s: select tableoid::regclass, * from tpart;
+tableoid       i              j              
+
+tpart_2        110            xxx            
+tpart_2        120            yyy            
+tpart_2        150            zzz            
+
+starting permutation: s1b s1a s2b s2i2 s1c s2c s2s
+step s1b: begin;
+step s1a: alter table tpart attach partition tpart_2 for values from (100) to (200);
+step s2b: begin;
+step s2i2: insert into tpart_default (i, j) values (110, 'xxx'), (120, 'yyy'), (150, 'zzz'); <waiting ...>
+step s1c: commit;
+step s2i2: <... completed>
+error in steps s1c s2i2: ERROR:  new row for relation "tpart_default" violates partition constraint
+step s2c: commit;
+step s2s: select tableoid::regclass, * from tpart;
+tableoid       i              j              
+
+tpart_2        110            xxx            
+tpart_2        120            yyy            
+tpart_2        150            zzz            
+
+starting permutation: s1b s2b s2i s1a s2c s1c s2s
+step s1b: begin;
+step s2b: begin;
+step s2i: insert into tpart values (110,'xxx'), (120, 'yyy'), (150, 'zzz');
+step s1a: alter table tpart attach partition tpart_2 for values from (100) to (200); <waiting ...>
+step s2c: commit;
+step s1a: <... completed>
+error in steps s2c s1a: ERROR:  updated partition constraint for default partition "tpart_default_default" would be violated by some row
+step s1c: commit;
+step s2s: select tableoid::regclass, * from tpart;
+tableoid       i              j              
+
+tpart_default_default110            xxx            
+tpart_default_default120            yyy            
+tpart_default_default150            zzz            
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 65d1443..f838276 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -90,3 +90,4 @@ test: plpgsql-toast
 test: truncate-conflict
 test: serializable-parallel
 test: serializable-parallel-2
+test: partition-concurrent-attach
diff --git a/src/test/isolation/specs/partition-concurrent-attach.spec b/src/test/isolation/specs/partition-concurrent-attach.spec
new file mode 100644
index 0000000..48c3f83
--- /dev/null
+++ b/src/test/isolation/specs/partition-concurrent-attach.spec
@@ -0,0 +1,43 @@
+# Verify that default partition constraint is enforced correctly
+# in light of partitions being added concurrently to its parent
+setup {
+  drop table if exists tpart;
+  create table tpart(i int, j text) partition by range(i);
+  create table tpart_1(like tpart);
+  create table tpart_2(like tpart);
+  create table tpart_default (a int, j text, i int) partition by list (j);
+  create table tpart_default_default (a int, i int, b int, j text);
+  alter table tpart_default_default drop b;
+  alter table tpart_default attach partition tpart_default_default default;
+  alter table tpart_default drop a;
+  alter table tpart attach partition tpart_default default;
+  alter table tpart attach partition tpart_1 for values from(0) to (100);
+  insert into tpart_2 values (110,'xxx'), (120, 'yyy'), (150, 'zzz');
+}
+
+session "s1"
+step "s1b"	{	begin; }
+step "s1a"	{	alter table tpart attach partition tpart_2 for values from (100) to (200); }
+step "s1c"	{	commit; }
+
+session "s2"
+step "s2b"	{	begin; }
+step "s2i"	{	insert into tpart values (110,'xxx'), (120, 'yyy'), (150, 'zzz'); }
+step "s2i2"	{	insert into tpart_default (i, j) values (110, 'xxx'), (120, 'yyy'), (150, 'zzz'); }
+step "s2c"	{	commit; }
+step "s2s"	{	select tableoid::regclass, * from tpart; }
+
+teardown	{	drop table tpart; }
+
+# insert into tpart by s2 which routes to tpart_default due to not seeing
+# concurrently added tpart_2 should fail, because the partition constraint
+# of tpart_default would have changed due to tpart_2 having been added
+permutation "s1b" "s1a" "s2b" "s2i" "s1c" "s2c" "s2s"
+
+# similar to above, but now insert into sub-partitioned tpart_default
+permutation "s1b" "s1a" "s2b" "s2i2" "s1c" "s2c" "s2s"
+
+# reverse: now the insert into tpart_default by s2 occurs first followed by
+# attach in s1, which should fail when it scans the leaf default partition
+# find the violating rows
+permutation "s1b" "s2b" "s2i" "s1a" "s2c" "s1c" "s2s"
-- 
1.8.3.1

v2-0001-Check-default-partition-s-constraint-even-after-t_v12.patch (application/octet-stream)
From 1cb36346098b1298daa09f621b65dbe1e869ba44 Mon Sep 17 00:00:00 2001
From: amitlan <amitlangote09@gmail.com>
Date: Thu, 3 Sep 2020 15:53:04 +0900
Subject: [PATCH v2] Check default partition's constraint even after tuple
 routing

A partition's constraint is not checked if a row has been inserted
into it via tuple routing, because the routing process ensures that
only the tuples satisfying the constraint make it to a given
partition.  For a default partition however, whose constraint can
change on-the-fly due to concurrent addition of partitions, this
assumption does not always hold.  So, check the constraint even
when inserting into a default partition via tuple routing.

This also adds an isolation test based on the reproduction steps
described by Hao Wu.

Reported by: Hao Wu <hawu@vmware.com>
---
 src/backend/executor/execPartition.c               | 125 +++++++++++++++++----
 .../expected/partition-concurrent-attach.out       |  49 ++++++++
 src/test/isolation/isolation_schedule              |   1 +
 .../specs/partition-concurrent-attach.spec         |  43 +++++++
 4 files changed, 195 insertions(+), 23 deletions(-)
 create mode 100644 src/test/isolation/expected/partition-concurrent-attach.out
 create mode 100644 src/test/isolation/specs/partition-concurrent-attach.spec

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index b0594e8..f8c250c 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -51,6 +51,11 @@
  *		PartitionDispatchData->indexes for details on how this array is
  *		indexed.
  *
+ * nonleaf_partitions
+ *		Array of 'max_dispatch' elements containing pointers to fake
+ *		ResultRelInfo objects for nonleaf partitions, useful for checking
+ *		the partition constraint.
+ *
  * num_dispatch
  *		The current number of items stored in the 'partition_dispatch_info'
  *		array.  Also serves as the index of the next free array element for
@@ -89,6 +94,7 @@ struct PartitionTupleRouting
 {
 	Relation	partition_root;
 	PartitionDispatch *partition_dispatch_info;
+	ResultRelInfo **nonleaf_partitions;
 	int			num_dispatch;
 	int			max_dispatch;
 	ResultRelInfo **partitions;
@@ -281,8 +287,10 @@ ExecFindPartition(ModifyTableState *mtstate,
 	PartitionDesc partdesc;
 	ExprContext *ecxt = GetPerTupleExprContext(estate);
 	TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
+	TupleTableSlot *rootslot = slot;
 	TupleTableSlot *myslot = NULL;
 	MemoryContext oldcxt;
+	ResultRelInfo *rri = NULL;
 
 	/* use per-tuple context here to avoid leaking memory */
 	oldcxt = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
@@ -296,9 +304,8 @@ ExecFindPartition(ModifyTableState *mtstate,
 
 	/* start with the root partitioned table */
 	dispatch = pd[0];
-	while (true)
+	do
 	{
-		AttrNumber *map = dispatch->tupmap;
 		int			partidx = -1;
 
 		CHECK_FOR_INTERRUPTS();
@@ -307,17 +314,6 @@ ExecFindPartition(ModifyTableState *mtstate,
 		partdesc = dispatch->partdesc;
 
 		/*
-		 * Convert the tuple to this parent's layout, if different from the
-		 * current relation.
-		 */
-		myslot = dispatch->tupslot;
-		if (myslot != NULL)
-		{
-			Assert(map != NULL);
-			slot = execute_attr_map_slot(map, slot, myslot);
-		}
-
-		/*
 		 * Extract partition key from tuple. Expression evaluation machinery
 		 * that FormPartitionKeyDatum() invokes expects ecxt_scantuple to
 		 * point to the correct tuple slot.  The slot might have changed from
@@ -351,8 +347,6 @@ ExecFindPartition(ModifyTableState *mtstate,
 
 		if (partdesc->is_leaf[partidx])
 		{
-			ResultRelInfo *rri;
-
 			/*
 			 * Look to see if we've already got a ResultRelInfo for this
 			 * partition.
@@ -400,13 +394,8 @@ ExecFindPartition(ModifyTableState *mtstate,
 												rootResultRelInfo, partidx);
 			}
 
-			/* Release the tuple in the lowest parent's dedicated slot. */
-			if (slot == myslot)
-				ExecClearTuple(myslot);
-
-			MemoryContextSwitchTo(oldcxt);
-			ecxt->ecxt_scantuple = ecxt_scantuple_old;
-			return rri;
+			/* We've reached the leaf. */
+			dispatch = NULL;
 		}
 		else
 		{
@@ -418,6 +407,8 @@ ExecFindPartition(ModifyTableState *mtstate,
 				/* Already built. */
 				Assert(dispatch->indexes[partidx] < proute->num_dispatch);
 
+				rri = proute->nonleaf_partitions[dispatch->indexes[partidx]];
+
 				/*
 				 * Move down to the next partition level and search again
 				 * until we find a leaf partition that matches this tuple
@@ -439,10 +430,80 @@ ExecFindPartition(ModifyTableState *mtstate,
 															dispatch, partidx);
 				Assert(dispatch->indexes[partidx] >= 0 &&
 					   dispatch->indexes[partidx] < proute->num_dispatch);
+
+				rri = proute->nonleaf_partitions[dispatch->indexes[partidx]];
 				dispatch = subdispatch;
 			}
+
+			/*
+			 * Convert the tuple to the new parent's layout, if different
+			 * from the previous parent.
+			 */
+			if (dispatch->tupslot)
+			{
+				AttrNumber *map = dispatch->tupmap;
+				TupleTableSlot *prev_myslot = myslot;
+
+				myslot = dispatch->tupslot;
+				Assert(map != NULL);
+				slot = execute_attr_map_slot(map, slot, myslot);
+
+				/* Clear previous parent's tuple if any. */
+				if (prev_myslot != NULL)
+					ExecClearTuple(prev_myslot);
+			}
 		}
-	}
+
+		/*
+		 * If this partition is the default one, we must check its partition
+		 * constraint, which, unlike a non-default partition's constraint, can
+		 * change due to partitions being added to the parent concurrently.
+		 *
+		 * The tuple must match the partition's layout for the constraint
+		 * expression to be evaluated successfully.  If the partition is
+		 * sub-partitioned, that would already be the case due to the code
+		 * above, but for a leaf partition the tuple still matches the
+		 * parent's layout.
+		 */
+		if (partidx == partdesc->boundinfo->default_index)
+		{
+			PartitionRoutingInfo *partrouteinfo = rri->ri_PartitionInfo;
+
+			if (partrouteinfo)
+			{
+				TupleConversionMap *map = partrouteinfo->pi_RootToPartitionMap;
+				TupleTableSlot *pslot = partrouteinfo->pi_PartitionTupleSlot;
+
+				/*
+				 * As the name says, we can only convert from root table's
+				 * layout, so use the slot received from the caller, rootslot.
+				 */
+				if (map)
+					slot = execute_attr_map_slot(map->attrMap, rootslot,
+												 pslot);
+				else
+					/*
+					 * Or use root layout slot as-is.  Note we don't use
+					 * the parent's slot, because we don't know if it's same
+					 * layout or different from the leaf partition, but we
+					 * know for sure that root and leaf are the same.
+					 */
+					slot = rootslot;
+			}
+
+			ExecPartitionCheck(rri, slot, estate, true);
+		}
+	} while (dispatch != NULL);
+
+	/* Release the tuple in the lowest parent's dedicated slot. */
+	if (myslot != NULL)
+		ExecClearTuple(myslot);
+	MemoryContextSwitchTo(oldcxt);
+	ecxt->ecxt_scantuple = ecxt_scantuple_old;
+
+	Assert(rri != NULL);
+
+	return rri;
 }
 
 /*
@@ -1071,6 +1132,8 @@ ExecInitPartitionDispatchInfo(EState *estate,
 			proute->max_dispatch = 4;
 			proute->partition_dispatch_info = (PartitionDispatch *)
 				palloc(sizeof(PartitionDispatch) * proute->max_dispatch);
+			proute->nonleaf_partitions = (ResultRelInfo **)
+				palloc(sizeof(ResultRelInfo *) * proute->max_dispatch);
 		}
 		else
 		{
@@ -1078,11 +1141,27 @@ ExecInitPartitionDispatchInfo(EState *estate,
 			proute->partition_dispatch_info = (PartitionDispatch *)
 				repalloc(proute->partition_dispatch_info,
 						 sizeof(PartitionDispatch) * proute->max_dispatch);
+			proute->nonleaf_partitions = (ResultRelInfo **)
+				repalloc(proute->nonleaf_partitions,
+						 sizeof(ResultRelInfo *) * proute->max_dispatch);
 		}
 	}
 	proute->partition_dispatch_info[dispatchidx] = pd;
 
 	/*
+	 * If setting up a PartitionDispatch for a sub-partitioned table, we may
+	 * also need a fake ResultRelInfo for checking the partition constraint
+	 * later; set that up now.
+	 */
+	if (parent_pd)
+	{
+		ResultRelInfo *rri = makeNode(ResultRelInfo);
+
+		InitResultRelInfo(rri, rel, 1, proute->partition_root, 0);
+		proute->nonleaf_partitions[dispatchidx] = rri;
+	}
+
+	/*
 	 * Finally, if setting up a PartitionDispatch for a sub-partitioned table,
 	 * install a downlink in the parent to allow quick descent.
 	 */
diff --git a/src/test/isolation/expected/partition-concurrent-attach.out b/src/test/isolation/expected/partition-concurrent-attach.out
new file mode 100644
index 0000000..b6f70c7
--- /dev/null
+++ b/src/test/isolation/expected/partition-concurrent-attach.out
@@ -0,0 +1,49 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1b s1a s2b s2i s1c s2c s2s
+step s1b: begin;
+step s1a: alter table tpart attach partition tpart_2 for values from (100) to (200);
+step s2b: begin;
+step s2i: insert into tpart values (110,'xxx'), (120, 'yyy'), (150, 'zzz'); <waiting ...>
+step s1c: commit;
+step s2i: <... completed>
+error in steps s1c s2i: ERROR:  new row for relation "tpart_default" violates partition constraint
+step s2c: commit;
+step s2s: select tableoid::regclass, * from tpart;
+tableoid       i              j              
+
+tpart_2        110            xxx            
+tpart_2        120            yyy            
+tpart_2        150            zzz            
+
+starting permutation: s1b s1a s2b s2i2 s1c s2c s2s
+step s1b: begin;
+step s1a: alter table tpart attach partition tpart_2 for values from (100) to (200);
+step s2b: begin;
+step s2i2: insert into tpart_default (i, j) values (110, 'xxx'), (120, 'yyy'), (150, 'zzz'); <waiting ...>
+step s1c: commit;
+step s2i2: <... completed>
+error in steps s1c s2i2: ERROR:  new row for relation "tpart_default" violates partition constraint
+step s2c: commit;
+step s2s: select tableoid::regclass, * from tpart;
+tableoid       i              j              
+
+tpart_2        110            xxx            
+tpart_2        120            yyy            
+tpart_2        150            zzz            
+
+starting permutation: s1b s2b s2i s1a s2c s1c s2s
+step s1b: begin;
+step s2b: begin;
+step s2i: insert into tpart values (110,'xxx'), (120, 'yyy'), (150, 'zzz');
+step s1a: alter table tpart attach partition tpart_2 for values from (100) to (200); <waiting ...>
+step s2c: commit;
+step s1a: <... completed>
+error in steps s2c s1a: ERROR:  updated partition constraint for default partition would be violated by some row
+step s1c: commit;
+step s2s: select tableoid::regclass, * from tpart;
+tableoid       i              j              
+
+tpart_default_default110            xxx            
+tpart_default_default120            yyy            
+tpart_default_default150            zzz            
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index c34910d..9385a63 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -88,3 +88,4 @@ test: plpgsql-toast
 test: truncate-conflict
 test: serializable-parallel
 test: serializable-parallel-2
+test: partition-concurrent-attach
diff --git a/src/test/isolation/specs/partition-concurrent-attach.spec b/src/test/isolation/specs/partition-concurrent-attach.spec
new file mode 100644
index 0000000..48c3f83
--- /dev/null
+++ b/src/test/isolation/specs/partition-concurrent-attach.spec
@@ -0,0 +1,43 @@
+# Verify that default partition constraint is enforced correctly
+# in light of partitions being added concurrently to its parent
+setup {
+  drop table if exists tpart;
+  create table tpart(i int, j text) partition by range(i);
+  create table tpart_1(like tpart);
+  create table tpart_2(like tpart);
+  create table tpart_default (a int, j text, i int) partition by list (j);
+  create table tpart_default_default (a int, i int, b int, j text);
+  alter table tpart_default_default drop b;
+  alter table tpart_default attach partition tpart_default_default default;
+  alter table tpart_default drop a;
+  alter table tpart attach partition tpart_default default;
+  alter table tpart attach partition tpart_1 for values from(0) to (100);
+  insert into tpart_2 values (110,'xxx'), (120, 'yyy'), (150, 'zzz');
+}
+
+session "s1"
+step "s1b"	{	begin; }
+step "s1a"	{	alter table tpart attach partition tpart_2 for values from (100) to (200); }
+step "s1c"	{	commit; }
+
+session "s2"
+step "s2b"	{	begin; }
+step "s2i"	{	insert into tpart values (110,'xxx'), (120, 'yyy'), (150, 'zzz'); }
+step "s2i2"	{	insert into tpart_default (i, j) values (110, 'xxx'), (120, 'yyy'), (150, 'zzz'); }
+step "s2c"	{	commit; }
+step "s2s"	{	select tableoid::regclass, * from tpart; }
+
+teardown	{	drop table tpart; }
+
+# insert into tpart by s2 which routes to tpart_default due to not seeing
+# concurrently added tpart_2 should fail, because the partition constraint
+# of tpart_default would have changed due to tpart_2 having been added
+permutation "s1b" "s1a" "s2b" "s2i" "s1c" "s2c" "s2s"
+
+# similar to above, but now insert into sub-partitioned tpart_default
+permutation "s1b" "s1a" "s2b" "s2i2" "s1c" "s2c" "s2s"
+
+# reverse: now the insert into tpart_default by s2 occurs first followed by
+# attach in s1, which should fail when it scans the leaf default partition to
+# find the violating rows
+permutation "s1b" "s2b" "s2i" "s1a" "s2c" "s1c" "s2s"
-- 
1.8.3.1

#12Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Langote (#11)
Re: default partition and concurrent attach partition

Hello Amit,

On 2020-Sep-08, Amit Langote wrote:

> On Tue, Sep 8, 2020 at 8:44 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > On 2020-Sep-07, Alvaro Herrera wrote:
> >
> > Ah, it looks like we can get away with initializing the RRI to 0, and
> > then explicitly handle that case in ExecPartitionCheckEmitError, as in
> > the attached (which means reindenting, but I left it alone to make it
> > easy to read).
>
> At this point, I think it may be clear that ri_RangeTableIndex being
> set to a dummy value isn't problematic.

Yep ... I misled myself.

> Yeah, we need to make sure that ExecPartitionCheck gets a slot whose
> TupleDesc matches the partition's. Actually we do have such dedicated
> slots for partitions around (for both sub-partitioned and leaf
> partitions),

Yeah, that's what I was looking for.

> so we can simply use them instead of creating one from
> scratch for every use. It did take some code rearrangement to
> actually make that work though.

Thanks. It doesn't look too bad ... I'd say it even looks easier to
read now in terms of code structure.

> Attached is the latest patch including those changes. Also, I have
> merged your earlier "fixes" patch and updated the isolation test to
> exercise a case with sub-partitioned default partition, as well as
> varying attribute numbers. Patch applies as-is to both HEAD and v13
> branches, but needs slight changes to apply to v12 branch, so also
> attaching one for that branch.

Thanks, will dispatch it shortly.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#13Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Langote (#11)
2 attachment(s)
Re: default partition and concurrent attach partition

Hi,

Andres added to CC because of the TTS interface: apparently calling
slot_getallattrs() on a virtual slot raises the error "getsomeattrs is
not required to be called on a virtual tuple table slot". I'm thinking
that this exposes implementation details that a caller should not need
to know; patch 0001 fixes that at the problematic caller by making the
slot_getallattrs() call conditional on the slot not being virtual, but
I wonder if the better fix isn't to remove the elog(ERROR) at
tts_virtual_getsomeattrs.
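
For reference, here is the callback that raises that error, paraphrased
from src/backend/executor/execTuples.c (a sketch for readers of the
archive, not part of the patches):

/*
 * Virtual slots keep all attribute values directly in tts_values and
 * tts_isnull, so there is never anything to deform; this callback
 * exists only to complain if it is ever reached.
 */
static void
tts_virtual_getsomeattrs(TupleTableSlot *slot, int natts)
{
	elog(ERROR, "getsomeattrs is not required to be called on a virtual tuple table slot");
}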

Moving on from that -- this is a version of Amit's previous patch that I
like better. I think the "prev_myslot" thing was a bit ugly, but it
turns out that with the above fix we can clear the old slot before
filling the new one, making things more sensible. I also changed the
"do {} while ()" into a simple "while {}" loop, which works because the
condition is always true on loop entry. There is some minor comment
rewording as well.
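
The slot handling in question boils down to this pattern (a condensed
paraphrase of the relevant hunk in 0002 below, not the verbatim code):

/*
 * When descending to a sub-partitioned table whose layout differs,
 * release the previous parent's dedicated slot first, then convert
 * the tuple into the new parent's slot; that way at most one
 * dedicated slot holds a tuple at any given time.
 */
if (dispatch->tupslot)
{
	if (myslot != NULL)
		ExecClearTuple(myslot);
	myslot = dispatch->tupslot;
	slot = execute_attr_map_slot(dispatch->tupmap, slot, myslot);
}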

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v3-0001-Don-t-getsomeattrs-on-virtual-slots.patch (text/x-diff; charset=iso-8859-1)
From ef139f89531ba15f480cbb64c2bddeee03dc20ab Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Tue, 8 Sep 2020 16:55:30 -0300
Subject: [PATCH v3 1/2] Don't "getsomeattrs" on virtual slots
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

execute_attr_map_slot() was not careful about not calling
slot_getallattrs() on tuple slots that could be virtual.  Repair.

Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
---
 src/backend/access/common/tupconvert.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/common/tupconvert.c b/src/backend/access/common/tupconvert.c
index 3cb0cbefaa..d440c1201a 100644
--- a/src/backend/access/common/tupconvert.c
+++ b/src/backend/access/common/tupconvert.c
@@ -193,7 +193,8 @@ execute_attr_map_slot(AttrMap *attrMap,
 	outnatts = out_slot->tts_tupleDescriptor->natts;
 
 	/* Extract all the values of the in slot. */
-	slot_getallattrs(in_slot);
+	if (!TTS_IS_VIRTUAL(in_slot))
+		slot_getallattrs(in_slot);
 
 	/* Before doing the mapping, clear any old contents from the out slot */
 	ExecClearTuple(out_slot);
-- 
2.20.1

v3-0002-Check-default-partitions-constraints-while-descen.patch (text/x-diff; charset=iso-8859-1)
From e5a2b5ddbbb55dd9a83474f3f55b8f8724a3a63e Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 3 Sep 2020 13:18:35 -0400
Subject: [PATCH v3 2/2] Check default partitions constraints while descending
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Partitioning tuple route code assumes that the partition chosen while
descending the partition hierarchy is always the correct one.  This is
true except when the partition is the default partition and another
partition has been added concurrently: the partition constraint changes
and we don't recheck it.  This can lead to tuples mistakenly being added
to the default partition that should have been rejected.

Fix by rechecking the default partition constraint while descending the
hierarchy.

An isolation test based on the reproduction steps described by Hao Wu
(with tweaks for extra coverage) is included.

Reported by: Hao Wu <hawu@vmware.com>
Author: Amit Langote <amitlangote09@gmail.com>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/CA+HiwqFqBmcSSap4sFnCBUEL_VfOMmEKaQ3gwUhyfa4c7J_-nA@mail.gmail.com
Discussion: https://postgr.es/m/DM5PR0501MB3910E97A9EDFB4C775CF3D75A42F0@DM5PR0501MB3910.namprd05.prod.outlook.com
---
 src/backend/executor/execPartition.c          | 127 ++++++++++++++----
 .../expected/partition-concurrent-attach.out  |  49 +++++++
 src/test/isolation/isolation_schedule         |   1 +
 .../specs/partition-concurrent-attach.spec    |  43 ++++++
 4 files changed, 195 insertions(+), 25 deletions(-)
 create mode 100644 src/test/isolation/expected/partition-concurrent-attach.out
 create mode 100644 src/test/isolation/specs/partition-concurrent-attach.spec

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 79fcbd6b06..bc2372959c 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -51,6 +51,11 @@
  *		PartitionDispatchData->indexes for details on how this array is
  *		indexed.
  *
+ * nonleaf_partitions
+ *		Array of 'max_dispatch' elements containing pointers to fake
+ *		ResultRelInfo objects for nonleaf partitions, useful for checking
+ *		the partition constraint.
+ *
  * num_dispatch
  *		The current number of items stored in the 'partition_dispatch_info'
  *		array.  Also serves as the index of the next free array element for
@@ -89,6 +94,7 @@ struct PartitionTupleRouting
 {
 	Relation	partition_root;
 	PartitionDispatch *partition_dispatch_info;
+	ResultRelInfo **nonleaf_partitions;
 	int			num_dispatch;
 	int			max_dispatch;
 	ResultRelInfo **partitions;
@@ -280,9 +286,11 @@ ExecFindPartition(ModifyTableState *mtstate,
 	PartitionDispatch dispatch;
 	PartitionDesc partdesc;
 	ExprContext *ecxt = GetPerTupleExprContext(estate);
-	TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
+	TupleTableSlot *ecxt_scantuple_saved = ecxt->ecxt_scantuple;
+	TupleTableSlot *rootslot = slot;
 	TupleTableSlot *myslot = NULL;
 	MemoryContext oldcxt;
+	ResultRelInfo *rri = NULL;
 
 	/* use per-tuple context here to avoid leaking memory */
 	oldcxt = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
@@ -296,9 +304,8 @@ ExecFindPartition(ModifyTableState *mtstate,
 
 	/* start with the root partitioned table */
 	dispatch = pd[0];
-	while (true)
+	while (dispatch != NULL)
 	{
-		AttrMap    *map = dispatch->tupmap;
 		int			partidx = -1;
 
 		CHECK_FOR_INTERRUPTS();
@@ -306,17 +313,6 @@ ExecFindPartition(ModifyTableState *mtstate,
 		rel = dispatch->reldesc;
 		partdesc = dispatch->partdesc;
 
-		/*
-		 * Convert the tuple to this parent's layout, if different from the
-		 * current relation.
-		 */
-		myslot = dispatch->tupslot;
-		if (myslot != NULL)
-		{
-			Assert(map != NULL);
-			slot = execute_attr_map_slot(map, slot, myslot);
-		}
-
 		/*
 		 * Extract partition key from tuple. Expression evaluation machinery
 		 * that FormPartitionKeyDatum() invokes expects ecxt_scantuple to
@@ -352,11 +348,9 @@ ExecFindPartition(ModifyTableState *mtstate,
 
 		if (partdesc->is_leaf[partidx])
 		{
-			ResultRelInfo *rri;
-
 			/*
-			 * Look to see if we've already got a ResultRelInfo for this
-			 * partition.
+			 * We've reached the leaf -- hurray, we're done.  Look to see if
+			 * we've already got a ResultRelInfo for this partition.
 			 */
 			if (likely(dispatch->indexes[partidx] >= 0))
 			{
@@ -400,14 +394,10 @@ ExecFindPartition(ModifyTableState *mtstate,
 												dispatch,
 												rootResultRelInfo, partidx);
 			}
+			Assert(rri != NULL);
 
-			/* Release the tuple in the lowest parent's dedicated slot. */
-			if (slot == myslot)
-				ExecClearTuple(myslot);
-
-			MemoryContextSwitchTo(oldcxt);
-			ecxt->ecxt_scantuple = ecxt_scantuple_old;
-			return rri;
+			/* Signal to terminate the loop */
+			dispatch = NULL;
 		}
 		else
 		{
@@ -419,6 +409,8 @@ ExecFindPartition(ModifyTableState *mtstate,
 				/* Already built. */
 				Assert(dispatch->indexes[partidx] < proute->num_dispatch);
 
+				rri = proute->nonleaf_partitions[dispatch->indexes[partidx]];
+
 				/*
 				 * Move down to the next partition level and search again
 				 * until we find a leaf partition that matches this tuple
@@ -440,10 +432,75 @@ ExecFindPartition(ModifyTableState *mtstate,
 															dispatch, partidx);
 				Assert(dispatch->indexes[partidx] >= 0 &&
 					   dispatch->indexes[partidx] < proute->num_dispatch);
+
+				rri = proute->nonleaf_partitions[dispatch->indexes[partidx]];
 				dispatch = subdispatch;
 			}
+
+			/*
+			 * Convert the tuple to the new parent's layout, if different from
+			 * the previous parent.
+			 */
+			if (dispatch->tupslot)
+			{
+				AttrMap    *map = dispatch->tupmap;
+
+				/* Clear previous parent's tuple if any. */
+				if (myslot != NULL)
+					ExecClearTuple(myslot);
+
+				myslot = dispatch->tupslot;
+				slot = execute_attr_map_slot(map, slot, myslot);
+			}
+		}
+
+		/*
+		 * If this partition is the default one, we must check its partition
+		 * constraint now, which may have changed concurrently due to
+		 * partitions being added to the parent.
+		 *
+		 * (We do this here, and do not rely on ExecInsert doing it, because
+		 * we don't want to miss doing it for non-leaf partitions.)
+		 */
+		if (partidx == partdesc->boundinfo->default_index)
+		{
+			PartitionRoutingInfo *partrouteinfo = rri->ri_PartitionInfo;
+
+			/*
+			 * The tuple must match the partition's layout for the constraint
+			 * expression to be evaluated successfully.  If the partition is
+			 * sub-partitioned, that would already be the case due to the code
+			 * above, but for a leaf partition the tuple still matches the
+			 * parent's layout.
+			 *
+			 * Note that we have a map to convert from root to current
+			 * partition, but not from immediate parent to current partition.
+			 * So if we have to convert, do it from the root slot; if not, use
+			 * the root slot as-is.
+			 */
+			if (partrouteinfo)
+			{
+				TupleConversionMap *map = partrouteinfo->pi_RootToPartitionMap;
+
+				if (map)
+					slot = execute_attr_map_slot(map->attrMap, rootslot,
+												 partrouteinfo->pi_PartitionTupleSlot);
+				else
+					slot = rootslot;
+			}
+
+			ExecPartitionCheck(rri, slot, estate, true);
 		}
 	}
+
+	/* Release the tuple in the lowest parent's dedicated slot. */
+	if (myslot != NULL)
+		ExecClearTuple(myslot);
+	/* and restore ecxt's scantuple */
+	ecxt->ecxt_scantuple = ecxt_scantuple_saved;
+	MemoryContextSwitchTo(oldcxt);
+
+	return rri;
 }
 
 /*
@@ -1060,6 +1117,8 @@ ExecInitPartitionDispatchInfo(EState *estate,
 			proute->max_dispatch = 4;
 			proute->partition_dispatch_info = (PartitionDispatch *)
 				palloc(sizeof(PartitionDispatch) * proute->max_dispatch);
+			proute->nonleaf_partitions = (ResultRelInfo **)
+				palloc(sizeof(ResultRelInfo *) * proute->max_dispatch);
 		}
 		else
 		{
@@ -1067,10 +1126,28 @@ ExecInitPartitionDispatchInfo(EState *estate,
 			proute->partition_dispatch_info = (PartitionDispatch *)
 				repalloc(proute->partition_dispatch_info,
 						 sizeof(PartitionDispatch) * proute->max_dispatch);
+			proute->nonleaf_partitions = (ResultRelInfo **)
+				repalloc(proute->nonleaf_partitions,
+						 sizeof(ResultRelInfo *) * proute->max_dispatch);
 		}
 	}
 	proute->partition_dispatch_info[dispatchidx] = pd;
 
+	/*
+	 * If setting up a PartitionDispatch for a sub-partitioned table, we may
+	 * also need a minimally valid ResultRelInfo for checking the partition
+	 * constraint later; set that up now.
+	 */
+	if (parent_pd)
+	{
+		ResultRelInfo *rri = makeNode(ResultRelInfo);
+
+		InitResultRelInfo(rri, rel, 1, proute->partition_root, 0);
+		proute->nonleaf_partitions[dispatchidx] = rri;
+	}
+	else
+		proute->nonleaf_partitions[dispatchidx] = NULL;
+
 	/*
 	 * Finally, if setting up a PartitionDispatch for a sub-partitioned table,
 	 * install a downlink in the parent to allow quick descent.
diff --git a/src/test/isolation/expected/partition-concurrent-attach.out b/src/test/isolation/expected/partition-concurrent-attach.out
new file mode 100644
index 0000000000..17fac39989
--- /dev/null
+++ b/src/test/isolation/expected/partition-concurrent-attach.out
@@ -0,0 +1,49 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1b s1a s2b s2i s1c s2c s2s
+step s1b: begin;
+step s1a: alter table tpart attach partition tpart_2 for values from (100) to (200);
+step s2b: begin;
+step s2i: insert into tpart values (110,'xxx'), (120, 'yyy'), (150, 'zzz'); <waiting ...>
+step s1c: commit;
+step s2i: <... completed>
+error in steps s1c s2i: ERROR:  new row for relation "tpart_default" violates partition constraint
+step s2c: commit;
+step s2s: select tableoid::regclass, * from tpart;
+tableoid       i              j              
+
+tpart_2        110            xxx            
+tpart_2        120            yyy            
+tpart_2        150            zzz            
+
+starting permutation: s1b s1a s2b s2i2 s1c s2c s2s
+step s1b: begin;
+step s1a: alter table tpart attach partition tpart_2 for values from (100) to (200);
+step s2b: begin;
+step s2i2: insert into tpart_default (i, j) values (110, 'xxx'), (120, 'yyy'), (150, 'zzz'); <waiting ...>
+step s1c: commit;
+step s2i2: <... completed>
+error in steps s1c s2i2: ERROR:  new row for relation "tpart_default" violates partition constraint
+step s2c: commit;
+step s2s: select tableoid::regclass, * from tpart;
+tableoid       i              j              
+
+tpart_2        110            xxx            
+tpart_2        120            yyy            
+tpart_2        150            zzz            
+
+starting permutation: s1b s2b s2i s1a s2c s1c s2s
+step s1b: begin;
+step s2b: begin;
+step s2i: insert into tpart values (110,'xxx'), (120, 'yyy'), (150, 'zzz');
+step s1a: alter table tpart attach partition tpart_2 for values from (100) to (200); <waiting ...>
+step s2c: commit;
+step s1a: <... completed>
+error in steps s2c s1a: ERROR:  updated partition constraint for default partition "tpart_default_default" would be violated by some row
+step s1c: commit;
+step s2s: select tableoid::regclass, * from tpart;
+tableoid       i              j              
+
+tpart_default_default110            xxx            
+tpart_default_default120            yyy            
+tpart_default_default150            zzz            
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 6acbb695ec..aa386ab1a2 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -81,6 +81,7 @@ test: vacuum-skip-locked
 test: predicate-hash
 test: predicate-gist
 test: predicate-gin
+test: partition-concurrent-attach
 test: partition-key-update-1
 test: partition-key-update-2
 test: partition-key-update-3
diff --git a/src/test/isolation/specs/partition-concurrent-attach.spec b/src/test/isolation/specs/partition-concurrent-attach.spec
new file mode 100644
index 0000000000..48c3f83e0c
--- /dev/null
+++ b/src/test/isolation/specs/partition-concurrent-attach.spec
@@ -0,0 +1,43 @@
+# Verify that default partition constraint is enforced correctly
+# in light of partitions being added concurrently to its parent
+setup {
+  drop table if exists tpart;
+  create table tpart(i int, j text) partition by range(i);
+  create table tpart_1(like tpart);
+  create table tpart_2(like tpart);
+  create table tpart_default (a int, j text, i int) partition by list (j);
+  create table tpart_default_default (a int, i int, b int, j text);
+  alter table tpart_default_default drop b;
+  alter table tpart_default attach partition tpart_default_default default;
+  alter table tpart_default drop a;
+  alter table tpart attach partition tpart_default default;
+  alter table tpart attach partition tpart_1 for values from(0) to (100);
+  insert into tpart_2 values (110,'xxx'), (120, 'yyy'), (150, 'zzz');
+}
+
+session "s1"
+step "s1b"	{	begin; }
+step "s1a"	{	alter table tpart attach partition tpart_2 for values from (100) to (200); }
+step "s1c"	{	commit; }
+
+session "s2"
+step "s2b"	{	begin; }
+step "s2i"	{	insert into tpart values (110,'xxx'), (120, 'yyy'), (150, 'zzz'); }
+step "s2i2"	{	insert into tpart_default (i, j) values (110, 'xxx'), (120, 'yyy'), (150, 'zzz'); }
+step "s2c"	{	commit; }
+step "s2s"	{	select tableoid::regclass, * from tpart; }
+
+teardown	{	drop table tpart; }
+
+# insert into tpart by s2 which routes to tpart_default due to not seeing
+# concurrently added tpart_2 should fail, because the partition constraint
+# of tpart_default would have changed due to tpart_2 having been added
+permutation "s1b" "s1a" "s2b" "s2i" "s1c" "s2c" "s2s"
+
+# similar to above, but now insert into sub-partitioned tpart_default
+permutation "s1b" "s1a" "s2b" "s2i2" "s1c" "s2c" "s2s"
+
+# reverse: now the insert into tpart_default by s2 occurs first followed by
+# attach in s1, which should fail when it scans the leaf default partition to
+# find the violating rows
+permutation "s1b" "s2b" "s2i" "s1a" "s2c" "s1c" "s2s"
-- 
2.20.1

#14Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#13)
Re: default partition and concurrent attach partition

On 2020-Sep-08, Alvaro Herrera wrote:

> Andres added to CC because of the TTS interface: apparently calling
> slot_getallattrs() on a virtual slot raises the error "getsomeattrs is
> not required to be called on a virtual tuple table slot". I'm thinking
> that this exposes implementation details that a caller should not need
> to know; patch 0001 fixes that at the problematic caller by making the
> slot_getallattrs() call conditional on the slot not being virtual, but
> I wonder if the better fix isn't to remove the elog(ERROR) at
> tts_virtual_getsomeattrs.

Actually I misread the code, so this is all bogus. Nevermind ...
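
(To spell it out for the archives: as far as I can see,
slot_getallattrs() short-circuits when tts_nvalid already covers all
attributes, which is always the case for a populated virtual slot, so
the error path was never reachable from that caller. A paraphrase of
the inline functions in src/include/executor/tuptable.h:)

static inline void
slot_getsomeattrs(TupleTableSlot *slot, int attnum)
{
	/* already deformed up to attnum?  then there is nothing to do */
	if (slot->tts_nvalid < attnum)
		slot_getsomeattrs_int(slot, attnum);
}

static inline void
slot_getallattrs(TupleTableSlot *slot)
{
	slot_getsomeattrs(slot, slot->tts_tupleDescriptor->natts);
}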

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#15Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Langote (#11)
Re: default partition and concurrent attach partition

On 2020-Sep-08, Amit Langote wrote:

> Yeah, we need to make sure that ExecPartitionCheck gets a slot whose
> TupleDesc matches the partition's. Actually we do have such dedicated
> slots for partitions around (for both sub-partitioned and leaf
> partitions), so we can simply use them instead of creating one from
> scratch for every use. It did take some code rearrangement to
> actually make that work though.

Pushed, thanks for working on this fix.
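
With this in, the first permutation of the new isolation test boils
down to the following behavior (a condensed transcript, assuming the
tpart setup from the spec):

-- session 1 (transaction still open):
--   alter table tpart attach partition tpart_2
--     for values from (100) to (200);

-- session 2 blocks on the insert, then fails once session 1 commits:
insert into tpart values (110, 'xxx'), (120, 'yyy'), (150, 'zzz');
ERROR:  new row for relation "tpart_default" violates partition constraint

-- before the fix, these rows were silently routed into tpart_default
-- even though they belong in the newly attached tpart_2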

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#16Amit Langote
amitlangote09@gmail.com
In reply to: Alvaro Herrera (#15)
Re: default partition and concurrent attach partition

On Wed, Sep 9, 2020 at 7:41 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> On 2020-Sep-08, Amit Langote wrote:
>
> > Yeah, we need to make sure that ExecPartitionCheck gets a slot whose
> > TupleDesc matches the partition's. Actually we do have such dedicated
> > slots for partitions around (for both sub-partitioned and leaf
> > partitions), so we can simply use them instead of creating one from
> > scratch for every use. It did take some code rearrangement to
> > actually make that work though.
>
> Pushed, thanks for working on this fix.

Thanks.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com